xpk 0.4.0__tar.gz → 0.5.0__tar.gz

@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.1
2
2
  Name: xpk
3
- Version: 0.4.0
3
+ Version: 0.5.0
4
4
  Summary: xpk helps Cloud developers to orchestrate training jobs on accelerators on GKE.
5
5
  Author-email: Cloud TPU Team <cloud-tpu-eng@google.com>
6
6
  License: Apache-2.0
@@ -139,14 +139,6 @@ gcloud config set compute/zone $ZONE
139
139
  xpk .. --zone $ZONE --project $PROJECT_ID
140
140
  ```
141
141
 
142
- `Cluster Create` command will create a project-specific Service Account. Note that only one service
143
- account will be created per project. This service account will be attached to the node pools instead of default
144
- [Compute Engine Service Account](https://cloud.google.com/compute/docs/access/service-accounts#default_service_account).
145
- All the required permissions will be assigned to this service account by XPK. Make sure you have
146
- [Service Account Admin](https://cloud.google.com/iam/docs/understanding-roles#iam.serviceAccountAdmin) and
147
- [Project IAM Admin](https://cloud.google.com/iam/docs/understanding-roles#resourcemanager.projectIamAdmin)
148
- roles assigned to your user account.
149
-
150
142
  The cluster created is a regional cluster to enable the GKE control plane across
151
143
  all zones.
152
144
 
@@ -226,7 +218,9 @@ all zones.
226
218
  ```
227
219
 
228
220
  ### Create Vertex AI Tensorboard
229
- *Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature.*
221
+ *Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature. Make sure you have
222
+ [Vertex AI Administrator](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.admin) role
223
+ assigned to your user account.*
230
224
 
231
225
  Vertex AI Tensorboard is a fully managed version of open-source Tensorboard. To learn more about Vertex AI Tensorboard, visit [this](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-introduction). Note that Vertex AI Tensorboard is only available in [these](https://cloud.google.com/vertex-ai/docs/general/locations#available-regions) regions.
232
226
 
@@ -386,7 +380,9 @@ checkpointing so the job restarts near where it was interrupted.
386
380
  ```
387
381
 
388
382
  ### Create Vertex AI Experiment to upload data to Vertex AI Tensorboard
389
- *Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature.*
383
+ *Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature. Make sure you have
384
+ [Vertex AI Administrator](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.admin) role
385
+ assigned to your user account and to the [Compute Engine Service account](https://cloud.google.com/compute/docs/access/service-accounts#default_service_account) attached to the node pools in the cluster.*
390
386
 
391
387
  Vertex AI Experiment is a tool that helps to track and analyze an experiment run on Vertex AI Tensorboard. To learn more about Vertex AI Experiments, visit [this](https://cloud.google.com/vertex-ai/docs/experiments/intro-vertex-ai-experiments).
392
388
 
@@ -120,14 +120,6 @@ gcloud config set compute/zone $ZONE
120
120
  xpk .. --zone $ZONE --project $PROJECT_ID
121
121
  ```
122
122
 
123
- `Cluster Create` command will create a project-specific Service Account. Note that only one service
124
- account will be created per project. This service account will be attached to the node pools instead of default
125
- [Compute Engine Service Account](https://cloud.google.com/compute/docs/access/service-accounts#default_service_account).
126
- All the required permissions will be assigned to this service account by XPK. Make sure you have
127
- [Service Account Admin](https://cloud.google.com/iam/docs/understanding-roles#iam.serviceAccountAdmin) and
128
- [Project IAM Admin](https://cloud.google.com/iam/docs/understanding-roles#resourcemanager.projectIamAdmin)
129
- roles assigned to your user account.
130
-
131
123
  The cluster created is a regional cluster to enable the GKE control plane across
132
124
  all zones.
133
125
 
@@ -207,7 +199,9 @@ all zones.
207
199
  ```
208
200
 
209
201
  ### Create Vertex AI Tensorboard
210
- *Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature.*
202
+ *Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature. Make sure you have
203
+ [Vertex AI Administrator](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.admin) role
204
+ assigned to your user account.*
211
205
 
212
206
  Vertex AI Tensorboard is a fully managed version of open-source Tensorboard. To learn more about Vertex AI Tensorboard, visit [this](https://cloud.google.com/vertex-ai/docs/experiments/tensorboard-introduction). Note that Vertex AI Tensorboard is only available in [these](https://cloud.google.com/vertex-ai/docs/general/locations#available-regions) regions.
213
207
 
@@ -367,7 +361,9 @@ checkpointing so the job restarts near where it was interrupted.
367
361
  ```
368
362
 
369
363
  ### Create Vertex AI Experiment to upload data to Vertex AI Tensorboard
370
- *Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature.*
364
+ *Note: This feature is available in XPK >= 0.4.0. Enable [Vertex AI API](https://cloud.google.com/vertex-ai/docs/start/cloud-environment#enable_vertexai_apis) in your Google Cloud console to use this feature. Make sure you have
365
+ [Vertex AI Administrator](https://cloud.google.com/vertex-ai/docs/general/access-control#aiplatform.admin) role
366
+ assigned to your user account and to the [Compute Engine Service account](https://cloud.google.com/compute/docs/access/service-accounts#default_service_account) attached to the node pools in the cluster.*
371
367
 
372
368
  Vertex AI Experiment is a tool that helps to track and analyze an experiment run on Vertex AI Tensorboard. To learn more about Vertex AI Experiments, visit [this](https://cloud.google.com/vertex-ai/docs/experiments/intro-vertex-ai-experiments).
373
369
 
@@ -67,9 +67,8 @@ if (
67
67
 
68
68
  default_docker_image = 'python:3.10'
69
69
  default_script_dir = os.getcwd()
70
- default_gke_version = '1.29.1-gke.1589017'
71
70
  # This is the version for XPK PyPI package
72
- __version__ = '0.4.0'
71
+ __version__ = '0.5.0'
73
72
  xpk_current_version = __version__
74
73
 
75
74
  h100_device_type = 'h100-80gb-8'
@@ -85,10 +84,7 @@ _LOCAL_QUEUE_NAME = 'multislice-queue'
85
84
  _DEFAULT_POOL_NAME = 'default-pool'
86
85
  _CLUSTER_RESOURCES_CONFIGMAP = 'resources-configmap'
87
86
  _CLUSTER_METADATA_CONFIGMAP = 'metadata-configmap'
88
- _XPK_SERVICE_ACCOUNT = 'xpk-sa'
89
- # Set to True to attach a service account to cluster & node pools
90
- _SERVICE_ACCOUNT_FEATURE_FLAG = xpk_current_version >= '0.4.0'
91
- _VERTEX_TENSORBOARD_FEATURE_FLAG = _SERVICE_ACCOUNT_FEATURE_FLAG
87
+ _VERTEX_TENSORBOARD_FEATURE_FLAG = xpk_current_version >= '0.4.0'
92
88
  _DEFAULT_VERTEX_TENSORBOARD_NAME = 'tb-instance'
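The feature flag in this hunk compares version strings directly. Python compares strings lexicographically, so `xpk_current_version >= '0.4.0'` holds for 0.5.0 but would not hold for a hypothetical 0.10.0. A minimal sketch of a tuple-based comparison follows; `parse_version` is a hypothetical helper, not part of xpk, and it assumes purely numeric dotted versions.

```python
def parse_version(version: str) -> tuple:
    """Hypothetical helper: turn '0.10.0' into (0, 10, 0)."""
    return tuple(int(part) for part in version.split('.'))


# Lexicographic string comparison: '1' sorts before '4', so this is False.
string_check = '0.10.0' >= '0.4.0'

# Numeric tuple comparison: 10 > 4, so this is True.
tuple_check = parse_version('0.10.0') >= parse_version('0.4.0')

print(string_check, tuple_check)
```

For the versions in this diff (0.4.0 and 0.5.0) both comparisons agree, so the flag behaves as intended today.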
93
89
 
94
90
 
@@ -264,11 +260,12 @@ spec:
264
260
  labels:
265
261
  xpk.google.com/workload: {args.workload}
266
262
  spec:
267
- backoffLimit: 0
263
+ backoffLimit: 4
268
264
  completions: {system.vms_per_slice}
269
265
  parallelism: {system.vms_per_slice}
270
266
  template:
271
267
  spec:
268
+ terminationGracePeriodSeconds: {args.termination_grace_period_seconds}
272
269
  containers:
273
270
  - args:
274
271
  {pathways_worker_args}
@@ -611,7 +608,6 @@ management:
611
608
  autoprovisioningLocations:
612
609
  {zones}
613
610
  {resource_limits}
614
- {service_account}
615
611
  """
616
612
 
617
613
  autoprovisioning_resource_limits = """
@@ -629,16 +625,6 @@ autoprovisioning_custom_resource_type = """
629
625
  maximum: {maximum}
630
626
  """
631
627
 
632
- # Add IAM roles to attach to service account used by node pools in the cluster
633
- IAMRoles = {
634
- 'Kubernetes Engine Admin': 'roles/container.admin',
635
- 'Artifact Registry Writer': 'roles/artifactregistry.writer',
636
- 'Monitoring Admin': 'roles/monitoring.admin',
637
- 'Logging Admin': 'roles/logging.admin',
638
- 'Storage Admin': 'roles/storage.admin',
639
- 'Vertex AI Administrator': 'roles/aiplatform.admin',
640
- }
641
-
642
628
 
643
629
  AcceleratorType = {'TPU': 1, 'GPU': 2, 'CPU': 3}
644
630
 
@@ -1891,6 +1877,14 @@ the corresponding Map in MaxText/accelerator_to_spec_map.py """
1891
1877
  # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1892
1878
 
1893
1879
 
1880
+ PathwaysExpectedInstancesMap = {
1881
+ 'v5p': 'v5',
1882
+ 'v5litepod': 'v5e',
1883
+ 'v4': 'v4',
1884
+ 'v3': 'v3',
1885
+ }
1886
+
1887
+
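The diff adds `PathwaysExpectedInstancesMap` but this hunk does not show where it is read. As an illustration only, the sketch below assumes the keys are matched against the part of a device type before the first `-` (for example `v5litepod-16`); `pathways_instance_name` is a hypothetical helper, not xpk code.

```python
PathwaysExpectedInstancesMap = {
    'v5p': 'v5',
    'v5litepod': 'v5e',
    'v4': 'v4',
    'v3': 'v3',
}


def pathways_instance_name(device_type: str) -> str:
    """Hypothetical lookup: use the text before the first '-' as the key."""
    family = device_type.split('-')[0]
    return PathwaysExpectedInstancesMap[family]


print(pathways_instance_name('v5litepod-16'))
```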
1894
1888
  def chunks(lst, n):
1895
1889
  """Return a list of n-sized chunks from lst.
1896
1890
 
@@ -2162,7 +2156,12 @@ def append_temporary_file(payload, file):
2162
2156
 
2163
2157
 
2164
2158
  def run_command_for_value(
2165
- command, task, global_args, dry_run_return_val='0', print_timer=False
2159
+ command,
2160
+ task,
2161
+ global_args,
2162
+ dry_run_return_val='0',
2163
+ print_timer=False,
2164
+ hide_error=False,
2166
2165
  ) -> tuple[int, str]:
2167
2166
  """Runs the command and returns the error code and stdout.
2168
2167
 
@@ -2173,6 +2172,8 @@ def run_command_for_value(
2173
2172
  task: user provided task name for running the command.
2174
2173
  global_args: user provided arguments for running the command.
2175
2174
  dry_run_return_val: return value of this command for dry run.
2175
+ print_timer: print out the time the command is running.
2176
+ hide_error: hide the error from the command output upon success.
2176
2177
 
2177
2178
  Returns:
2178
2179
  tuple[int, str]
@@ -2216,7 +2217,7 @@ def run_command_for_value(
2216
2217
  output = subprocess.check_output(
2217
2218
  command,
2218
2219
  shell=True,
2219
- stderr=subprocess.STDOUT,
2220
+ stderr=subprocess.STDOUT if not hide_error else None,
2220
2221
  )
2221
2222
  except subprocess.CalledProcessError as e:
2222
2223
  xpk_print(f'Task {task} failed with {e.returncode}')
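The `hide_error` flag changes where the child's stderr goes. With `stderr=subprocess.STDOUT`, error text is merged into the captured output; with `stderr=None`, `check_output` captures only stdout and stderr passes through to the parent's stderr. A small sketch of the difference, using a hypothetical `capture` wrapper and a child Python process as the command (neither is part of xpk):

```python
import subprocess
import sys

# Child process that writes one line to stdout and one to stderr.
CHILD = [
    sys.executable,
    '-c',
    "import sys; print('value'); print('warning', file=sys.stderr)",
]


def capture(hide_error: bool) -> str:
    """Hypothetical wrapper mirroring the stderr choice in the diff."""
    output = subprocess.check_output(
        CHILD,
        stderr=subprocess.STDOUT if not hide_error else None,
    )
    return output.decode()


merged = capture(hide_error=False)      # stdout and stderr both captured
stdout_only = capture(hide_error=True)  # stderr is not part of the result
```

This matters when the captured text is parsed as a value, as in the version lookups later in this diff: a warning printed on stderr would otherwise end up inside the parsed string.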
@@ -2397,142 +2398,6 @@ def zone_to_region(zone) -> str:
2397
2398
  return zone_terms[0] + '-' + zone_terms[1]
2398
2399
 
2399
2400
 
2400
- def get_service_account_name(args) -> str:
2401
- """Get the name for the service account.
2402
- Args:
2403
- args: user provided arguments.
2404
-
2405
- Returns:
2406
- the name of the service account.
2407
- """
2408
- return f'{args.project}-{_XPK_SERVICE_ACCOUNT}@{args.project}.iam.gserviceaccount.com'
2409
-
2410
-
2411
- def check_if_service_account_exists(args) -> bool:
2412
- """Check if a service account with the given name exists in the project.
2413
-
2414
- Args:
2415
- args: user provided arguments for running the command.
2416
-
2417
- Returns:
2418
- True if service account exist, False otherwise.
2419
- """
2420
- service_account_name = get_service_account_name(args)
2421
- command = f'gcloud iam service-accounts describe {service_account_name}'
2422
- return_code = run_command_with_updates(
2423
- command, 'Service Account Describe', args, verbose=False
2424
- )
2425
- if return_code != 0:
2426
- xpk_print(
2427
- 'Service Account Describe did not find the service account'
2428
- f' {service_account_name}.'
2429
- )
2430
- return False
2431
- return True
2432
-
2433
-
2434
- def create_service_account(args) -> int:
2435
- """Creates a service account in the project.
2436
-
2437
- Args:
2438
- args: user provided arguments for running the command.
2439
-
2440
- Returns:
2441
- 0 if successful and 1 otherwise.
2442
- """
2443
- command = (
2444
- 'gcloud iam service-accounts create'
2445
- f' {args.project}-{_XPK_SERVICE_ACCOUNT} --description="Service'
2446
- ' Account for XPK" '
2447
- f' --display-name="{args.project}-{_XPK_SERVICE_ACCOUNT}"'
2448
- )
2449
- return_code = run_command_with_updates(
2450
- command, 'Service Account Create', args
2451
- )
2452
- if return_code != 0:
2453
- xpk_print(f'Service Account Create request returned ERROR {return_code}')
2454
- xpk_print(
2455
- 'Make sure you have Service Account Admin Role attached to your user.'
2456
- )
2457
- return 1
2458
- return 0
2459
-
2460
-
2461
- def get_existing_roles_in_service_account(args) -> set:
2462
- """
2463
- Args:
2464
- args: user provided arguments for running the command.
2465
-
2466
- Returns:
2467
- set of IAM roles already attached to the service account.
2468
- """
2469
- roles = set()
2470
- service_account_name = get_service_account_name(args)
2471
- command = (
2472
- f'gcloud projects get-iam-policy {args.project}'
2473
- f' --filter="bindings.members:{service_account_name}"'
2474
- ' --flatten="bindings[].members" --format="table(bindings.role)"'
2475
- )
2476
- return_code, return_value = run_command_for_value(
2477
- command, 'Get IAM Roles For Service Account', args
2478
- )
2479
- if return_code != 0:
2480
- xpk_print(
2481
- 'Get IAM Roles For Service Account request returned ERROR'
2482
- f' {return_code}'
2483
- )
2484
- else:
2485
- return_value = return_value.strip()
2486
- roles = set(return_value.split('\n'))
2487
- """Format of return_value is:
2488
- ROLE
2489
- roles/storage.admin
2490
- roles/logging.admin
2491
- removing `ROLE` from the list
2492
- """
2493
- if 'ROLE' in roles:
2494
- roles.remove('ROLE')
2495
- return roles
2496
-
2497
-
2498
- def add_roles_to_service_account(args) -> int:
2499
- """Add IAM roles to service account.
2500
-
2501
- Args:
2502
- args: user provided arguments for running the command.
2503
-
2504
- Returns:
2505
- 0 if successful and 1 otherwise.
2506
- """
2507
- service_account_name = get_service_account_name(args)
2508
- existing_roles = get_existing_roles_in_service_account(args)
2509
- xpk_print(f'IAM roles already attached to service account: {existing_roles}')
2510
-
2511
- for name, role in IAMRoles.items():
2512
- if role in existing_roles:
2513
- continue
2514
-
2515
- xpk_print(f'Adding {name} role to service account: {service_account_name}.')
2516
- command = (
2517
- f'gcloud projects add-iam-policy-binding {args.project} '
2518
- f' --member="serviceAccount:{service_account_name}" '
2519
- f' --role="{role}" --condition=None'
2520
- )
2521
- return_code = run_command_with_updates(
2522
- command, 'Add IAM Role to Service Account', args, verbose=False
2523
- )
2524
- if return_code != 0:
2525
- xpk_print(
2526
- 'Add IAM Role to Service Account request returned ERROR'
2527
- f' {return_code}'
2528
- )
2529
- xpk_print(
2530
- 'Make sure you have Project IAM Admin Role attached to your user.'
2531
- )
2532
- return 1
2533
- return 0
2534
-
2535
-
2536
2401
  def get_total_chips_requested_from_args(
2537
2402
  args, system: SystemCharacteristics
2538
2403
  ) -> int:
@@ -2631,16 +2496,7 @@ def create_autoprovisioning_config(
2631
2496
  custom_resource_type=custom_resource_string,
2632
2497
  )
2633
2498
 
2634
- # Default service_account is the project's default service account.
2635
- service_account = ''
2636
- if _SERVICE_ACCOUNT_FEATURE_FLAG:
2637
- service_account_name = get_service_account_name(args)
2638
- service_account_exists = check_if_service_account_exists(args)
2639
- if service_account_exists:
2640
- service_account = f'serviceAccount: {service_account_name}'
2641
-
2642
2499
  yml_string = autoprovisioning_config_file.format(
2643
- service_account=service_account,
2644
2500
  resource_limits=resource_limits,
2645
2501
  zones=f'- {args.zone}',
2646
2502
  )
@@ -2743,11 +2599,12 @@ def enable_autoprovisioning_on_cluster(
2743
2599
  return autoprovisioning_config, return_code
2744
2600
 
2745
2601
 
2746
- def run_gke_cluster_create_command(args) -> int:
2602
+ def run_gke_cluster_create_command(args, gke_control_plane_version: str) -> int:
2747
2603
  """Run the Create GKE Cluster request.
2748
2604
 
2749
2605
  Args:
2750
2606
  args: user provided arguments for running the command.
2607
+ gke_control_plane_version: version used if creating the cluster.
2751
2608
 
2752
2609
  Returns:
2753
2610
  0 if successful and 1 otherwise.
@@ -2772,7 +2629,7 @@ def run_gke_cluster_create_command(args) -> int:
2772
2629
  f' {args.cluster} --project={args.project}'
2773
2630
  f' --region={zone_to_region(args.zone)}'
2774
2631
  f' --node-locations={args.zone}'
2775
- f' --cluster-version={args.gke_version}'
2632
+ f' --cluster-version={gke_control_plane_version}'
2776
2633
  f' --machine-type={machine_type}'
2777
2634
  ' --enable-autoscaling'
2778
2635
  ' --total-min-nodes 1 --total-max-nodes 1000'
@@ -2780,17 +2637,6 @@ def run_gke_cluster_create_command(args) -> int:
2780
2637
  f' {args.custom_cluster_arguments}'
2781
2638
  )
2782
2639
 
2783
- if _SERVICE_ACCOUNT_FEATURE_FLAG:
2784
- service_account_name = get_service_account_name(args)
2785
- service_account_exists = check_if_service_account_exists(args)
2786
- if service_account_exists:
2787
- command += f' --service-account={service_account_name}'
2788
- else:
2789
- xpk_print(
2790
- f'Service Account: {service_account_name} does not exist in the'
2791
- ' project. Will attach the default service account to the cluster.'
2792
- )
2793
-
2794
2640
  device_type = args.tpu_type if args.tpu_type else args.device_type
2795
2641
  if device_type == h100_device_type:
2796
2642
  command += (
@@ -3376,11 +3222,12 @@ def get_all_clusters_programmatic(args) -> tuple[list[str], int]:
3376
3222
  return cluster_names, 0
3377
3223
 
3378
3224
 
3379
- def create_cluster_if_necessary(args) -> int:
3225
+ def create_cluster_if_necessary(args, gke_control_plane_version: str) -> int:
3380
3226
  """Creates cluster if not present in the project.
3381
3227
 
3382
3228
  Args:
3383
3229
  args: user provided arguments for running the command.
3230
+ gke_control_plane_version: version used if creating the cluster.
3384
3231
 
3385
3232
  Returns:
3386
3233
  0 if successful and 1 otherwise.
@@ -3390,10 +3237,10 @@ def create_cluster_if_necessary(args) -> int:
3390
3237
  xpk_print('Listing all clusters failed!')
3391
3238
  return 1
3392
3239
  if args.cluster in all_clusters:
3393
- xpk_print('Skipping cluster creation since it already exists')
3240
+ xpk_print('Skipping cluster creation since it already exists.')
3394
3241
  return 0
3395
3242
  else:
3396
- return run_gke_cluster_create_command(args)
3243
+ return run_gke_cluster_create_command(args, gke_control_plane_version)
3397
3244
 
3398
3245
 
3399
3246
  def get_all_nodepools_programmatic(args) -> tuple[list[str], int]:
@@ -3499,12 +3346,15 @@ def get_user_input(input_msg):
3499
3346
  return user_input in ('y', 'yes')
3500
3347
 
3501
3348
 
3502
- def run_gke_node_pool_create_command(args, system) -> int:
3349
+ def run_gke_node_pool_create_command(
3350
+ args, system, gke_node_pool_version
3351
+ ) -> int:
3503
3352
  """Run the Create GKE Node Pool request.
3504
3353
 
3505
3354
  Args:
3506
3355
  args: user provided arguments for running the command.
3507
3356
  system: System characteristics based on device type/topology.
3357
+ gke_node_pool_version: GKE version to use to create node pools.
3508
3358
 
3509
3359
  Returns:
3510
3360
  0 if successful and 1 otherwise.
@@ -3574,10 +3424,12 @@ def run_gke_node_pool_create_command(args, system) -> int:
3574
3424
  f' {args.custom_nodepool_arguments}'
3575
3425
  )
3576
3426
  if system.accelerator_type == AcceleratorType['TPU']:
3577
- command += f' --node-version={args.gke_version}'
3427
+ command += f' --node-version={gke_node_pool_version}'
3578
3428
  command += f' --num-nodes={system.vms_per_slice}'
3579
3429
  command += ' --placement-type=COMPACT --max-pods-per-node 15'
3580
- command += ' --scopes=storage-full,gke-default'
3430
+ command += (
3431
+ ' --scopes=storage-full,gke-default,"https://www.googleapis.com/auth/cloud-platform"'
3432
+ )
3581
3433
  command += f' --tpu-topology={system.topology}'
3582
3434
  command += f' {args.custom_tpu_nodepool_arguments}'
3583
3435
  elif system.accelerator_type == AcceleratorType['GPU']:
@@ -3601,17 +3453,6 @@ def run_gke_node_pool_create_command(args, system) -> int:
3601
3453
  command += f' --num-nodes={system.vms_per_slice}'
3602
3454
  command += ' --scopes=storage-full,gke-default'
3603
3455
 
3604
- if _SERVICE_ACCOUNT_FEATURE_FLAG:
3605
- service_account_name = get_service_account_name(args)
3606
- service_account_exists = check_if_service_account_exists(args)
3607
- if service_account_exists:
3608
- command += f' --service-account={service_account_name}'
3609
- else:
3610
- xpk_print(
3611
- f'Service Account: {service_account_name} does not exist in the'
3612
- ' project. Will attach the default service account to the node'
3613
- ' pools.'
3614
- )
3615
3456
  task = f'NodepoolCreate-{node_pool_name}'
3616
3457
  commands.append(command)
3617
3458
  task_names.append(task)
@@ -3624,7 +3465,7 @@ def run_gke_node_pool_create_command(args, system) -> int:
3624
3465
  continue
3625
3466
  command = (
3626
3467
  'gcloud beta container node-pools create'
3627
- f' {node_pool_name} --node-version={args.gke_version}'
3468
+ f' {node_pool_name} --node-version={gke_node_pool_version}'
3628
3469
  f' --cluster={args.cluster}'
3629
3470
  f' --project={args.project} --node-locations={args.zone}'
3630
3471
  f' --region={zone_to_region(args.zone)}'
@@ -4060,6 +3901,184 @@ def install_nccl_on_cluster(args) -> int:
4060
3901
  return 0
4061
3902
 
4062
3903
 
3904
+ @dataclass
3905
+ class GkeServerConfig:
3906
+ """Stores the valid gke versions based on gcloud recommendations."""
3907
+
3908
+ default_rapid_gke_version: str
3909
+ valid_master_versions: set[str]
3910
+ valid_node_versions: set[str]
3911
+
3912
+
3913
+ def get_gke_server_config(args) -> tuple[int, GkeServerConfig | None]:
3914
+ """Determine the GKE versions supported by gcloud currently.
3915
+
3916
+ Args:
3917
+ args: user provided arguments for running the command.
3918
+
3919
+ Returns:
3920
+ Tuple of
3921
+ int: 0 if successful and 1 otherwise.
3922
+ GkeServerConfig: stores valid gke version to use in node pool and cluster.
3923
+ """
3924
+ base_command = (
3925
+ 'gcloud container get-server-config'
3926
+ f' --project={args.project} --region={zone_to_region(args.zone)}'
3927
+ )
3928
+ default_rapid_gke_version_cmd = (
3929
+ base_command
3930
+ + ' --flatten="channels" --filter="channels.channel=RAPID"'
3931
+ ' --format="value(channels.defaultVersion)"'
3932
+ )
3933
+ valid_master_versions_cmd = (
3934
+ base_command
3935
+ + ' --flatten="channels" --format="value(validMasterVersions)"'
3936
+ )
3937
+ valid_node_versions_cmd = (
3938
+ base_command + ' --flatten="channels" --format="value(validNodeVersions)"'
3939
+ )
3940
+ base_command_description = 'Determine server supported GKE versions for '
3941
+
3942
+ server_config_commands_and_descriptions = [
3943
+ (
3944
+ default_rapid_gke_version_cmd,
3945
+ base_command_description + 'default rapid gke version',
3946
+ ),
3947
+ (
3948
+ valid_master_versions_cmd,
3949
+ base_command_description + 'valid master versions',
3950
+ ),
3951
+ (
3952
+ valid_node_versions_cmd,
3953
+ base_command_description + 'valid node versions',
3954
+ ),
3955
+ ]
3956
+ command_outputs = []
3957
+
3958
+ for command, command_description in server_config_commands_and_descriptions:
3959
+ return_code, cmd_output = run_command_for_value(
3960
+ command,
3961
+ command_description,
3962
+ args,
3963
+ hide_error=True,
3964
+ )
3965
+ if return_code != 0:
3966
+ xpk_print(f'Unable to get server config for {command_description}.')
3967
+ return return_code, None
3968
+ command_outputs.append(cmd_output)
3969
+
3970
+ return 0, GkeServerConfig(
3971
+ default_rapid_gke_version=command_outputs[0].strip(),
3972
+ valid_master_versions=set(command_outputs[1].split(';')),
3973
+ valid_node_versions=set(command_outputs[2].split(';')),
3974
+ )
3975
+
3976
+
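`get_gke_server_config` builds its version sets with `split(';')`. Assuming gcloud prints each list as semicolon-separated values, and possibly repeats it on several lines because of `--flatten="channels"`, a plain split can leave newline characters attached to entries. A defensive parse might look like the sketch below; `parse_version_list` and the sample string are illustrative assumptions, not xpk code or real gcloud output.

```python
import re


def parse_version_list(raw: str) -> set:
    """Hypothetical helper: split on ';' or whitespace and drop empties."""
    return {item for item in re.split(r'[;\s]+', raw) if item}


# Made-up sample: two lines of semicolon-separated versions.
sample = '1.29.1-gke.1589017;1.28.7-gke.1026000\n1.29.1-gke.1589017\n'

# The naive split keeps '\n' inside the second entry.
naive = set(sample.split(';'))

# The defensive parse yields clean, de-duplicated entries.
clean = parse_version_list(sample)

print('1.28.7-gke.1026000' in naive, '1.28.7-gke.1026000' in clean)
```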
3977
+ def get_gke_control_plane_version(
3978
+ args, gke_server_config: GkeServerConfig
3979
+ ) -> tuple[int, str | None]:
3980
+ """Determine gke control plane version for cluster creation.
3981
+
3982
+ Args:
3983
+ args: user provided arguments for running the command.
3984
+ gke_server_config: holds valid gke versions and recommended default version.
3985
+
3986
+ Returns:
3987
+ Tuple of
3988
+ int: 0 if successful and 1 otherwise.
3989
+ str: gke control plane version to use.
3990
+ """
3991
+
3992
+ # Override with the user-provided gke version if specified.
3993
+ if args.gke_version is not None:
3994
+ master_gke_version = args.gke_version
3995
+ else:
3996
+ master_gke_version = gke_server_config.default_rapid_gke_version
3997
+
3998
+ is_valid_master_version = (
3999
+ master_gke_version in gke_server_config.valid_master_versions
4000
+ )
4001
+ is_valid_node_version = (
4002
+ master_gke_version in gke_server_config.valid_node_versions
4003
+ )
4004
+
4005
+ if not is_valid_master_version or not is_valid_node_version:
4006
+ xpk_print(
4007
+ f'Planned GKE Version: {master_gke_version}\n Valid Master'
4008
+ f' Versions:\n{gke_server_config.valid_master_versions}\nValid Node'
4009
+ f' Versions:\n{gke_server_config.valid_node_versions}\nRecommended GKE'
4010
+ f' Version: {gke_server_config.default_rapid_gke_version}'
4011
+ )
4012
+ xpk_print(
4013
+ f'Error: Planned GKE Version {master_gke_version} is not valid. '
4014
+ f'Checks failed: Is Master Valid: {is_valid_master_version}'
4015
+ f'\nIs Valid Node Version: {is_valid_node_version}'
4016
+ )
4017
+ xpk_print(
4018
+ 'Please select a gke version from the above list using --gke-version=x'
4019
+ ' argument or rely on the default gke version:'
4020
+ f' {gke_server_config.default_rapid_gke_version}'
4021
+ )
4022
+ return 1, None
4023
+
4024
+ return 0, master_gke_version
4025
+
4026
+
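The selection rule in `get_gke_control_plane_version` can be summarized as: take the user's `--gke-version` if given, otherwise the RAPID channel default, and accept it only if it is both a valid master version and a valid node version. A minimal sketch under those assumptions; `ServerConfig` and `choose_control_plane_version` are stand-in names and the version values are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ServerConfig:
    """Stand-in for GkeServerConfig with the same three fields."""

    default_rapid_gke_version: str
    valid_master_versions: set[str]
    valid_node_versions: set[str]


def choose_control_plane_version(
    user_version: str | None, config: ServerConfig
) -> tuple[int, str | None]:
    """Prefer the user's version, else the RAPID default; validate either way."""
    if user_version is not None:
        version = user_version
    else:
        version = config.default_rapid_gke_version
    if (
        version not in config.valid_master_versions
        or version not in config.valid_node_versions
    ):
        return 1, None
    return 0, version


# Illustrative values only.
config = ServerConfig(
    default_rapid_gke_version='1.29.1-gke.1589017',
    valid_master_versions={'1.29.1-gke.1589017', '1.28.7-gke.1026000'},
    valid_node_versions={'1.29.1-gke.1589017'},
)

print(choose_control_plane_version(None, config))
print(choose_control_plane_version('1.28.7-gke.1026000', config))
```

The second call fails because the requested version is a valid master version but not a valid node version, which mirrors the two-part check in the diff.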
4027
+ def get_gke_node_pool_version(
4028
+ args, gke_server_config: GkeServerConfig
4029
+ ) -> tuple[int, str | None]:
4030
+ """Determine the gke node pool version for the node pool.
4031
+
4032
+ Args:
4033
+ args: user provided arguments for running the command.
4034
+ gke_server_config: holds valid gke versions and recommended default version.
4035
+
4036
+ Returns:
4037
+ Tuple of
4038
+ int: 0 if successful and 1 otherwise.
4039
+ str: gke node pool version to use.
4040
+ """
4041
+
4042
+ # By default use the current gke master version for creating node pools.
4043
+ command_description = 'Determine current gke master version'
4044
+ command = (
4045
+ f'gcloud beta container clusters describe {args.cluster}'
4046
+ f' --region {zone_to_region(args.zone)} --project {args.project}'
4047
+ ' --format="value(currentMasterVersion)"'
4048
+ )
4049
+
4050
+ return_code, current_gke_master_version = run_command_for_value(
4051
+ command, command_description, args
4052
+ )
4053
+ if return_code != 0:
4054
+ xpk_print(
4055
+ f'Unable to get server config for command: {command_description}.'
4056
+ )
4057
+ return return_code, None
4058
+
4059
+ # Override with the user-provided gke version if specified.
4060
+ if args.gke_version is not None:
4061
+ node_pool_gke_version = args.gke_version
4062
+ else:
4063
+ node_pool_gke_version = current_gke_master_version.strip()
4064
+
4065
+ is_supported_node_pool_version = (
4066
+ node_pool_gke_version in gke_server_config.valid_node_versions
4067
+ )
4068
+ # In rare cases, the user-provided gke version may be invalid, but gke will return an error if so.
4069
+ # An example scenario is if the user-provided gke version is greater than the master version.
4070
+ if not is_supported_node_pool_version:
4071
+ xpk_print(
4072
+ f'Planned node pool version {node_pool_gke_version} is not supported in'
4073
+ ' valid node_pool_gke_versions'
4074
+ f' {gke_server_config.valid_node_versions}. Please adjust the gke version'
4075
+ ' using --gke-version=x or remove the arg and depend on xpk default of'
4076
+ f' {current_gke_master_version}'
4077
+ )
4078
+ return 1, None
4079
+ return 0, node_pool_gke_version
4080
+
4081
+
4063
4082
  ################### Subcommand Functions ###################
4064
4083
  def default_subcommand_function(
4065
4084
  _args,
@@ -4097,26 +4116,19 @@ def cluster_create(args) -> int:
4097
4116
  xpk_print(f'Starting cluster create for cluster {args.cluster}:', flush=True)
4098
4117
  add_zone_and_project(args)
4099
4118
 
4100
- if _SERVICE_ACCOUNT_FEATURE_FLAG:
4101
- service_account_name = get_service_account_name(args)
4102
- service_account_exists = check_if_service_account_exists(args)
4103
- if service_account_exists:
4104
- xpk_print(
4105
- f'Service Account: {service_account_name} already exist in the'
4106
- ' project. Will not create a new service account.'
4107
- )
4108
- else:
4109
- # create a service account in the project
4110
- create_service_account_code = create_service_account(args)
4111
- if create_service_account_code != 0:
4112
- xpk_exit(create_service_account_code)
4119
+ return_code, gke_server_config = get_gke_server_config(args)
4120
+ if return_code != 0:
4121
+ xpk_exit(return_code)
4113
4122
 
4114
- # add IAM roles to the service account
4115
- add_roles_to_service_account_code = add_roles_to_service_account(args)
4116
- if add_roles_to_service_account_code != 0:
4117
- xpk_exit(add_roles_to_service_account_code)
4123
+ return_code, gke_control_plane_version = get_gke_control_plane_version(
4124
+ args, gke_server_config
4125
+ )
4126
+ if return_code != 0:
4127
+ xpk_exit(return_code)
4118
4128
 
4119
- create_cluster_command_code = create_cluster_if_necessary(args)
4129
+ create_cluster_command_code = create_cluster_if_necessary(
4130
+ args, gke_control_plane_version
4131
+ )
4120
4132
  if create_cluster_command_code != 0:
4121
4133
  xpk_exit(create_cluster_command_code)
4122
4134
 
@@ -4144,8 +4156,16 @@ def cluster_create(args) -> int:
4144
4156
  if create_cluster_network_config_code != 0:
4145
4157
  xpk_exit(create_cluster_network_config_code)
4146
4158
 
4159
+ # Check the control plane version of the cluster and determine the node pool
4160
+ # version to use.
4161
+ return_code, gke_node_pool_version = get_gke_node_pool_version(
4162
+ args, gke_server_config
4163
+ )
4164
+ if return_code != 0:
4165
+ xpk_exit(return_code)
4166
+
4147
4167
  run_gke_node_pool_create_command_code = run_gke_node_pool_create_command(
4148
- args, system
4168
+ args, system, gke_node_pool_version
4149
4169
  )
4150
4170
  if run_gke_node_pool_create_command_code != 0:
4151
4171
  xpk_exit(run_gke_node_pool_create_command_code)
@@ -4818,7 +4838,7 @@ def get_volume_mounts(args, system: SystemCharacteristics) -> str:
4818
4838
  return regular_volume_mount_yaml
4819
4839
 
4820
4840
 
4821
- def get_pathways_rm_args(args) -> str:
4841
+ def get_pathways_rm_args(args, system: SystemCharacteristics) -> str:
4822
4842
  """Arguments for the Pathways resource manager.
4823
4843
  Args:
4824
4844
  args: user provided arguments for running the command.
@@ -4833,13 +4853,56 @@ def get_pathways_rm_args(args) -> str:
4833
4853
  - --pathways_persistent_compilation_cache=false
4834
4854
  - --pathways_compilation_mode=compile_at_worker
4835
4855
  - --pathways_tmp_dir_pattern={args.pathways_gcs_location}
4836
- - --pathways_resource_manager_expected_num_worker_jobs={args.num_slices}"""
4856
+ - --pathways_expected_instances={expected_instances}"""
4837
4857
  if args.use_pathways:
4838
- return yaml.format(args=args)
4858
+ return yaml.format(
4859
+ args=args,
4860
+ expected_instances=compute_pathways_expected_instances(args, system),
4861
+ )
4839
4862
  else:
4840
4863
  return ''
4841
4864
 
4842
4865
 
4866
+ def compute_pathways_expected_instances(
4867
+ args, system: SystemCharacteristics
4868
+ ) -> str:
4869
+ """Computes the expected instances from the system characteristics.
4870
+ Args:
4871
+ args: user provided args.
4872
+ system: system characteristics.
4873
+
4874
+ Returns:
4875
+ str: formatted string representing the expected instances (eg:
4876
+ "tpuv4:2x2x2,tpuv4:2x2x2" for 2 slices of v4-16).
4877
+ """
4878
+ expected_instances = ','.join([
4879
+ f'tpu{get_pathways_expected_tpu_type(system.device_type)}:{system.topology}'
4880
+ for _ in range(args.num_slices)
4881
+ ])
4882
+
4883
+ xpk_print(f'Pathways expected instances are: {expected_instances}')
4884
+ return expected_instances
4885
+
4886
+
4887
+ def get_pathways_expected_tpu_type(device_type: str) -> str:
4888
+ """Returns the device type expected by Pathways
4889
+ Args:
4890
+ device_type: the system characteristic device type
4891
+
4892
+ Returns:
4893
+ str: the device type expected by pathways.
4894
+ """
4895
+ raw_type = device_type.split('-')[0].lower()
4896
+ pathways_expected_instance = PathwaysExpectedInstancesMap.get(raw_type)
4897
+ if not pathways_expected_instance:
4898
+ xpk_print(
4899
+ f'Passed in device_type {device_type} is incorrect. Please pass in a'
4900
+ ' valid device type'
4901
+ )
4902
+ xpk_exit(1)
4903
+ return pathways_expected_instance
4904
+
4905
+
4843
4906
  def get_pathways_worker_args(args) -> str:
4844
4907
  """Arguments for the Pathways workers.
4845
4908
  Args:
@@ -5516,7 +5579,7 @@ def workload_create(args) -> int:
5516
5579
  system.accelerator_type, system
5517
5580
  ),
5518
5581
  machine_label=create_machine_label(system.accelerator_type, system),
5519
- pathways_rm_args=get_pathways_rm_args(args),
5582
+ pathways_rm_args=get_pathways_rm_args(args, system),
5520
5583
  pathways_worker_args=get_pathways_worker_args(args),
5521
5584
  pathways_proxy_args=get_pathways_proxy_args(args),
5522
5585
  resource_type=resource_type,
@@ -5761,6 +5824,7 @@ def get_workload_list(args) -> tuple[int, str]:
5761
5824
  f' with filter-by-jobs={args.filter_by_job}',
5762
5825
  args,
5763
5826
  )
5827
+
5764
5828
  return return_code, return_value
5765
5829
 
5766
5830
 
@@ -5869,7 +5933,7 @@ def workload_list(args) -> int:
5869
5933
  if return_code != 0:
5870
5934
  xpk_print(f'List Job request returned ERROR {return_code}')
5871
5935
  xpk_exit(return_code)
5872
- xpk_print(return_value)
5936
+ xpk_print(f'Workload List Output:\n{return_value}')
5873
5937
  xpk_exit(0)
5874
5938
 
5875
5939
 
@@ -6380,10 +6444,10 @@ cluster_create_optional_arguments.add_argument(
6380
6444
  cluster_create_optional_arguments.add_argument(
6381
6445
  '--gke-version',
6382
6446
  type=str,
6383
- default=default_gke_version,
6384
6447
  help=(
6385
- 'The GKE version of the cluster and respective clusters. The default is'
6386
- f' "{default_gke_version}".'
6448
+ 'The GKE version of the cluster and respective clusters. The'
6449
+ ' default is'
6450
+ ' determined dynamically based on the RAPID channel recommended version.'
6387
6451
  ),
6388
6452
  )
6389
6453
  cluster_create_optional_arguments.add_argument(
File without changes
File without changes
File without changes
File without changes
File without changes
File without changes
File without changes