pathways-cli 0.1.0__py3-none-any.whl

@@ -0,0 +1,122 @@
+ Metadata-Version: 2.4
+ Name: pathways-cli
+ Version: 0.1.0
+ Summary: Pathways CLI to easily bring up pathways clusters.
+ Author-email: Sam Stoelinga <sammiestoel@gmail.com>
+ Requires-Python: >=3.12
+ Requires-Dist: click>=8.4.1
+ Requires-Dist: python-dotenv>=1.2.2
+ Description-Content-Type: text/markdown
+
+ # `pwy`: Standalone Pathways GKE Cluster CLI Tool
+
+ `pwy` is a lightweight, standalone Python CLI that generates, applies, and deletes Pathways workloads on Google Kubernetes Engine (GKE) using Kubernetes JobSets.
+
+ ---
+
+ ## Features
+
+ - **Automated TPU Topology Calculations**: Translates simple TPU resource types (`v6e-4`, `v6e-16`, etc.) into GKE topologies, VM counts, and instance settings.
+ - **Spot VM Support**: Injects GKE node selectors and tolerations for running workloads on cost-effective Spot VMs.
+ - **Colocated Python Support**: Simplifies distributed checkpointing (e.g. via Orbax) by configuring colocated host CPU sidecars and proxy endpoints automatically.
+ - **Interactive & Batch Execution**: Starts Pathways servers with a `sleep infinity` client for interactive debugging, or runs a training script directly.
+ - **Dry-run Manifest Generation**: Preview and inspect the GKE JobSet manifest without applying it to the cluster.
+
+ ---
+
+ ## Installation
+
+ This project uses [uv](https://github.com/astral-sh/uv) for Python package and dependency management.
+
+ To sync the environment and install `pwy`:
+
+ ```bash
+ uv sync
+ ```
+
+ ---
+
+ ## Usage
+
+ Invoke `pwy` commands with `uv run`.
+
+ ### 1. Provision / Preview a Cluster (`pwy up`)
+
+ Starts a Pathways JobSet, or with `--dry-run` prints the generated manifest without applying it.
+
+ ```bash
+ uv run pwy up \
+   --tpu-type v6e-16 \
+   --gcs-scratch-location gs://my-bucket/pathways-staging \
+   --num-slices 1 \
+   --dry-run
+ ```
+
+ #### Key Options:
+ - `--tpu-type`: **(Required)** TPU type (e.g., `v6e-4`, `v6e-8`, `v6e-16`, `v6e-32`, `v6e-64`).
+ - `--gcs-scratch-location`: **(Required)** GCS scratch path for Pathways synchronization.
+ - `--num-slices`: Number of TPU slices to run (default: `1`).
+ - `--jax-client-image`: Custom client container image (default: `python:3.12-slim`).
+ - `--command`: Run a custom training/eval script in the client container. If omitted, defaults to `sleep infinity` (interactive mode).
+ - `--enable-spot`: Adds the `cloud.google.com/gke-spot` node selector and toleration so workloads can be scheduled on Spot VMs.
+ - `--colocated-python`: Adds a `colocated-python` sidecar (init container) to the GKE workers and passes `--sidecar_name=external` to the proxy.
+ - `--dry-run`: Prints the generated YAML to stdout instead of calling `kubectl apply`.
+ - `--name`: Name of the Kubernetes JobSet resource (default: `pathways-interactive`).
+ - `--namespace`: Target Kubernetes namespace (default: `default`).
+
+ ---
+
+ ### 2. Teardown a Cluster (`pwy down`)
+
+ Deletes the running Pathways JobSet.
+
+ ```bash
+ uv run pwy down --name pathways-interactive --namespace default
+ ```
+
+ ---
+
+ ### 3. Verification Example
+
+ Once the interactive cluster is running, you can verify execution by `exec`ing into the client container:
+
+ 1. **Find the client pod name**:
+    ```bash
+    POD_NAME=$(kubectl get pods -l jobset.sigs.k8s.io/jobset-name=pathways-interactive,jobset.sigs.k8s.io/replicatedjob-name=pwhd -o jsonpath='{.items[0].metadata.name}')
+    ```
+
+ 2. **Install JAX and Pathways utils**:
+    ```bash
+    kubectl exec "$POD_NAME" -c client -- pip install jax pathwaysutils
+    ```
+
+ 3. **Run a Python snippet to initialize and list devices**:
+    ```bash
+    kubectl exec "$POD_NAME" -c client -- python3 -c "import pathwaysutils; pathwaysutils.initialize(); import jax; print(jax.devices())"
+    ```
+
+ The command should print the available virtual TPU devices (e.g., coordinates and memory spaces of the allocated chips).
+
+ ---
+
+ ## TPU Type Mappings
+
+ `pwy` derives the GKE topology, the number of VMs per slice, and the resource-manager (RM) instance type from the TPU type according to the following table:
+
+ | TPU Type | GKE Topology | VMs Per Slice | RM Instance Type |
+ | :--- | :--- | :--- | :--- |
+ | `v6e-4` | `2x2` | 1 | `tpuv6e:2x2` |
+ | `v6e-8` | `2x4` | 2 | `tpuv6e:2x4` |
+ | `v6e-16` | `4x4` | 4 | `tpuv6e:4x4` |
+ | `v6e-32` | `4x8` | 8 | `tpuv6e:4x8` |
+ | `v6e-64` | `8x8` | 16 | `tpuv6e:8x8` |
+
+ ---
+
+ ## Running Tests
+
+ To execute the unit test suite:
+
+ ```bash
+ uv run pytest
+ ```
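The mapping table in the README follows a regular pattern that can be reproduced in a few lines. The sketch below is illustrative only: it assumes, from the table rows and from the `google.com/tpu: 4` limit in the worker template, that each v6e VM hosts 4 chips. `derive_mapping`, `CHIPS_PER_VM`, and `KNOWN_TOPOLOGIES` are hypothetical names and are not part of `pwy`.

```python
# Hypothetical helper, not part of pwy. Assumes 4 TPU chips per v6e VM.
CHIPS_PER_VM = 4

# Topologies copied from the README table, keyed by chip count.
KNOWN_TOPOLOGIES = {4: "2x2", 8: "2x4", 16: "4x4", 32: "4x8", 64: "8x8"}


def derive_mapping(tpu_type: str) -> dict:
    """Derive topology, VM count, and RM instance type from e.g. 'v6e-16'."""
    generation, _, chips_text = tpu_type.partition("-")
    if generation != "v6e" or not chips_text.isdigit():
        raise ValueError(f"Unsupported TPU type: {tpu_type}")
    chips = int(chips_text)
    if chips not in KNOWN_TOPOLOGIES:
        raise ValueError(f"Unsupported TPU type: {tpu_type}")
    topology = KNOWN_TOPOLOGIES[chips]
    return {
        "topology": topology,
        "vms_per_slice": chips // CHIPS_PER_VM,
        "rm_type": f"tpuv6e:{topology}",
    }


if __name__ == "__main__":
    for name in ("v6e-4", "v6e-16", "v6e-64"):
        print(name, derive_mapping(name))
```

The package itself stores these values in a literal dictionary (`TPU_MAPPINGS`), which is simpler to audit. A derivation like this is mainly useful as a cross-check that the table rows are consistent with each other.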
@@ -0,0 +1,9 @@
+ pwy/__init__.py,sha256=FjXvQgxYlSpOMIL8zRR5_XdoZtveGIRjiMiae3yO3aE,45
+ pwy/cli.py,sha256=jLyjjz68kdDiZ2Mqt63wm0Ty8XmKcc63Uatt9WiXsKE,3258
+ pwy/generator.py,sha256=ZSelmMDOAcwI-ooD458ILPh42BGs3bOESmFT--Hr84k,4915
+ pwy/kubernetes.py,sha256=yxxCHVaQVwvZkppgxRMSoZNRv8RC-V8lRKupmtD1GJM,646
+ pwy/templates.py,sha256=g6NlFzhbBzcMudLrYmJsF0aboCUulgYAUON0GSTbygk,7750
+ pathways_cli-0.1.0.dist-info/METADATA,sha256=DphQjqzCWOD1FkeWH6VqUE9JDwN-hJcT3_uA0896nBI,4288
+ pathways_cli-0.1.0.dist-info/WHEEL,sha256=QccIxa26bgl1E6uMy58deGWi-0aeIkkangHcxk2kWfw,87
+ pathways_cli-0.1.0.dist-info/entry_points.txt,sha256=2GvpxKRrXF7mDFvGxcsR_2ut2eZHzAlLkgAo_qi5NME,33
+ pathways_cli-0.1.0.dist-info/RECORD,,
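Each `RECORD` row has the form `path,sha256=<digest>,<size>`. In the wheel format the digest is the URL-safe base64 encoding of the SHA-256 hash with the trailing `=` padding removed, and `RECORD` lists itself with empty hash and size fields. A minimal sketch for checking an extracted file against its row follows; `record_digest` and `matches_record` are hypothetical helper names, not part of any packaging library.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Return the RECORD-style hash field for the given file contents."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"sha256={encoded}"


def matches_record(data: bytes, record_row: str) -> bool:
    """Compare file contents with one 'path,hash,size' row from RECORD."""
    _path, expected_hash, expected_size = record_row.rsplit(",", 2)
    if not expected_hash:
        # RECORD lists itself without a hash or size; nothing to verify.
        return True
    return record_digest(data) == expected_hash and len(data) == int(expected_size)


if __name__ == "__main__":
    row = f"pkg/example.py,{record_digest(b'print(1)')},8"
    print(matches_record(b"print(1)", row))
```

To check a real file, read it in binary mode and pass the bytes together with the matching row, for example the `pwy/cli.py` row above.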
@@ -0,0 +1,4 @@
+ Wheel-Version: 1.0
+ Generator: hatchling 1.29.0
+ Root-Is-Purelib: true
+ Tag: py3-none-any
@@ -0,0 +1,2 @@
+ [console_scripts]
+ pwy = pwy:main
pwy/__init__.py ADDED
@@ -0,0 +1,3 @@
+ from pwy.cli import main
+
+ __all__ = ["main"]
pwy/cli.py ADDED
@@ -0,0 +1,68 @@
+ import sys
+ import click
+ from pwy.generator import generate_yaml
+ from pwy.kubernetes import apply_manifest, delete_jobset
+
+ @click.group()
+ def main():
+     """pwy: Standalone Pathways GKE Cluster CLI Tool"""
+     pass
+
+ @main.command()
+ @click.option("--tpu-type", required=True, help="TPU type (e.g., v6e-4, v6e-8, v6e-16, v6e-32, v6e-64)")
+ @click.option("--gcs-scratch-location", required=True, help="GCS scratch location (e.g., gs://bucket/staging)")
+ @click.option("--num-slices", default=1, type=int, show_default=True, help="Number of TPU slices")
+ @click.option("--jax-client-image", default="python:3.12-slim", show_default=True, help="Image for the JAX client container")
+ @click.option("--command", default=None, help="Command to run in the JAX client container (defaults to sleep infinity)")
+ @click.option("--enable-spot", is_flag=True, default=False, help="Enable spot VM scheduling")
+ @click.option("--colocated-python", is_flag=True, default=False, help="Enable colocated python sidecars")
+ @click.option("--dry-run", is_flag=True, default=False, help="Dry run: print generated YAML to stdout instead of applying it")
+ @click.option("--name", default="pathways-interactive", show_default=True, help="Name of the JobSet resource")
+ @click.option("--namespace", default="default", show_default=True, help="Kubernetes namespace")
+ def up(tpu_type, gcs_scratch_location, num_slices, jax_client_image, command, enable_spot, colocated_python, dry_run, name, namespace):
+     """Starts the Pathways cluster or dry-runs the configuration."""
+     try:
+         yaml_content = generate_yaml(
+             name=name,
+             namespace=namespace,
+             tpu_type=tpu_type,
+             gcs_scratch_location=gcs_scratch_location,
+             num_slices=num_slices,
+             jax_client_image=jax_client_image,
+             command=command,
+             enable_spot=enable_spot,
+             colocated_python=colocated_python,
+         )
+     except ValueError as e:
+         click.secho(f"Error: {e}", fg="red", err=True)
+         sys.exit(1)
+
+     if dry_run:
+         click.echo(yaml_content)
+         return
+
+     click.echo(f"Applying Pathways JobSet manifest for '{name}' in namespace '{namespace}'...")
+     process = apply_manifest(yaml_content)
+     if process.returncode != 0:
+         click.secho("Failed to apply JobSet manifest.", fg="red", err=True)
+         click.echo(process.stderr.decode("utf-8"), err=True)
+         sys.exit(process.returncode)
+
+     click.secho(f"Successfully applied JobSet '{name}'!", fg="green")
+
+ @main.command()
+ @click.option("--name", default="pathways-interactive", show_default=True, help="Name of the JobSet resource")
+ @click.option("--namespace", default="default", show_default=True, help="Kubernetes namespace")
+ def down(name, namespace):
+     """Tears down the Pathways cluster JobSet resource."""
+     click.echo(f"Deleting Pathways JobSet '{name}' in namespace '{namespace}'...")
+     process = delete_jobset(name, namespace)
+     if process.returncode != 0:
+         click.secho("Failed to delete JobSet.", fg="red", err=True)
+         click.echo(process.stderr.decode("utf-8"), err=True)
+         sys.exit(process.returncode)
+
+     click.secho(f"Successfully deleted JobSet '{name}'!", fg="green")
+
+ if __name__ == "__main__":
+     main()
pwy/generator.py ADDED
@@ -0,0 +1,122 @@
+ from pwy.templates import YAML_TEMPLATE
+
+ TPU_MAPPINGS = {
+     "v6e-4": {"topology": "2x2", "vms_per_slice": 1, "rm_type": "tpuv6e:2x2"},
+     "v6e-8": {"topology": "2x4", "vms_per_slice": 2, "rm_type": "tpuv6e:2x4"},
+     "v6e-16": {"topology": "4x4", "vms_per_slice": 4, "rm_type": "tpuv6e:4x4"},
+     "v6e-32": {"topology": "4x8", "vms_per_slice": 8, "rm_type": "tpuv6e:4x8"},
+     "v6e-64": {"topology": "8x8", "vms_per_slice": 16, "rm_type": "tpuv6e:8x8"},
+ }
+
+ def get_colocated_python_image(client_image: str) -> str:
+     """Derive the colocated-python image from the client image's repository and tag."""
+     if "/" in client_image and ":" in client_image:
+         try:
+             path, tag = client_image.rsplit(":", 1)
+             repo, _ = path.rsplit("/", 1)
+             return f"{repo}/colocated-python:{tag}"
+         except ValueError:
+             # Unpacking failed, e.g. "host:5000/image" where the last ":" is a port, not a tag.
+             pass
+     return "us-docker.pkg.dev/cloud-tpu-v2-images/pathways/colocated-python:jax-0.10.0"
+
+ def generate_yaml(
+     name: str,
+     namespace: str,
+     tpu_type: str,
+     gcs_scratch_location: str,
+     num_slices: int = 1,
+     jax_client_image: str = "python:3.12-slim",
+     command: str | None = None,
+     enable_spot: bool = False,
+     colocated_python: bool = False,
+ ) -> str:
+     if tpu_type not in TPU_MAPPINGS:
+         raise ValueError(
+             f"Unsupported TPU type: {tpu_type}. Supported types: {list(TPU_MAPPINGS.keys())}"
+         )
+
+     mapping = TPU_MAPPINGS[tpu_type]
+     gke_topology = mapping["topology"]
+     vms_per_slice = mapping["vms_per_slice"]
+     rm_instance_type = mapping["rm_type"]
+
+     # Format client execution command
+     if not command:
+         client_command = "sleep infinity"
+     else:
+         client_command = command
+
+     # Format Spot VM Node Selector and Tolerations.
+     # The snippets carry their own indentation so they line up with the template's pod spec.
+     if enable_spot:
+         spot_toleration_head = (
+             '            - key: "cloud.google.com/gke-spot"\n'
+             '              operator: "Equal"\n'
+             '              value: "true"\n'
+             '              effect: "NoSchedule"'
+         )
+         spot_node_selector_worker = '              cloud.google.com/gke-spot: "true"'
+         spot_toleration_worker = (
+             '            - key: "cloud.google.com/gke-spot"\n'
+             '              operator: "Equal"\n'
+             '              value: "true"\n'
+             '              effect: "NoSchedule"'
+         )
+     else:
+         spot_toleration_head = ""
+         spot_node_selector_worker = ""
+         spot_toleration_worker = ""
+
+     # Format colocated python options
+     if colocated_python:
+         proxy_sidecar_arg = "\n              - --sidecar_name=external"
+         tpu_premapped_buffer_size = 34359738368  # 32 GiB
+         colocated_img = get_colocated_python_image(jax_client_image)
+         # The mount below refers to a `shared-memory` volume, which the worker pod spec must declare.
+         worker_init_containers = (
+             "            initContainers:\n"
+             "            - name: colocated-python\n"
+             f"              image: {colocated_img}\n"
+             "              imagePullPolicy: Always\n"
+             "              restartPolicy: Always\n"
+             "              ports:\n"
+             "              - containerPort: 50051\n"
+             "                protocol: TCP\n"
+             "              env:\n"
+             "              - name: CLOUD_PATHWAYS_SIDECAR_SHM_DIRECTORY\n"
+             "                value: /tmp/ifrt_proxy\n"
+             "              - name: GRPC_SERVER_ADDRESS\n"
+             "                value: 0.0.0.0:50051\n"
+             "              volumeMounts:\n"
+             "              - name: shared-memory\n"
+             "                mountPath: /tmp/ifrt_proxy"
+         )
+     else:
+         proxy_sidecar_arg = ""
+         tpu_premapped_buffer_size = 274877906944  # 256 GiB
+         worker_init_containers = ""
+
+     # Interpolate variables in the template
+     yaml_content = YAML_TEMPLATE.format(
+         NAME=name,
+         NAMESPACE=namespace,
+         CLIENT_IMAGE=jax_client_image,
+         CLIENT_EXECUTION_COMMAND=client_command,
+         TPU_TYPE=tpu_type,
+         NUM_SLICES=num_slices,
+         RM_INSTANCE_TYPE=rm_instance_type,
+         GCS_SCRATCH_LOCATION=gcs_scratch_location,
+         GKE_TOPOLOGY=gke_topology,
+         VMS_PER_SLICE=vms_per_slice,
+         SPOT_TOLERATION_HEAD=spot_toleration_head,
+         SPOT_NODE_SELECTOR_WORKER=spot_node_selector_worker,
+         SPOT_TOLERATION_WORKER=spot_toleration_worker,
+         PROXY_SIDECAR_ARG=proxy_sidecar_arg,
+         TPU_PREMAPPED_BUFFER_SIZE=tpu_premapped_buffer_size,
+         WORKER_INIT_CONTAINERS=worker_init_containers,
+     )
+
+     # Drop whitespace-only lines left behind by empty optional placeholders.
+     # Completely empty lines are kept; they are harmless in YAML.
+     lines = []
+     for line in yaml_content.splitlines():
+         if line.strip() or line == "":
+             lines.append(line)
+     return "\n".join(lines) + "\n"
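To make the image-derivation rule in `get_colocated_python_image` concrete, the sketch below mirrors that function's logic and runs it on a few inputs. The registry paths are made-up examples, `DEFAULT_IMAGE` is a name introduced here for readability, and the sketch catches only `ValueError`, which is the exception the tuple unpacking can raise.

```python
DEFAULT_IMAGE = "us-docker.pkg.dev/cloud-tpu-v2-images/pathways/colocated-python:jax-0.10.0"


def get_colocated_python_image(client_image: str) -> str:
    """Swap the image name for 'colocated-python', keeping repository and tag."""
    if "/" in client_image and ":" in client_image:
        try:
            path, tag = client_image.rsplit(":", 1)
            repo, _ = path.rsplit("/", 1)
            return f"{repo}/colocated-python:{tag}"
        except ValueError:
            # path contained no "/", so it could not be split into repo and name.
            pass
    return DEFAULT_IMAGE


if __name__ == "__main__":
    examples = [
        "example.io/team/jax-client:v1",   # repository and tag are reused
        "python:3.12-slim",                # no "/" in the name: falls back to the default
        "localhost:5000/jax-client",       # ":" is a registry port, not a tag: default
    ]
    for image in examples:
        print(f"{image} -> {get_colocated_python_image(image)}")
```

One consequence of splitting on the last `:` is that a digest reference such as `repo/name@sha256:...` would be treated as if the digest were a tag, so the derived image may not exist. Passing a tagged client image avoids that case.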
pwy/kubernetes.py ADDED
@@ -0,0 +1,18 @@
+ import subprocess
+
+ def apply_manifest(yaml_content: str) -> subprocess.CompletedProcess:
+     """Applies the YAML manifest using kubectl apply -f -."""
+     process = subprocess.run(
+         ["kubectl", "apply", "-f", "-"],
+         input=yaml_content.encode("utf-8"),
+         capture_output=True,
+     )
+     return process
+
+ def delete_jobset(name: str, namespace: str) -> subprocess.CompletedProcess:
+     """Deletes the JobSet using kubectl delete jobset <name> --namespace=<namespace>."""
+     process = subprocess.run(
+         ["kubectl", "delete", "jobset", name, f"--namespace={namespace}"],
+         capture_output=True,
+     )
+     return process
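Both helpers assume `kubectl` is on `PATH`. If it is not, `subprocess.run` raises `FileNotFoundError` instead of returning a non-zero exit code, so the `returncode` checks in `pwy/cli.py` would never run. One possible guard is sketched below with a hypothetical `run_kubectl` wrapper; the wrapper name, the `binary` parameter, and the choice of exit code 127 (borrowed from the shell convention for "command not found") are assumptions, not part of the package.

```python
import subprocess


def run_kubectl(args, input_text=None, binary="kubectl"):
    """Run kubectl and always return a CompletedProcess, even if the binary is missing."""
    command = [binary, *args]
    try:
        return subprocess.run(
            command,
            input=input_text.encode("utf-8") if input_text is not None else None,
            capture_output=True,
        )
    except FileNotFoundError:
        # Report the missing binary through the same channel as a kubectl failure.
        return subprocess.CompletedProcess(
            args=command,
            returncode=127,
            stdout=b"",
            stderr=f"{binary} not found on PATH\n".encode("utf-8"),
        )


if __name__ == "__main__":
    result = run_kubectl(["version", "--client"])
    print(result.returncode, result.stderr.decode("utf-8", errors="replace"))
```

With a wrapper like this, `apply_manifest` could call `run_kubectl(["apply", "-f", "-"], input_text=yaml_content)` and the CLI's existing error branch would print the message and exit, rather than surfacing a traceback.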
pwy/templates.py ADDED
@@ -0,0 +1,197 @@
+ YAML_TEMPLATE = """apiVersion: jobset.x-k8s.io/v1alpha2
+ kind: JobSet
+ metadata:
+   name: {NAME}
+   namespace: {NAMESPACE}
+ spec:
+   failurePolicy:
+     maxRestarts: 0
+     restartStrategy: BlockingRecreate
+   replicatedJobs:
+   # -------------------------------------------------------------------------
+   # 1. Pathways Head (Client Pod)
+   # -------------------------------------------------------------------------
+   - name: pwhd
+     replicas: 1
+     template:
+       spec:
+         parallelism: 1
+         completions: 1
+         backoffLimit: 32
+         template:
+           metadata:
+             annotations:
+               cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
+           spec:
+             terminationGracePeriodSeconds: 60
+             restartPolicy: Never
+             hostAliases:
+             - ip: 169.254.169.254
+               hostnames:
+               - metadata
+               - metadata.google.internal
+             tolerations:
+             - key: google.com/tpu
+               operator: Equal
+               value: "present"
+               effect: NoSchedule
+ {SPOT_TOLERATION_HEAD}
+             containers:
+             - name: client
+               image: {CLIENT_IMAGE}
+               command:
+               - bash
+               - -c
+               - |
+                 {CLIENT_EXECUTION_COMMAND}
+               resources:
+                 requests:
+                   cpu: "1000m"
+                   memory: "16Gi"
+                 limits:
+                   cpu: "1000m"
+                   memory: "16Gi"
+               env:
+               - name: TPU_TYPE
+                 value: {TPU_TYPE}
+               - name: NUM_TPU_SLICES
+                 valueFrom:
+                   fieldRef:
+                     fieldPath: metadata.labels['jobset.sigs.k8s.io/replicatedjob-replicas']
+               - name: JAX_BACKEND_TARGET
+                 value: grpc://localhost:29000
+               - name: XCLOUD_ENVIRONMENT
+                 value: GCP
+               - name: JAX_PLATFORMS
+                 value: proxy
+               - name: ENABLE_PATHWAYS_PERSISTENCE
+                 value: "1"
+               - name: TPU_SKIP_MDS_QUERY
+                 value: "true"
+               - name: PYTHONUNBUFFERED
+                 value: "1"
+               - name: TEST_UNDECLARED_OUTPUTS_DIR
+                 value: "true"
+               - name: IFRT_PROXY_LARGE_TRANSFER_THRESHOLD
+                 value: "1"
+               - name: IFRT_PROXY_LARGE_TRANSFER_OPTIMIZATION_DIRECTORY
+                 value: /tmp/ifrt_proxy
+               volumeMounts:
+               - name: shared-memory
+                 mountPath: /tmp/ifrt_proxy
+               imagePullPolicy: Always
+             initContainers:
+             - name: pathways-proxy
+               image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/proxy_server:jax-0.10.0
+               restartPolicy: Always
+               ports:
+               - containerPort: 29000
+               env:
+               - name: IFRT_PROXY_USE_INSECURE_GRPC_CREDENTIALS
+                 value: "true"
+               - name: IFRT_PROXY_LARGE_TRANSFER_OPTIMIZATION_DIRECTORY
+                 value: /tmp/ifrt_proxy
+               args:
+               - --resource_manager_address=localhost:29001
+               - --server_port=29000{PROXY_SIDECAR_ARG}
+               volumeMounts:
+               - name: shared-memory
+                 mountPath: /tmp/ifrt_proxy
+             - name: pathways-rm
+               image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server:jax-0.10.0
+               restartPolicy: Always
+               env:
+               - name: TPU_SKIP_MDS_QUERY
+                 value: "true"
+               args:
+               - --server_port=29001
+               - --node_type=resource_manager
+               - --instance_count={NUM_SLICES}
+               - --instance_type={RM_INSTANCE_TYPE}
+               - --gcs_scratch_location={GCS_SCRATCH_LOCATION}
+               - --enforce_kernel_ipv6_support=false
+             volumes:
+             - name: shared-memory
+               emptyDir:
+                 medium: Memory
+             serviceAccountName: default
+             dnsPolicy: ClusterFirstWithHostNet
+
+   # -------------------------------------------------------------------------
+   # 2. Pathways Workers (TPU Pods)
+   # -------------------------------------------------------------------------
+   - name: pwwk
+     replicas: {NUM_SLICES}
+     template:
+       metadata:
+         annotations:
+           alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
+       spec:
+         parallelism: {VMS_PER_SLICE}
+         completions: {VMS_PER_SLICE}
+         backoffLimit: 32
+         template:
+           spec:
+             terminationGracePeriodSeconds: 60
+             hostAliases:
+             - ip: 169.254.169.254
+               hostnames:
+               - metadata
+               - metadata.google.internal
+             nodeSelector:
+               cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
+               cloud.google.com/gke-tpu-topology: {GKE_TOPOLOGY}
+ {SPOT_NODE_SELECTOR_WORKER}
+             tolerations:
+             - key: google.com/tpu
+               operator: Equal
+               value: "present"
+               effect: NoSchedule
+ {SPOT_TOLERATION_WORKER}
+ {WORKER_INIT_CONTAINERS}
+             containers:
+             - name: worker
+               image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server:jax-0.10.0
+               imagePullPolicy: Always
+               ports:
+               - containerPort: 8471
+               - containerPort: 8080
+               - containerPort: 8431
+               - containerPort: 9000
+               - containerPort: 29001
+               securityContext:
+                 privileged: true
+               resources:
+                 limits:
+                   google.com/tpu: 4
+               env:
+               - name: TPU_TYPE
+                 value: {TPU_TYPE}
+               - name: NUM_TPU_SLICES
+                 valueFrom:
+                   fieldRef:
+                     fieldPath: metadata.labels['jobset.sigs.k8s.io/replicatedjob-replicas']
+               - name: MEGASCALE_COORDINATOR_ADDRESS
+                 value: {NAME}-pwhd-0-0.{NAME}
+               - name: MEGASCALE_NUM_SLICES
+                 valueFrom:
+                   fieldRef:
+                     fieldPath: metadata.labels['jobset.sigs.k8s.io/replicatedjob-replicas']
+               - name: MEGASCALE_SLICE_ID
+                 valueFrom:
+                   fieldRef:
+                     fieldPath: metadata.labels['jobset.sigs.k8s.io/job-index']
+               args:
+               - --server_port=29001
+               - --resource_manager_address={NAME}-pwhd-0-0.{NAME}:29001
+               - --gcs_scratch_location={GCS_SCRATCH_LOCATION}
+               - --tpu_pinned_host_allocation_recycle=true
+               - --tpu_premapped_buffer_size={TPU_PREMAPPED_BUFFER_SIZE}
+               - --enforce_kernel_ipv6_support=false
+             serviceAccountName: default
+             dnsPolicy: ClusterFirstWithHostNet
+   successPolicy:
+     operator: All
+     targetReplicatedJobs:
+     - pwhd
+ """
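The client command is substituted into a YAML literal block scalar (`- |`), and a line indented less than the block's own indentation ends the scalar. A single-line `--command` is unaffected, but the continuation lines of a multi-line command would need the block's indentation prepended. The sketch below shows one way to do that before formatting. `indent_block` is a hypothetical helper, and the default width of 16 spaces assumes that the placeholder line itself is indented by that amount in the template.

```python
def indent_block(command: str, width: int = 16) -> str:
    """Indent every line after the first so the text stays inside a YAML block scalar.

    The first line is left untouched because the template is assumed to
    indent the placeholder itself.
    """
    lines = command.splitlines() or [""]
    prefix = " " * width
    first, rest = lines[0], lines[1:]
    # Blank lines stay empty; YAML accepts them inside a literal block.
    indented_rest = [prefix + line if line.strip() else "" for line in rest]
    return "\n".join([first, *indented_rest])


if __name__ == "__main__":
    script = "set -e\npip install jax\npython train.py"
    print(indent_block(script, width=4))
```

A generator could apply this to the command just before calling `YAML_TEMPLATE.format(...)`, leaving single-line commands such as `sleep infinity` unchanged.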