pathways-cli 0.1.0__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- pathways_cli-0.1.0.dist-info/METADATA +122 -0
- pathways_cli-0.1.0.dist-info/RECORD +9 -0
- pathways_cli-0.1.0.dist-info/WHEEL +4 -0
- pathways_cli-0.1.0.dist-info/entry_points.txt +2 -0
- pwy/__init__.py +3 -0
- pwy/cli.py +68 -0
- pwy/generator.py +122 -0
- pwy/kubernetes.py +18 -0
- pwy/templates.py +197 -0
@@ -0,0 +1,122 @@
Metadata-Version: 2.4
Name: pathways-cli
Version: 0.1.0
Summary: Pathways CLI to easily bring up pathways clusters.
Author-email: Sam Stoelinga <sammiestoel@gmail.com>
Requires-Python: >=3.12
Requires-Dist: click>=8.4.1
Requires-Dist: python-dotenv>=1.2.2
Description-Content-Type: text/markdown

# `pwy`: Standalone Pathways GKE Cluster CLI Tool

`pwy` is a lightweight, standalone Python CLI utility designed to generate, apply, and manage interactive Pathways workloads on Google Kubernetes Engine (GKE) using Kubernetes JobSets.

---

## Features

- **Automated TPU Topology Calculations**: Translates simple TPU resource types (`v6e-4`, `v6e-16`, etc.) into GKE topologies, VM counts, and instance settings.
- **Spot VM Support**: Injects GKE node selectors and tolerations so that workloads can run on lower-cost Spot VMs.
- **Colocated Python Support**: Simplifies distributed checkpointing (e.g. via Orbax) by automatically configuring colocated host CPU sidecars and proxy endpoints.
- **Interactive & Batch Execution**: Brings up Pathways servers with a `sleep infinity` client for interactive debugging, or runs a training script directly.
- **Dry-run Manifest Generation**: Preview and inspect the GKE JobSet manifest without applying it to the cluster.
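
To make the Spot VM feature above concrete, here is a minimal sketch of the scheduling fields it adds, written as plain Python dictionaries. The label key `cloud.google.com/gke-spot` and the toleration values are the ones that appear in `pwy/generator.py`; the helper name `spot_scheduling_fields` is hypothetical and is not part of `pwy`, which injects equivalent YAML text into its template instead.

```python
def spot_scheduling_fields(enable_spot: bool) -> dict:
    """Return pod-spec fragments for scheduling onto GKE Spot VMs.

    Hypothetical helper for illustration only; pwy itself builds the
    same fields as YAML strings.
    """
    if not enable_spot:
        # Without Spot support nothing extra is added to the pod spec.
        return {"nodeSelector": {}, "tolerations": []}
    return {
        # Restrict scheduling to Spot nodes.
        "nodeSelector": {"cloud.google.com/gke-spot": "true"},
        # Tolerate the taint carried by Spot node pools.
        "tolerations": [
            {
                "key": "cloud.google.com/gke-spot",
                "operator": "Equal",
                "value": "true",
                "effect": "NoSchedule",
            }
        ],
    }


if __name__ == "__main__":
    print(spot_scheduling_fields(True))
```

Returning empty containers when Spot is disabled lets a caller merge the result into a pod spec unconditionally.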

---

## Installation

This project uses [uv](https://github.com/astral-sh/uv) for Python package and dependency management.

To sync the environment and install `pwy`:

```bash
uv sync
```

---

## Usage

You can invoke `pwy` commands directly using `uv run`.

### 1. Provision / Preview a Cluster (`pwy up`)

Starts a Pathways JobSet, or prints the generated manifest when `--dry-run` is given.

```bash
uv run pwy up \
  --tpu-type v6e-16 \
  --gcs-scratch-location gs://my-bucket/pathways-staging \
  --num-slices 1 \
  --dry-run
```

#### Key Options:
- `--tpu-type`: **(Required)** TPU type (e.g., `v6e-4`, `v6e-8`, `v6e-16`, `v6e-32`, `v6e-64`).
- `--gcs-scratch-location`: **(Required)** GCS scratch path for Pathways synchronization.
- `--num-slices`: Number of TPU slices to run (default: `1`).
- `--jax-client-image`: Custom client container image (default: `python:3.12-slim`).
- `--command`: Run a custom training/eval script in the client container. If omitted, defaults to `sleep infinity` (interactive mode).
- `--enable-spot`: Add node selector and toleration settings for Spot VMs.
- `--colocated-python`: Enables colocated CPU Python sidecar/init containers on GKE workers and enables external proxy routing.
- `--dry-run`: Prints the generated YAML to stdout instead of calling `kubectl apply`.
- `--name`: Name of the Kubernetes JobSet resource (default: `pathways-interactive`).
- `--namespace`: Target Kubernetes namespace (default: `default`).
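
When `pwy up` is driven from a script, the options listed above can be assembled into an argument list first. The sketch below shows one way to do that; the helper name `build_pwy_up_args` is hypothetical, only flags documented above are used, and the example just prints the command rather than running it.

```python
import shlex


def build_pwy_up_args(
    tpu_type: str,
    gcs_scratch_location: str,
    num_slices: int = 1,
    enable_spot: bool = False,
    colocated_python: bool = False,
    dry_run: bool = True,
) -> list[str]:
    """Assemble an argument list for `uv run pwy up` (hypothetical helper)."""
    args = [
        "uv", "run", "pwy", "up",
        "--tpu-type", tpu_type,
        "--gcs-scratch-location", gcs_scratch_location,
        "--num-slices", str(num_slices),
    ]
    # Boolean options are plain flags, so they are appended only when enabled.
    if enable_spot:
        args.append("--enable-spot")
    if colocated_python:
        args.append("--colocated-python")
    if dry_run:
        args.append("--dry-run")
    return args


if __name__ == "__main__":
    argv = build_pwy_up_args(
        "v6e-16", "gs://my-bucket/pathways-staging", enable_spot=True
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

Defaulting `dry_run` to `True` in this sketch is a deliberate choice: a script has to opt in explicitly before anything is applied to a cluster.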

---

### 2. Tear Down a Cluster (`pwy down`)

Deletes the running Pathways JobSet.

```bash
uv run pwy down --name pathways-interactive --namespace default
```

---

### 3. Verification Example

Once the interactive cluster is running, you can verify execution by `exec`ing into the client container:

1. **Find the client pod name**:
   ```bash
   POD_NAME=$(kubectl get pods -l jobset.sigs.k8s.io/jobset-name=pathways-interactive -o jsonpath='{.items[?(@.metadata.labels.jobset\.sigs\.k8s\.io/replicatedjob-name=="pwhd")].metadata.name}')
   ```

2. **Install JAX and Pathways utils**:
   ```bash
   kubectl exec $POD_NAME -c client -- pip install jax pathwaysutils
   ```

3. **Run a Python snippet to initialize and list devices**:
   ```bash
   kubectl exec $POD_NAME -c client -- python3 -c "import pathwaysutils; pathwaysutils.initialize(); import jax; print(jax.devices())"
   ```

The command should print the available virtual TPU devices (e.g., coordinates and memory spaces of the allocated chips).

---

## TPU Type Mappings

`pwy` resolves the GKE topology, the number of VMs per slice, and the resource-manager (RM) instance type from the TPU type according to the following table:

| TPU Type | GKE Topology | VMs Per Slice | RM Instance Type |
| :--- | :--- | :--- | :--- |
| `v6e-4` | `2x2` | 1 | `tpuv6e:2x2` |
| `v6e-8` | `2x4` | 2 | `tpuv6e:2x4` |
| `v6e-16` | `4x4` | 4 | `tpuv6e:4x4` |
| `v6e-32` | `4x8` | 8 | `tpuv6e:4x8` |
| `v6e-64` | `8x8` | 16 | `tpuv6e:8x8` |
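
The rows above follow a regular pattern, which the sketch below reproduces. It assumes four TPU chips per VM, an assumption inferred from the table and consistent with the `google.com/tpu: 4` limit in the worker template. `pwy` itself uses a static lookup (`TPU_MAPPINGS`) rather than this arithmetic, and the helper name `describe_tpu_type` is hypothetical.

```python
CHIPS_PER_VM = 4  # assumption inferred from the table: v6e-4 fits on one VM


def describe_tpu_type(tpu_type: str, topology: str) -> dict:
    """Derive the VM count and RM instance type for one table row.

    Hypothetical helper for illustration; only checked against the
    v6e rows listed in the table.
    """
    generation, chip_count = tpu_type.rsplit("-", 1)
    chips = int(chip_count)

    # The topology dimensions must multiply out to the chip count.
    rows, cols = (int(dim) for dim in topology.split("x"))
    if rows * cols != chips:
        raise ValueError(f"{topology} does not describe {chips} chips")

    return {
        "vms_per_slice": chips // CHIPS_PER_VM,
        "rm_type": f"tpu{generation}:{topology}",
    }


if __name__ == "__main__":
    print(describe_tpu_type("v6e-16", "4x4"))
```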

---

## Running Tests

To execute the unit test suite:

```bash
uv run pytest
```
@@ -0,0 +1,9 @@
pwy/__init__.py,sha256=FjXvQgxYlSpOMIL8zRR5_XdoZtveGIRjiMiae3yO3aE,45
pwy/cli.py,sha256=jLyjjz68kdDiZ2Mqt63wm0Ty8XmKcc63Uatt9WiXsKE,3258
pwy/generator.py,sha256=ZSelmMDOAcwI-ooD458ILPh42BGs3bOESmFT--Hr84k,4915
pwy/kubernetes.py,sha256=yxxCHVaQVwvZkppgxRMSoZNRv8RC-V8lRKupmtD1GJM,646
pwy/templates.py,sha256=g6NlFzhbBzcMudLrYmJsF0aboCUulgYAUON0GSTbygk,7750
pathways_cli-0.1.0.dist-info/METADATA,sha256=DphQjqzCWOD1FkeWH6VqUE9JDwN-hJcT3_uA0896nBI,4288
pathways_cli-0.1.0.dist-info/WHEEL,sha256=QccIxa26bgl1E6uMy58deGWi-0aeIkkangHcxk2kWfw,87
pathways_cli-0.1.0.dist-info/entry_points.txt,sha256=2GvpxKRrXF7mDFvGxcsR_2ut2eZHzAlLkgAo_qi5NME,33
pathways_cli-0.1.0.dist-info/RECORD,,
pwy/__init__.py
ADDED
pwy/cli.py
ADDED
@@ -0,0 +1,68 @@
import sys
import click
from pwy.generator import generate_yaml
from pwy.kubernetes import apply_manifest, delete_jobset

@click.group()
def main():
    """pwy: Standalone Pathways GKE Cluster CLI Tool"""
    pass

@main.command()
@click.option("--tpu-type", required=True, help="TPU type (e.g., v6e-4, v6e-8, v6e-16, v6e-32, v6e-64)")
@click.option("--gcs-scratch-location", required=True, help="GCS scratch location (e.g., gs://bucket/staging)")
@click.option("--num-slices", default=1, type=int, show_default=True, help="Number of TPU slices")
@click.option("--jax-client-image", default="python:3.12-slim", show_default=True, help="Image for the JAX client container")
@click.option("--command", default=None, help="Command to run in the JAX client container (defaults to sleep infinity)")
@click.option("--enable-spot", is_flag=True, default=False, help="Enable spot VM scheduling")
@click.option("--colocated-python", is_flag=True, default=False, help="Enable colocated python sidecars")
@click.option("--dry-run", is_flag=True, default=False, help="Dry run: print generated YAML to stdout instead of applying it")
@click.option("--name", default="pathways-interactive", show_default=True, help="Name of the JobSet resource")
@click.option("--namespace", default="default", show_default=True, help="Kubernetes namespace")
def up(tpu_type, gcs_scratch_location, num_slices, jax_client_image, command, enable_spot, colocated_python, dry_run, name, namespace):
    """Starts the Pathways cluster or dry-runs the configuration."""
    try:
        yaml_content = generate_yaml(
            name=name,
            namespace=namespace,
            tpu_type=tpu_type,
            gcs_scratch_location=gcs_scratch_location,
            num_slices=num_slices,
            jax_client_image=jax_client_image,
            command=command,
            enable_spot=enable_spot,
            colocated_python=colocated_python,
        )
    except ValueError as e:
        click.secho(f"Error: {e}", fg="red", err=True)
        sys.exit(1)

    if dry_run:
        click.echo(yaml_content)
        return

    click.echo(f"Applying Pathways JobSet manifest for '{name}' in namespace '{namespace}'...")
    process = apply_manifest(yaml_content)
    if process.returncode != 0:
        click.secho("Failed to apply JobSet manifest.", fg="red", err=True)
        click.echo(process.stderr.decode("utf-8"), err=True)
        sys.exit(process.returncode)

    click.secho(f"Successfully applied JobSet '{name}'!", fg="green")

@main.command()
@click.option("--name", default="pathways-interactive", show_default=True, help="Name of the JobSet resource")
@click.option("--namespace", default="default", show_default=True, help="Kubernetes namespace")
def down(name, namespace):
    """Tears down the Pathways cluster JobSet resource."""
    click.echo(f"Deleting Pathways JobSet '{name}' in namespace '{namespace}'...")
    process = delete_jobset(name, namespace)
    if process.returncode != 0:
        click.secho("Failed to delete JobSet.", fg="red", err=True)
        click.echo(process.stderr.decode("utf-8"), err=True)
        sys.exit(process.returncode)

    click.secho(f"Successfully deleted JobSet '{name}'!", fg="green")

if __name__ == "__main__":
    main()
pwy/generator.py
ADDED
@@ -0,0 +1,122 @@
from pwy.templates import YAML_TEMPLATE

TPU_MAPPINGS = {
    "v6e-4": {"topology": "2x2", "vms_per_slice": 1, "rm_type": "tpuv6e:2x2"},
    "v6e-8": {"topology": "2x4", "vms_per_slice": 2, "rm_type": "tpuv6e:2x4"},
    "v6e-16": {"topology": "4x4", "vms_per_slice": 4, "rm_type": "tpuv6e:4x4"},
    "v6e-32": {"topology": "4x8", "vms_per_slice": 8, "rm_type": "tpuv6e:4x8"},
    "v6e-64": {"topology": "8x8", "vms_per_slice": 16, "rm_type": "tpuv6e:8x8"},
}

def get_colocated_python_image(client_image: str) -> str:
    """Derive the colocated-python image from the client image's repository and tag.

    Falls back to the default Pathways image when the client image has no
    repository path and tag that can be reused.
    """
    if "/" in client_image and ":" in client_image:
        try:
            path, tag = client_image.rsplit(":", 1)
            repo, _ = path.rsplit("/", 1)
            return f"{repo}/colocated-python:{tag}"
        except Exception:
            pass
    return "us-docker.pkg.dev/cloud-tpu-v2-images/pathways/colocated-python:jax-0.10.0"

def generate_yaml(
    name: str,
    namespace: str,
    tpu_type: str,
    gcs_scratch_location: str,
    num_slices: int = 1,
    jax_client_image: str = "python:3.12-slim",
    command: str | None = None,
    enable_spot: bool = False,
    colocated_python: bool = False,
) -> str:
    """Render the JobSet manifest; raises ValueError for an unsupported TPU type."""
    if tpu_type not in TPU_MAPPINGS:
        raise ValueError(
            f"Unsupported TPU type: {tpu_type}. Supported types: {list(TPU_MAPPINGS.keys())}"
        )

    mapping = TPU_MAPPINGS[tpu_type]
    gke_topology = mapping["topology"]
    vms_per_slice = mapping["vms_per_slice"]
    rm_instance_type = mapping["rm_type"]

    # Format client execution command
    if not command:
        client_command = "sleep infinity"
    else:
        client_command = command

    # Format Spot VM Node Selector and Tolerations
    if enable_spot:
        spot_toleration_head = (
            '            - key: "cloud.google.com/gke-spot"\n'
            '              operator: "Equal"\n'
            '              value: "true"\n'
            '              effect: "NoSchedule"'
        )
        spot_node_selector_worker = '              cloud.google.com/gke-spot: "true"'
        spot_toleration_worker = (
            '            - key: "cloud.google.com/gke-spot"\n'
            '              operator: "Equal"\n'
            '              value: "true"\n'
            '              effect: "NoSchedule"'
        )
    else:
        spot_toleration_head = ""
        spot_node_selector_worker = ""
        spot_toleration_worker = ""

    # Format colocated python options
    if colocated_python:
        proxy_sidecar_arg = "\n              - --sidecar_name=external"
        tpu_premapped_buffer_size = 34359738368  # 32 GiB
        colocated_img = get_colocated_python_image(jax_client_image)
        worker_init_containers = (
            "            initContainers:\n"
            "            - name: colocated-python\n"
            f"              image: {colocated_img}\n"
            "              imagePullPolicy: Always\n"
            "              restartPolicy: Always\n"
            "              ports:\n"
            "              - containerPort: 50051\n"
            "                protocol: TCP\n"
            "              env:\n"
            "              - name: CLOUD_PATHWAYS_SIDECAR_SHM_DIRECTORY\n"
            "                value: /tmp/ifrt_proxy\n"
            "              - name: GRPC_SERVER_ADDRESS\n"
            "                value: 0.0.0.0:50051\n"
            "              volumeMounts:\n"
            "              - name: shared-memory\n"
            "                mountPath: /tmp/ifrt_proxy"
        )
    else:
        proxy_sidecar_arg = ""
        tpu_premapped_buffer_size = 274877906944  # 256 GiB
        worker_init_containers = ""

    # Interpolate variables in the template
    yaml_content = YAML_TEMPLATE.format(
        NAME=name,
        NAMESPACE=namespace,
        CLIENT_IMAGE=jax_client_image,
        CLIENT_EXECUTION_COMMAND=client_command,
        TPU_TYPE=tpu_type,
        NUM_SLICES=num_slices,
        RM_INSTANCE_TYPE=rm_instance_type,
        GCS_SCRATCH_LOCATION=gcs_scratch_location,
        GKE_TOPOLOGY=gke_topology,
        VMS_PER_SLICE=vms_per_slice,
        SPOT_TOLERATION_HEAD=spot_toleration_head,
        SPOT_NODE_SELECTOR_WORKER=spot_node_selector_worker,
        SPOT_TOLERATION_WORKER=spot_toleration_worker,
        PROXY_SIDECAR_ARG=proxy_sidecar_arg,
        TPU_PREMAPPED_BUFFER_SIZE=tpu_premapped_buffer_size,
        WORKER_INIT_CONTAINERS=worker_init_containers,
    )

    # Drop whitespace-only lines left behind by empty optional placeholders;
    # completely empty lines are kept.
    lines = []
    for line in yaml_content.splitlines():
        if line.strip() or line == "":
            lines.append(line)
    return "\n".join(lines) + "\n"
pwy/kubernetes.py
ADDED
@@ -0,0 +1,18 @@
import subprocess

def apply_manifest(yaml_content: str) -> subprocess.CompletedProcess:
    """Applies the YAML manifest using kubectl apply -f -."""
    process = subprocess.run(
        ["kubectl", "apply", "-f", "-"],
        input=yaml_content.encode("utf-8"),
        capture_output=True,
    )
    return process

def delete_jobset(name: str, namespace: str) -> subprocess.CompletedProcess:
    """Deletes the JobSet using kubectl delete jobset <name> --namespace=<namespace>."""
    process = subprocess.run(
        ["kubectl", "delete", "jobset", name, f"--namespace={namespace}"],
        capture_output=True,
    )
    return process
pwy/templates.py
ADDED
@@ -0,0 +1,197 @@
YAML_TEMPLATE = """apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: {NAME}
  namespace: {NAMESPACE}
spec:
  failurePolicy:
    maxRestarts: 0
    restartStrategy: BlockingRecreate
  replicatedJobs:
    # -------------------------------------------------------------------------
    # 1. Pathways Head (Client Pod)
    # -------------------------------------------------------------------------
  - name: pwhd
    replicas: 1
    template:
      spec:
        parallelism: 1
        completions: 1
        backoffLimit: 32
        template:
          metadata:
            annotations:
              cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
          spec:
            terminationGracePeriodSeconds: 60
            restartPolicy: Never
            hostAliases:
            - ip: 169.254.169.254
              hostnames:
              - metadata
              - metadata.google.internal
            tolerations:
            - key: google.com/tpu
              operator: Equal
              value: "present"
              effect: NoSchedule
{SPOT_TOLERATION_HEAD}
            containers:
            - name: client
              image: {CLIENT_IMAGE}
              command:
              - bash
              - -c
              - |
                {CLIENT_EXECUTION_COMMAND}
              resources:
                requests:
                  cpu: "1000m"
                  memory: "16Gi"
                limits:
                  cpu: "1000m"
                  memory: "16Gi"
              env:
              - name: TPU_TYPE
                value: {TPU_TYPE}
              - name: NUM_TPU_SLICES
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['jobset.sigs.k8s.io/replicatedjob-replicas']
              - name: JAX_BACKEND_TARGET
                value: grpc://localhost:29000
              - name: XCLOUD_ENVIRONMENT
                value: GCP
              - name: JAX_PLATFORMS
                value: proxy
              - name: ENABLE_PATHWAYS_PERSISTENCE
                value: "1"
              - name: TPU_SKIP_MDS_QUERY
                value: "true"
              - name: PYTHONUNBUFFERED
                value: "1"
              - name: TEST_UNDECLARED_OUTPUTS_DIR
                value: "true"
              - name: IFRT_PROXY_LARGE_TRANSFER_THRESHOLD
                value: "1"
              - name: IFRT_PROXY_LARGE_TRANSFER_OPTIMIZATION_DIRECTORY
                value: /tmp/ifrt_proxy
              volumeMounts:
              - name: shared-memory
                mountPath: /tmp/ifrt_proxy
              imagePullPolicy: Always
            initContainers:
            - name: pathways-proxy
              image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/proxy_server:jax-0.10.0
              restartPolicy: Always
              ports:
              - containerPort: 29000
              env:
              - name: IFRT_PROXY_USE_INSECURE_GRPC_CREDENTIALS
                value: "true"
              - name: IFRT_PROXY_LARGE_TRANSFER_OPTIMIZATION_DIRECTORY
                value: /tmp/ifrt_proxy
              args:
              - --resource_manager_address=localhost:29001
              - --server_port=29000{PROXY_SIDECAR_ARG}
              volumeMounts:
              - name: shared-memory
                mountPath: /tmp/ifrt_proxy
            - name: pathways-rm
              image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server:jax-0.10.0
              restartPolicy: Always
              env:
              - name: TPU_SKIP_MDS_QUERY
                value: "true"
              args:
              - --server_port=29001
              - --node_type=resource_manager
              - --instance_count={NUM_SLICES}
              - --instance_type={RM_INSTANCE_TYPE}
              - --gcs_scratch_location={GCS_SCRATCH_LOCATION}
              - --enforce_kernel_ipv6_support=false
            volumes:
            - name: shared-memory
              emptyDir:
                medium: Memory
            serviceAccountName: default
            dnsPolicy: ClusterFirstWithHostNet

    # -------------------------------------------------------------------------
    # 2. Pathways Workers (TPU Pods)
    # -------------------------------------------------------------------------
  - name: pwwk
    replicas: {NUM_SLICES}
    template:
      metadata:
        annotations:
          alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
      spec:
        parallelism: {VMS_PER_SLICE}
        completions: {VMS_PER_SLICE}
        backoffLimit: 32
        template:
          spec:
            terminationGracePeriodSeconds: 60
            hostAliases:
            - ip: 169.254.169.254
              hostnames:
              - metadata
              - metadata.google.internal
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
              cloud.google.com/gke-tpu-topology: {GKE_TOPOLOGY}
{SPOT_NODE_SELECTOR_WORKER}
            tolerations:
            - key: google.com/tpu
              operator: Equal
              value: "present"
              effect: NoSchedule
{SPOT_TOLERATION_WORKER}
{WORKER_INIT_CONTAINERS}
            containers:
            - name: worker
              image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server:jax-0.10.0
              imagePullPolicy: Always
              ports:
              - containerPort: 8471
              - containerPort: 8080
              - containerPort: 8431
              - containerPort: 9000
              - containerPort: 29001
              securityContext:
                privileged: true
              resources:
                limits:
                  google.com/tpu: 4
              env:
              - name: TPU_TYPE
                value: {TPU_TYPE}
              - name: NUM_TPU_SLICES
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['jobset.sigs.k8s.io/replicatedjob-replicas']
              - name: MEGASCALE_COORDINATOR_ADDRESS
                value: {NAME}-pwhd-0-0.{NAME}
              - name: MEGASCALE_NUM_SLICES
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['jobset.sigs.k8s.io/replicatedjob-replicas']
              - name: MEGASCALE_SLICE_ID
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.labels['jobset.sigs.k8s.io/job-index']
              args:
              - --server_port=29001
              - --resource_manager_address={NAME}-pwhd-0-0.{NAME}:29001
              - --gcs_scratch_location={GCS_SCRATCH_LOCATION}
              - --tpu_pinned_host_allocation_recycle=true
              - --tpu_premapped_buffer_size={TPU_PREMAPPED_BUFFER_SIZE}
              - --enforce_kernel_ipv6_support=false
            serviceAccountName: default
            dnsPolicy: ClusterFirstWithHostNet
  successPolicy:
    operator: All
    targetReplicatedJobs:
    - pwhd
"""