pathways-cli 0.1.0__tar.gz
- pathways_cli-0.1.0/.gitignore +16 -0
- pathways_cli-0.1.0/.python-version +1 -0
- pathways_cli-0.1.0/GEMINI.md +52 -0
- pathways_cli-0.1.0/PKG-INFO +122 -0
- pathways_cli-0.1.0/README.md +112 -0
- pathways_cli-0.1.0/plan.md +352 -0
- pathways_cli-0.1.0/pyproject.toml +33 -0
- pathways_cli-0.1.0/src/pwy/__init__.py +3 -0
- pathways_cli-0.1.0/src/pwy/cli.py +68 -0
- pathways_cli-0.1.0/src/pwy/generator.py +122 -0
- pathways_cli-0.1.0/src/pwy/kubernetes.py +18 -0
- pathways_cli-0.1.0/src/pwy/templates.py +197 -0
- pathways_cli-0.1.0/tests/__init__.py +1 -0
- pathways_cli-0.1.0/tests/test_cli.py +97 -0
- pathways_cli-0.1.0/tests/test_e2e.py +105 -0
- pathways_cli-0.1.0/tests/test_generator.py +114 -0
- pathways_cli-0.1.0/uv.lock +108 -0
@@ -0,0 +1 @@
3.12
@@ -0,0 +1,52 @@
# Gemini Agent & Developer Guide: `pathways-cli`

This document details the codebase design, architecture, key lessons, and integration verification guides for the `pwy` CLI tool, serving as a developer-facing companion to the user-facing `README.md`.

---

## 1. Codebase Architecture

The project uses a standard `src/` package layout, with PEP 621 metadata in `pyproject.toml`:

```
/Users/stoelinga/workspace/pathways-cli/
├── pyproject.toml        # Package configurations, CLI scripts, and Pytest options
├── .gitignore            # Excludes local environments, caches, and secrets
├── README.md             # User documentation and example verification steps
├── GEMINI.md             # Codebase design and developer/agent context (this file)
├── src/
│   └── pwy/
│       ├── __init__.py   # Exposes cli entry points
│       ├── cli.py        # click CLI definition: up, down commands
│       ├── generator.py  # Topology math, spot VM toggles, colocated python configurations
│       ├── templates.py  # Complete GKE JobSet multi-line YAML manifest template
│       └── kubernetes.py # Wrapper invoking kubectl subprocesses
└── tests/
    ├── __init__.py
    ├── test_cli.py       # CLI option validations & mocks
    ├── test_generator.py # Mappings and string-formatting unit tests
    └── test_e2e.py       # Real GKE cluster integration execution verifying JAX setup
```

---

## 2. Testing Workflows

Verify changes using one of the two testing scopes:

### 1. Unit Tests (Mocked)
Tests calculations and YAML generation without cluster access.
```bash
uv run pytest tests/test_generator.py tests/test_cli.py
```

### 2. End-to-End Integration Tests (Active Cluster)
Runs actual deployments on a running TPU nodepool, installs JAX, executes verification scripts, and tears the setup down.
1. Configure your GCS path in a local `.env` file:
   ```env
   PWY_E2E_GCS_SCRATCH_LOCATION=gs://my-staging-bucket/pathways
   ```
2. Run pytest targeting the `e2e` mark:
   ```bash
   uv run pytest tests/test_e2e.py -m e2e -s
   ```
@@ -0,0 +1,122 @@
Metadata-Version: 2.4
Name: pathways-cli
Version: 0.1.0
Summary: Pathways CLI to easily bring up pathways clusters.
Author-email: Sam Stoelinga <sammiestoel@gmail.com>
Requires-Python: >=3.12
Requires-Dist: click>=8.4.1
Requires-Dist: python-dotenv>=1.2.2
Description-Content-Type: text/markdown

# `pwy`: Standalone Pathways GKE Cluster CLI Tool

`pwy` is a lightweight, standalone Python CLI utility designed to generate, apply, and manage interactive Pathways workloads on Google Kubernetes Engine (GKE) using Kubernetes JobSets.

---

## Features

- **Automated TPU Topology Calculations**: Translates simple TPU resource types (`v6e-4`, `v6e-16`, etc.) into GKE topologies, VM counts, and instance settings.
- **Spot VM Support**: Dynamically injects GKE node selectors and tolerations for running workloads on cost-effective Spot VMs.
- **Colocated Python Support**: Simplifies distributed checkpointing (e.g. via Orbax) by configuring and enabling colocated host CPU sidecars and proxy endpoints automatically.
- **Interactive & Batch Execution**: Supports spinning up Pathways servers with infinite sleep drivers for interactive debugging, or executing training scripts directly.
- **Dry-run Manifest Generation**: Preview and inspect the GKE JobSet manifest without applying it to the cluster.

---

## Installation

This project uses [uv](https://github.com/astral-sh/uv) for fast, modern Python package and dependency management.

To sync the environment and install `pwy`:

```bash
uv sync
```

---

## Usage

You can invoke `pwy` commands directly using `uv run`:

### 1. Provision / Preview a Cluster (`pwy up`)

Starts a Pathways JobSet or dry-runs the configuration.

```bash
uv run pwy up \
  --tpu-type v6e-16 \
  --gcs-scratch-location gs://my-bucket/pathways-staging \
  --num-slices 1 \
  --dry-run
```

#### Key Options:
- `--tpu-type`: **(Required)** TPU type (e.g., `v6e-4`, `v6e-8`, `v6e-16`, `v6e-32`, `v6e-64`).
- `--gcs-scratch-location`: **(Required)** GCS scratch path for Pathways synchronization.
- `--num-slices`: Number of TPU slices to run (default: `1`).
- `--jax-client-image`: Custom client container image (default: `python:3.12-slim`).
- `--command`: Run a custom training/eval script in the client container. If omitted, defaults to `sleep infinity` (interactive mode).
- `--enable-spot`: Adds node selector and toleration settings for Spot VMs.
- `--colocated-python`: Enables colocated CPU Python sidecar/init containers on GKE workers and enables external proxy routing.
- `--dry-run`: Prints the generated YAML to stdout instead of calling `kubectl apply`.
- `--name`: Name of the Kubernetes JobSet resource (default: `pathways-interactive`).
- `--namespace`: Target Kubernetes namespace (default: `default`).

---

### 2. Teardown a Cluster (`pwy down`)

Deletes the running Pathways JobSet.

```bash
uv run pwy down --name pathways-interactive --namespace default
```

---

### 3. Verification Example

Once the interactive cluster is running, you can verify execution by `exec`ing into the client container:

1. **Find the client pod name**:
   ```bash
   POD_NAME=$(kubectl get pods -l jobset.sigs.k8s.io/jobset-name=pathways-interactive,jobset.sigs.k8s.io/replicatedjob-name=pwhd -o jsonpath='{.items[0].metadata.name}')
   ```

2. **Install JAX and Pathways utils**:
   ```bash
   kubectl exec $POD_NAME -c client -- pip install jax pathwaysutils
   ```

3. **Run a Python snippet to initialize and list devices**:
   ```bash
   kubectl exec $POD_NAME -c client -- python3 -c "import pathwaysutils; pathwaysutils.initialize(); import jax; print(jax.devices())"
   ```

The command output should print the available virtual TPU devices (e.g., coordinates and memory spaces of the allocated chips).

---

## TPU Type Mappings

`pwy` handles all resource-limit math and topologies automatically according to the following matrix:

| TPU Type | GKE Topology | VMs Per Slice | RM Instance Type |
| :--- | :--- | :--- | :--- |
| `v6e-4` | `2x2` | 1 | `tpuv6e:2x2` |
| `v6e-8` | `2x4` | 2 | `tpuv6e:2x4` |
| `v6e-16` | `4x4` | 4 | `tpuv6e:4x4` |
| `v6e-32` | `4x8` | 8 | `tpuv6e:4x8` |
| `v6e-64` | `8x8` | 16 | `tpuv6e:8x8` |

---

## Running Tests

To execute the unit test suite:

```bash
uv run pytest
```
@@ -0,0 +1,112 @@
# `pwy`: Standalone Pathways GKE Cluster CLI Tool

`pwy` is a lightweight, standalone Python CLI utility designed to generate, apply, and manage interactive Pathways workloads on Google Kubernetes Engine (GKE) using Kubernetes JobSets.

---

## Features

- **Automated TPU Topology Calculations**: Translates simple TPU resource types (`v6e-4`, `v6e-16`, etc.) into GKE topologies, VM counts, and instance settings.
- **Spot VM Support**: Dynamically injects GKE node selectors and tolerations for running workloads on cost-effective Spot VMs.
- **Colocated Python Support**: Simplifies distributed checkpointing (e.g. via Orbax) by configuring and enabling colocated host CPU sidecars and proxy endpoints automatically.
- **Interactive & Batch Execution**: Supports spinning up Pathways servers with infinite sleep drivers for interactive debugging, or executing training scripts directly.
- **Dry-run Manifest Generation**: Preview and inspect the GKE JobSet manifest without applying it to the cluster.

---

## Installation

This project uses [uv](https://github.com/astral-sh/uv) for fast, modern Python package and dependency management.

To sync the environment and install `pwy`:

```bash
uv sync
```

---

## Usage

You can invoke `pwy` commands directly using `uv run`:

### 1. Provision / Preview a Cluster (`pwy up`)

Starts a Pathways JobSet or dry-runs the configuration.

```bash
uv run pwy up \
  --tpu-type v6e-16 \
  --gcs-scratch-location gs://my-bucket/pathways-staging \
  --num-slices 1 \
  --dry-run
```

#### Key Options:
- `--tpu-type`: **(Required)** TPU type (e.g., `v6e-4`, `v6e-8`, `v6e-16`, `v6e-32`, `v6e-64`).
- `--gcs-scratch-location`: **(Required)** GCS scratch path for Pathways synchronization.
- `--num-slices`: Number of TPU slices to run (default: `1`).
- `--jax-client-image`: Custom client container image (default: `python:3.12-slim`).
- `--command`: Run a custom training/eval script in the client container. If omitted, defaults to `sleep infinity` (interactive mode).
- `--enable-spot`: Adds node selector and toleration settings for Spot VMs.
- `--colocated-python`: Enables colocated CPU Python sidecar/init containers on GKE workers and enables external proxy routing.
- `--dry-run`: Prints the generated YAML to stdout instead of calling `kubectl apply`.
- `--name`: Name of the Kubernetes JobSet resource (default: `pathways-interactive`).
- `--namespace`: Target Kubernetes namespace (default: `default`).

---

### 2. Teardown a Cluster (`pwy down`)

Deletes the running Pathways JobSet.

```bash
uv run pwy down --name pathways-interactive --namespace default
```

---

### 3. Verification Example

Once the interactive cluster is running, you can verify execution by `exec`ing into the client container:

1. **Find the client pod name**:
   ```bash
   POD_NAME=$(kubectl get pods -l jobset.sigs.k8s.io/jobset-name=pathways-interactive,jobset.sigs.k8s.io/replicatedjob-name=pwhd -o jsonpath='{.items[0].metadata.name}')
   ```

2. **Install JAX and Pathways utils**:
   ```bash
   kubectl exec $POD_NAME -c client -- pip install jax pathwaysutils
   ```

3. **Run a Python snippet to initialize and list devices**:
   ```bash
   kubectl exec $POD_NAME -c client -- python3 -c "import pathwaysutils; pathwaysutils.initialize(); import jax; print(jax.devices())"
   ```

The command output should print the available virtual TPU devices (e.g., coordinates and memory spaces of the allocated chips).

---

## TPU Type Mappings

`pwy` handles all resource-limit math and topologies automatically according to the following matrix:

| TPU Type | GKE Topology | VMs Per Slice | RM Instance Type |
| :--- | :--- | :--- | :--- |
| `v6e-4` | `2x2` | 1 | `tpuv6e:2x2` |
| `v6e-8` | `2x4` | 2 | `tpuv6e:2x4` |
| `v6e-16` | `4x4` | 4 | `tpuv6e:4x4` |
| `v6e-32` | `4x8` | 8 | `tpuv6e:4x8` |
| `v6e-64` | `8x8` | 16 | `tpuv6e:8x8` |
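
The rows of this matrix follow a regular pattern, which the sketch below reproduces as a reading aid. It assumes four TPU chips per VM for every listed v6e type, so the VM count is the product of the topology dimensions divided by four. The helper name `describe_tpu_type` and the constant `CHIPS_PER_VM` are hypothetical and are not taken from the `pwy` source.

```python
from math import prod

CHIPS_PER_VM = 4  # assumption: each v6e VM in the table exposes 4 TPU chips

# Topologies copied from the matrix above.
TOPOLOGIES = {
    "v6e-4": "2x2",
    "v6e-8": "2x4",
    "v6e-16": "4x4",
    "v6e-32": "4x8",
    "v6e-64": "8x8",
}


def describe_tpu_type(tpu_type: str) -> dict:
    """Hypothetical helper: rebuild one table row from the TPU type string."""
    topology = TOPOLOGIES[tpu_type]  # raises KeyError for unsupported types
    chips = prod(int(dim) for dim in topology.split("x"))
    return {
        "topology": topology,
        "vms_per_slice": chips // CHIPS_PER_VM,
        "rm_instance_type": f"tpuv6e:{topology}",
    }
```

For example, `describe_tpu_type("v6e-32")` multiplies 4 by 8 to get 32 chips and divides by four, giving 8 VMs per slice, which matches the table.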

---

## Running Tests

To execute the unit test suite:

```bash
uv run pytest
```
@@ -0,0 +1,352 @@
# Implementation Plan: Standalone Pathways Cluster CLI Tool (`pwy`)

This document outlines the final design and detailed implementation specifications for a standalone Python CLI tool to generate, apply, and manage interactive Pathways GKE cluster manifests.

---

## 1. CLI Commands & Arguments

The CLI binary/entry point will be named `pwy`.

### Commands

#### 1. `pwy up`
Starts the cluster or dry-runs the configuration.

```bash
pwy up \
  --tpu-type=v6e-4 \
  --gcs-scratch-location=gs://my-bucket/staging \
  [--num-slices=1] \
  [--jax-client-image=python:3.12-slim] \
  [--command="python my_script.py"] \
  [--enable-spot] \
  [--colocated-python] \
  [--dry-run] \
  [--name=pathways-interactive] \
  [--namespace=default]
```

* **Behavior**:
    * Calculates cluster configuration based on `--tpu-type` and `--num-slices`.
    * Generates the JobSet YAML.
    * If `--dry-run` is set: Prints the generated YAML to stdout and exits.
    * Otherwise: Pipes the YAML directly to `kubectl apply -f -`.
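
The dry-run branch described above could be sketched roughly as follows. This is an illustration under assumptions, not the code shipped in `pwy`: the function name `up` and its `run` and `out` parameters are hypothetical, and are injectable only so the sketch can be exercised without a cluster.

```python
import subprocess
import sys


def up(yaml_content: str, dry_run: bool, run=subprocess.run, out=sys.stdout) -> None:
    """Hypothetical helper mirroring the `pwy up` behavior listed above."""
    if dry_run:
        # Dry run: print the manifest and never contact the cluster.
        out.write(yaml_content)
        return
    # `-f -` tells kubectl to read the manifest from stdin.
    run(["kubectl", "apply", "-f", "-"], input=yaml_content.encode(), check=True)
```

Passing `check=True` makes a failing `kubectl` call raise `subprocess.CalledProcessError` instead of being silently ignored.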

#### 2. `pwy down`
Tears down the cluster.

```bash
pwy down [--name=pathways-interactive] [--namespace=default]
```
* **Behavior**: Runs `kubectl delete jobset <name> --namespace=<namespace>`.
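
A minimal sketch of that teardown call is shown below. The function name `down` and the injectable `run` parameter are assumptions made for illustration; only the `kubectl delete jobset` command line and the two defaults come from the plan.

```python
import subprocess


def down(
    name: str = "pathways-interactive",
    namespace: str = "default",
    run=subprocess.run,
) -> list[str]:
    """Hypothetical helper: delete the JobSet created by `pwy up`."""
    argv = ["kubectl", "delete", "jobset", name, f"--namespace={namespace}"]
    # Passing a list (not a shell string) avoids quoting issues in the name.
    run(argv, check=True)
    return argv
```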

---

## 2. Automated TPU Mappings & Configuration Math

The tool automatically maps the user-provided `--tpu-type` to GKE resources, topologies, and arguments.

### Mappings Database

| TPU Type | GKE Topology | VMs Per Slice (`vms_per_slice`) | RM Instance Type (`rm_instance_type`) |
| :--- | :--- | :--- | :--- |
| `v6e-4` | `2x2` | 1 | `tpuv6e:2x2` |
| `v6e-8` | `2x4` | 2 | `tpuv6e:2x4` |
| `v6e-16` | `4x4` | 4 | `tpuv6e:4x4` |
| `v6e-32` | `4x8` | 8 | `tpuv6e:4x8` |
| `v6e-64` | `8x8` | 16 | `tpuv6e:8x8` |

### Configuration Math

For any given run with `tpu-type` and `num-slices`:

* **Pathways Head (`pwhd` replicated job)**:
    * `replicas` is always `1`.
* **Pathways Worker (`pwwk` replicated job)**:
    * `replicas` (number of slices) = `--num-slices`
    * `spec.parallelism` (VMs per slice) = `vms_per_slice`
    * `spec.completions` (VMs per slice) = `vms_per_slice`
    * `resources.limits["google.com/tpu"]` = `4` (always 4 for v6e)
    * `nodeSelector["cloud.google.com/gke-tpu-topology"]` = `GKE Topology`
    * `nodeSelector["cloud.google.com/gke-tpu-accelerator"]` = `tpu-v6e-slice`
* **Resource Manager container (`pathways-rm` sidecar)**:
    * `--instance_count` = `--num-slices`
    * `--instance_type` = `rm_instance_type`
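
The rules above can be worked through in a few lines of Python. The sketch below is illustrative only: `jobset_settings` and the shape of its return value are hypothetical, and just two rows of the mappings table are copied in to keep it short.

```python
# Two rows copied from the mappings table; the other types follow the same shape.
TPU_MAPPINGS = {
    "v6e-4": {"topology": "2x2", "vms_per_slice": 1, "rm_type": "tpuv6e:2x2"},
    "v6e-16": {"topology": "4x4", "vms_per_slice": 4, "rm_type": "tpuv6e:4x4"},
}

CHIPS_PER_VM = 4  # the plan fixes the google.com/tpu limit at 4 for v6e


def jobset_settings(tpu_type: str, num_slices: int = 1) -> dict:
    """Hypothetical helper: derive the JobSet fields listed above."""
    mapping = TPU_MAPPINGS[tpu_type]
    vms = mapping["vms_per_slice"]
    return {
        "pwhd": {"replicas": 1},
        "pwwk": {
            "replicas": num_slices,
            "parallelism": vms,
            "completions": vms,
            "tpu_limit": CHIPS_PER_VM,
            "node_selector": {
                "cloud.google.com/gke-tpu-topology": mapping["topology"],
                "cloud.google.com/gke-tpu-accelerator": "tpu-v6e-slice",
            },
        },
        "pathways_rm_args": [
            f"--instance_count={num_slices}",
            f"--instance_type={mapping['rm_type']}",
        ],
    }
```

For `v6e-16` with two slices this gives 2 worker replicas of 4 VMs each, so 8 worker pods and 32 TPU chips in total.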

---

## 3. Client Command & Image Execution Logic

The client container `command` field is generated dynamically based on the `--jax-client-image`, `--command`, and `--colocated-python` flags:

1. **JAX Client Image**:
    * Uses `--jax-client-image` (defaulting to `python:3.12-slim`).
2. **Command Executed**:
    * **If `--command` is NOT provided**:
        ```bash
        bash -c "sleep infinity"
        ```
    * **If `--command` IS provided** (e.g. `--command="python training.py"`):
        The tool boots up the environment, initializes pathways, and then executes the custom command directly:
        ```bash
        bash -c "python training.py"
        ```

---

## 4. Full Manifest Reference Template

Below is the complete reference JobSet YAML template populated with default/baseline variables for a `v6e-4` single-slice run. The CLI generator code should output a structure identical to this:

```yaml
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: {NAME}
  namespace: {NAMESPACE}
spec:
  failurePolicy:
    maxRestarts: 0
    restartStrategy: BlockingRecreate
  replicatedJobs:
    # -------------------------------------------------------------------------
    # 1. Pathways Head (Client Pod)
    # -------------------------------------------------------------------------
    - name: pwhd
      replicas: 1
      template:
        spec:
          parallelism: 1
          completions: 1
          backoffLimit: 32
          template:
            metadata:
              annotations:
                cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
            spec:
              terminationGracePeriodSeconds: 60
              restartPolicy: Never
              hostAliases:
                - ip: 169.254.169.254
                  hostnames:
                    - metadata
                    - metadata.google.internal
              tolerations:
                - key: google.com/tpu
                  operator: Equal
                  value: "present"
                  effect: NoSchedule
                # Spot toleration only added if --enable-spot is True
                {SPOT_TOLERATION_HEAD}
              containers:
                - name: client
                  image: {CLIENT_IMAGE}
                  command:
                    - bash
                    - -c
                    - |
                      {CLIENT_EXECUTION_COMMAND}
                  resources:
                    requests:
                      cpu: "1000m"
                      memory: "16Gi"
                    limits:
                      cpu: "1000m"
                      memory: "16Gi"
                  env:
                    - name: TPU_TYPE
                      value: {TPU_TYPE}
                    - name: NUM_TPU_SLICES
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.labels['jobset.sigs.k8s.io/replicatedjob-replicas']
                    - name: JAX_BACKEND_TARGET
                      value: grpc://localhost:29000
                    - name: XCLOUD_ENVIRONMENT
                      value: GCP
                    - name: JAX_PLATFORMS
                      value: proxy
                    - name: ENABLE_PATHWAYS_PERSISTENCE
                      value: "1"
                    - name: TPU_SKIP_MDS_QUERY
                      value: "true"
                    - name: PYTHONUNBUFFERED
                      value: "1"
                    - name: TEST_UNDECLARED_OUTPUTS_DIR
                      value: "true"
                    - name: IFRT_PROXY_LARGE_TRANSFER_THRESHOLD
                      value: "1"
                    - name: IFRT_PROXY_LARGE_TRANSFER_OPTIMIZATION_DIRECTORY
                      value: /tmp/ifrt_proxy
                  volumeMounts:
                    - name: shared-memory
                      mountPath: /tmp/ifrt_proxy
                  imagePullPolicy: Always
              initContainers:
                - name: pathways-proxy
                  image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/proxy_server:jax-0.9.2
                  restartPolicy: Always
                  ports:
                    - containerPort: 29000
                  env:
                    - name: IFRT_PROXY_USE_INSECURE_GRPC_CREDENTIALS
                      value: "true"
                    - name: IFRT_PROXY_LARGE_TRANSFER_OPTIMIZATION_DIRECTORY
                      value: /tmp/ifrt_proxy
                  args:
                    - --resource_manager_address=localhost:29001
                    - --server_port=29000
                  volumeMounts:
                    - name: shared-memory
                      mountPath: /tmp/ifrt_proxy
                - name: pathways-rm
                  image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server:jax-0.9.2
                  restartPolicy: Always
                  env:
                    - name: TPU_SKIP_MDS_QUERY
                      value: "true"
                  args:
                    - --server_port=29001
                    - --node_type=resource_manager
                    - --instance_count={NUM_SLICES}
                    - --instance_type={RM_INSTANCE_TYPE}
                    - --gcs_scratch_location={GCS_SCRATCH_LOCATION}
              volumes:
                - name: shared-memory
                  emptyDir:
                    medium: Memory
              serviceAccountName: default
              dnsPolicy: ClusterFirstWithHostNet

    # -------------------------------------------------------------------------
    # 2. Pathways Workers (TPU Pods)
    # -------------------------------------------------------------------------
    - name: pwwk
      replicas: {NUM_SLICES}
      template:
        metadata:
          annotations:
            alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
        spec:
          parallelism: {VMS_PER_SLICE}
          completions: {VMS_PER_SLICE}
          backoffLimit: 32
          template:
            spec:
              terminationGracePeriodSeconds: 60
              hostAliases:
                - ip: 169.254.169.254
                  hostnames:
                    - metadata
                    - metadata.google.internal
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
                cloud.google.com/gke-tpu-topology: {GKE_TOPOLOGY}
                {SPOT_NODE_SELECTOR_WORKER}
              tolerations:
                # Spot toleration only added if --enable-spot is True
                {SPOT_TOLERATION_WORKER}
              containers:
                - name: worker
                  image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server:jax-0.9.2
                  imagePullPolicy: Always
                  ports:
                    - containerPort: 8471
                    - containerPort: 8080
                    - containerPort: 8431
                    - containerPort: 9000
                    - containerPort: 29001
                  securityContext:
                    privileged: true
                  resources:
                    limits:
                      google.com/tpu: 4
                  env:
                    - name: TPU_TYPE
                      value: {TPU_TYPE}
                    - name: NUM_TPU_SLICES
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.labels['jobset.sigs.k8s.io/replicatedjob-replicas']
                    - name: MEGASCALE_COORDINATOR_ADDRESS
                      value: {NAME}-pwhd-0-0.{NAME}
                    - name: MEGASCALE_NUM_SLICES
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.labels['jobset.sigs.k8s.io/replicatedjob-replicas']
                    - name: MEGASCALE_SLICE_ID
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.labels['jobset.sigs.k8s.io/job-index']
                  args:
                    - --server_port=29001
                    - --resource_manager_address={NAME}-pwhd-0-0.{NAME}:29001
                    - --gcs_scratch_location={GCS_SCRATCH_LOCATION}
                    - --tpu_pinned_host_allocation_recycle=true
                    - --tpu_premapped_buffer_size=274877906944
              serviceAccountName: default
              dnsPolicy: ClusterFirstWithHostNet
  successPolicy:
    operator: All
    targetReplicatedJobs:
      - pwhd
```
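
One way the Spot placeholders in this template might be filled is sketched below on a shortened fragment rather than the full manifest. Everything here is an assumption for illustration: the names `SPOT_TOLERATION`, `FRAGMENT`, and `render_fragment` are hypothetical, and the `cloud.google.com/gke-spot` selector and toleration reflect the usual GKE Spot label and taint, not a confirmed detail of what `pwy` emits.

```python
import textwrap

# Assumption: Spot node pools are labelled (and optionally tainted) with
# cloud.google.com/gke-spot=true, as is common on GKE.
SPOT_TOLERATION = """\
- key: cloud.google.com/gke-spot
  operator: Equal
  value: "true"
  effect: NoSchedule"""

# Shortened stand-in for the worker section of the reference template.
FRAGMENT = """\
nodeSelector:
  cloud.google.com/gke-tpu-topology: {GKE_TOPOLOGY}
  {SPOT_NODE_SELECTOR_WORKER}
tolerations:
  {SPOT_TOLERATION_WORKER}
"""


def render_fragment(gke_topology: str, enable_spot: bool) -> str:
    """Hypothetical helper: interpolate the Spot placeholders."""
    selector = 'cloud.google.com/gke-spot: "true"' if enable_spot else ""
    toleration = SPOT_TOLERATION if enable_spot else ""
    # Indent every line to the placeholder's column, then strip the first
    # line's indent because the template already supplies it.
    toleration = textwrap.indent(toleration, "  ").lstrip()
    rendered = FRAGMENT.format(
        GKE_TOPOLOGY=gke_topology,
        SPOT_NODE_SELECTOR_WORKER=selector,
        SPOT_TOLERATION_WORKER=toleration,
    )
    # Drop lines that an empty placeholder left blank.
    return "\n".join(line for line in rendered.splitlines() if line.strip()) + "\n"
```

The re-indentation step matters because a multi-line value substituted into an indented placeholder would otherwise leave its continuation lines at column zero and break the YAML nesting.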

---

## 5. Repository Layout & Target Files

A new standalone directory structure will be created under the workspace (mocking a new repository context):

```
/Users/stoelinga/workspace/pathways-cli/
├── pyproject.toml
├── README.md
├── pwy/
│   ├── __init__.py
│   ├── cli.py            # Entry point (Click CLI commands: up, down)
│   ├── generator.py      # Topology mapping and dictionary interpolation
│   ├── templates.py      # Text template holding the YAML manifest structure
│   └── kubernetes.py     # Subprocess module executing "kubectl apply -f" or "kubectl delete"
└── tests/
    ├── __init__.py
    ├── test_generator.py
    └── test_cli.py
```

### File Implementation Details

#### `pwy/generator.py`
Contains the lookup dictionaries and mapping functions:
```python
TPU_MAPPINGS = {
    "v6e-4": {"topology": "2x2", "vms_per_slice": 1, "rm_type": "tpuv6e:2x2"},
    "v6e-8": {"topology": "2x4", "vms_per_slice": 2, "rm_type": "tpuv6e:2x4"},
    "v6e-16": {"topology": "4x4", "vms_per_slice": 4, "rm_type": "tpuv6e:4x4"},
    "v6e-32": {"topology": "4x8", "vms_per_slice": 8, "rm_type": "tpuv6e:4x8"},
    "v6e-64": {"topology": "8x8", "vms_per_slice": 16, "rm_type": "tpuv6e:8x8"},
}

def generate_yaml(
    name: str,
    namespace: str,
    tpu_type: str,
    gcs_scratch_location: str,
    num_slices: int = 1,
    jax_client_image: str = "python:3.12-slim",
    command: str | None = None,
    enable_spot: bool = False,
) -> str:
    # 1. Look up TPU type mappings
    # 2. Format client container commands
    # 3. Handle spot nodeSelector and tolerations formatting
    # 4. Interpolate templates.YAML_TEMPLATE with final string variables
    ...
```

#### `pwy/cli.py`
Handles options parsing and commands:
* Imports `generate_yaml`.
* Runs `kubectl apply -f -` through Python's `subprocess.run(..., input=yaml_content.encode())`, and `kubectl delete jobset` for teardown.
@@ -0,0 +1,33 @@
[project]
name = "pathways-cli"
version = "0.1.0"
description = "Pathways CLI to easily bring up pathways clusters."
readme = "README.md"
authors = [
    { name = "Sam Stoelinga", email = "sammiestoel@gmail.com" }
]
requires-python = ">=3.12"
dependencies = [
    "click>=8.4.1",
    "python-dotenv>=1.2.2",
]

[project.scripts]
pwy = "pwy:main"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/pwy"]

[dependency-groups]
dev = [
    "pytest>=9.0.3",
]

[tool.pytest.ini_options]
markers = [
    "e2e: end-to-end integration tests",
]