pathways-cli 0.1.0__tar.gz

@@ -0,0 +1,16 @@
1
+ # Python-generated files
2
+ __pycache__/
3
+ *.py[oc]
4
+ build/
5
+ dist/
6
+ wheels/
7
+ *.egg-info
8
+
9
+ # Virtual environments
10
+ .venv
11
+
12
+ # Local env files
13
+ .env
14
+
15
+ # Test caches
16
+ .pytest_cache/
@@ -0,0 +1 @@
1
+ 3.12
@@ -0,0 +1,52 @@
1
+ # Gemini Agent & Developer Guide: `pathways-cli`
2
+
3
+ This document details the codebase design, architecture, key lessons, and integration verification guides for the `pwy` CLI tool, serving as a developer-facing companion to the user-facing `README.md`.
4
+
5
+ ---
6
+
7
+ ## 1. Codebase Architecture
8
+
9
+ The project uses a standard `src` layout, with PEP 621 metadata in `pyproject.toml`:
10
+
11
+ ```
12
+ /Users/stoelinga/workspace/pathways-cli/
13
+ ├── pyproject.toml            # Package configurations, CLI scripts, and Pytest options
14
+ ├── .gitignore                # Excludes local environments, caches, and secrets
15
+ ├── README.md                 # User documentation and example verification steps
16
+ ├── GEMINI.md                 # Codebase design and developer/agent context (this file)
17
+ ├── src/
18
+ │   └── pwy/
19
+ │       ├── __init__.py       # Exposes cli entry points
20
+ │       ├── cli.py            # click CLI definition: up, down commands
21
+ │       ├── generator.py      # Topology math, spot VM toggles, colocated python configurations
22
+ │       ├── templates.py      # Complete GKE JobSet multi-line YAML manifest template
23
+ │       └── kubernetes.py     # Wrapper invoking kubectl subprocesses
24
+ └── tests/
25
+     ├── __init__.py
26
+     ├── test_cli.py           # CLI option validations & mocks
27
+     ├── test_generator.py     # Mappings and string-formatting unit tests
28
+     └── test_e2e.py           # Real GKE cluster integration execution verifying JAX setup
29
+ ```
30
+
31
+ ---
32
+
33
+ ## 2. Testing Workflows
34
+
35
+ Verify changes using one of the two testing scopes:
36
+
37
+ ### 1. Unit Tests (Mocked)
38
+ Tests calculations and YAML generation without cluster access.
39
+ ```bash
40
+ uv run pytest tests/test_generator.py tests/test_cli.py
41
+ ```
42
+
43
+ ### 2. End-to-End Integration Tests (Active Cluster)
44
+ Runs actual deployments on a running TPU nodepool, installs JAX, executes verification scripts, and tears the setup down.
45
+ 1. Configure your GCS path in a local `.env` file:
46
+ ```env
47
+ PWY_E2E_GCS_SCRATCH_LOCATION=gs://my-staging-bucket/pathways
48
+ ```
49
+ 2. Run pytest targeting the `e2e` mark:
50
+ ```bash
51
+ uv run pytest tests/test_e2e.py -m e2e -s
52
+ ```
@@ -0,0 +1,122 @@
1
+ Metadata-Version: 2.4
2
+ Name: pathways-cli
3
+ Version: 0.1.0
4
+ Summary: Pathways CLI to easily bring up pathways clusters.
5
+ Author-email: Sam Stoelinga <sammiestoel@gmail.com>
6
+ Requires-Python: >=3.12
7
+ Requires-Dist: click>=8.4.1
8
+ Requires-Dist: python-dotenv>=1.2.2
9
+ Description-Content-Type: text/markdown
10
+
11
+ # `pwy`: Standalone Pathways GKE Cluster CLI Tool
12
+
13
+ `pwy` is a lightweight, standalone Python CLI utility designed to generate, apply, and manage interactive Pathways workloads on Google Kubernetes Engine (GKE) using Kubernetes JobSets.
14
+
15
+ ---
16
+
17
+ ## Features
18
+
19
+ - **Automated TPU Topology Calculations**: Translates simple TPU resource types (`v6e-4`, `v6e-16`, etc.) into GKE topologies, VM counts, and instance settings.
20
+ - **Spot VM Support**: Dynamically injects GKE node selectors and tolerations for running workloads on cost-effective Spot VMs.
21
+ - **Colocated Python Support**: Simplifies distributed checkpointing (e.g. via Orbax) by configuring and enabling colocated host CPU sidecars and proxy endpoints automatically.
22
+ - **Interactive & Batch Execution**: Supports spinning up Pathways servers with infinite sleep drivers for interactive debugging, or executing training scripts directly.
23
+ - **Dry-run Manifest Generation**: Preview and inspect the GKE JobSet manifest without applying it to the cluster.
24
+
25
+ ---
26
+
27
+ ## Installation
28
+
29
+ This project utilizes [uv](https://github.com/astral-sh/uv) for fast, modern Python package and dependency management.
30
+
31
+ To sync the environment and install `pwy`:
32
+
33
+ ```bash
34
+ uv sync
35
+ ```
36
+
37
+ ---
38
+
39
+ ## Usage
40
+
41
+ You can invoke `pwy` commands directly using `uv run`:
42
+
43
+ ### 1. Provision / Preview a Cluster (`pwy up`)
44
+
45
+ Starts a Pathways JobSet or dry-runs the configuration.
46
+
47
+ ```bash
48
+ uv run pwy up \
49
+ --tpu-type v6e-16 \
50
+ --gcs-scratch-location gs://my-bucket/pathways-staging \
51
+ --num-slices 1 \
52
+ --dry-run
53
+ ```
54
+
55
+ #### Key Options:
56
+ - `--tpu-type`: **(Required)** TPU type (e.g., `v6e-4`, `v6e-8`, `v6e-16`, `v6e-32`, `v6e-64`).
57
+ - `--gcs-scratch-location`: **(Required)** GCS scratch path for Pathways synchronization.
58
+ - `--num-slices`: Number of TPU slices to run (default: `1`).
59
+ - `--jax-client-image`: Custom client container image (default: `python:3.12-slim`).
60
+ - `--command`: Run a custom training/eval script in the client container. If omitted, defaults to `sleep infinity` (interactive mode).
61
+ - `--enable-spot`: Add node affinity and toleration settings for Spot VMs.
62
+ - `--colocated-python`: Enables colocated CPU Python sidecar/init containers on GKE workers and enables external proxy routing.
63
+ - `--dry-run`: Prints the generated YAML to stdout instead of calling `kubectl apply`.
64
+ - `--name`: Name of the Kubernetes JobSet resource (default: `pathways-interactive`).
65
+ - `--namespace`: Target Kubernetes namespace (default: `default`).
66
+
67
+ ---
68
+
69
+ ### 2. Teardown a Cluster (`pwy down`)
70
+
71
+ Deletes the running Pathways JobSet.
72
+
73
+ ```bash
74
+ uv run pwy down --name pathways-interactive --namespace default
75
+ ```
76
+
77
+ ---
78
+
79
+ ### 3. Verification Example
80
+
81
+ Once the interactive cluster is running, you can verify execution by `exec`ing into the client container:
82
+
83
+ 1. **Find the client pod name**:
84
+ ```bash
85
+ POD_NAME=$(kubectl get pods -l jobset.sigs.k8s.io/jobset-name=pathways-interactive,jobset.sigs.k8s.io/replicatedjob-name=pwhd -o jsonpath='{.items[0].metadata.name}')
86
+ ```
87
+
88
+ 2. **Install JAX and Pathways utils**:
89
+ ```bash
90
+ kubectl exec $POD_NAME -c client -- pip install jax pathwaysutils
91
+ ```
92
+
93
+ 3. **Run a Python snippet to initialize and list devices**:
94
+ ```bash
95
+ kubectl exec $POD_NAME -c client -- python3 -c "import pathwaysutils; pathwaysutils.initialize(); import jax; print(jax.devices())"
96
+ ```
97
+
98
+ The command output should print the available virtual TPU devices (e.g., coordinates and memory spaces of the allocated chips).
99
+
100
+ ---
101
+
102
+ ## TPU Type Mappings
103
+
104
+ `pwy` handles all resource-limit math and topologies automatically according to the following matrix:
105
+
106
+ | TPU Type | GKE Topology | VMs Per Slice | RM Instance Type |
107
+ | :--- | :--- | :--- | :--- |
108
+ | `v6e-4` | `2x2` | 1 | `tpuv6e:2x2` |
109
+ | `v6e-8` | `2x4` | 2 | `tpuv6e:2x4` |
110
+ | `v6e-16` | `4x4` | 4 | `tpuv6e:4x4` |
111
+ | `v6e-32` | `4x8` | 8 | `tpuv6e:4x8` |
112
+ | `v6e-64` | `8x8` | 16 | `tpuv6e:8x8` |
113
+
114
+ ---
115
+
116
+ ## Running Tests
117
+
118
+ To execute the unit test suite:
119
+
120
+ ```bash
121
+ uv run pytest
122
+ ```
@@ -0,0 +1,112 @@
1
+ # `pwy`: Standalone Pathways GKE Cluster CLI Tool
2
+
3
+ `pwy` is a lightweight, standalone Python CLI utility designed to generate, apply, and manage interactive Pathways workloads on Google Kubernetes Engine (GKE) using Kubernetes JobSets.
4
+
5
+ ---
6
+
7
+ ## Features
8
+
9
+ - **Automated TPU Topology Calculations**: Translates simple TPU resource types (`v6e-4`, `v6e-16`, etc.) into GKE topologies, VM counts, and instance settings.
10
+ - **Spot VM Support**: Dynamically injects GKE node selectors and tolerations for running workloads on cost-effective Spot VMs.
11
+ - **Colocated Python Support**: Simplifies distributed checkpointing (e.g. via Orbax) by configuring and enabling colocated host CPU sidecars and proxy endpoints automatically.
12
+ - **Interactive & Batch Execution**: Supports spinning up Pathways servers with infinite sleep drivers for interactive debugging, or executing training scripts directly.
13
+ - **Dry-run Manifest Generation**: Preview and inspect the GKE JobSet manifest without applying it to the cluster.
14
+
15
+ ---
16
+
17
+ ## Installation
18
+
19
+ This project utilizes [uv](https://github.com/astral-sh/uv) for fast, modern Python package and dependency management.
20
+
21
+ To sync the environment and install `pwy`:
22
+
23
+ ```bash
24
+ uv sync
25
+ ```
26
+
27
+ ---
28
+
29
+ ## Usage
30
+
31
+ You can invoke `pwy` commands directly using `uv run`:
32
+
33
+ ### 1. Provision / Preview a Cluster (`pwy up`)
34
+
35
+ Starts a Pathways JobSet or dry-runs the configuration.
36
+
37
+ ```bash
38
+ uv run pwy up \
39
+ --tpu-type v6e-16 \
40
+ --gcs-scratch-location gs://my-bucket/pathways-staging \
41
+ --num-slices 1 \
42
+ --dry-run
43
+ ```
44
+
45
+ #### Key Options:
46
+ - `--tpu-type`: **(Required)** TPU type (e.g., `v6e-4`, `v6e-8`, `v6e-16`, `v6e-32`, `v6e-64`).
47
+ - `--gcs-scratch-location`: **(Required)** GCS scratch path for Pathways synchronization.
48
+ - `--num-slices`: Number of TPU slices to run (default: `1`).
49
+ - `--jax-client-image`: Custom client container image (default: `python:3.12-slim`).
50
+ - `--command`: Run a custom training/eval script in the client container. If omitted, defaults to `sleep infinity` (interactive mode).
51
+ - `--enable-spot`: Add node affinity and toleration settings for Spot VMs.
52
+ - `--colocated-python`: Enables colocated CPU Python sidecar/init containers on GKE workers and enables external proxy routing.
53
+ - `--dry-run`: Prints the generated YAML to stdout instead of calling `kubectl apply`.
54
+ - `--name`: Name of the Kubernetes JobSet resource (default: `pathways-interactive`).
55
+ - `--namespace`: Target Kubernetes namespace (default: `default`).
56
+
57
+ ---
58
+
59
+ ### 2. Teardown a Cluster (`pwy down`)
60
+
61
+ Deletes the running Pathways JobSet.
62
+
63
+ ```bash
64
+ uv run pwy down --name pathways-interactive --namespace default
65
+ ```
66
+
67
+ ---
68
+
69
+ ### 3. Verification Example
70
+
71
+ Once the interactive cluster is running, you can verify execution by `exec`ing into the client container:
72
+
73
+ 1. **Find the client pod name**:
74
+ ```bash
75
+ POD_NAME=$(kubectl get pods -l jobset.sigs.k8s.io/jobset-name=pathways-interactive,jobset.sigs.k8s.io/replicatedjob-name=pwhd -o jsonpath='{.items[0].metadata.name}')
76
+ ```
77
+
78
+ 2. **Install JAX and Pathways utils**:
79
+ ```bash
80
+ kubectl exec $POD_NAME -c client -- pip install jax pathwaysutils
81
+ ```
82
+
83
+ 3. **Run a Python snippet to initialize and list devices**:
84
+ ```bash
85
+ kubectl exec $POD_NAME -c client -- python3 -c "import pathwaysutils; pathwaysutils.initialize(); import jax; print(jax.devices())"
86
+ ```
87
+
88
+ The command output should print the available virtual TPU devices (e.g., coordinates and memory spaces of the allocated chips).
89
+
90
+ ---
91
+
92
+ ## TPU Type Mappings
93
+
94
+ `pwy` handles all resource-limit math and topologies automatically according to the following matrix:
95
+
96
+ | TPU Type | GKE Topology | VMs Per Slice | RM Instance Type |
97
+ | :--- | :--- | :--- | :--- |
98
+ | `v6e-4` | `2x2` | 1 | `tpuv6e:2x2` |
99
+ | `v6e-8` | `2x4` | 2 | `tpuv6e:2x4` |
100
+ | `v6e-16` | `4x4` | 4 | `tpuv6e:4x4` |
101
+ | `v6e-32` | `4x8` | 8 | `tpuv6e:4x8` |
102
+ | `v6e-64` | `8x8` | 16 | `tpuv6e:8x8` |
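The rows of this matrix follow a simple pattern, sketched below in Python. This is an illustration rather than `pwy`'s actual code: `describe_tpu_type` and `CHIPS_PER_VM` are hypothetical names, and the figure of 4 chips per VM is inferred from the table (chip count divided by VMs per slice), not stated by the tool.

```python
from math import prod

# Assumption inferred from the table: every listed v6e type uses 4 chips per VM.
CHIPS_PER_VM = 4


def describe_tpu_type(tpu_type: str, topology: str) -> dict:
    """Recompute one row of the mapping table from its topology string."""
    chips = prod(int(dim) for dim in topology.split("x"))
    expected_chips = int(tpu_type.split("-")[1])
    if chips != expected_chips:
        raise ValueError(
            f"topology {topology} has {chips} chips, but {tpu_type} implies {expected_chips}"
        )
    return {
        "topology": topology,
        "vms_per_slice": chips // CHIPS_PER_VM,
        "rm_instance_type": f"tpuv6e:{topology}",
    }


row = describe_tpu_type("v6e-16", "4x4")
print(row)
```

The check against the chip count in the type name catches a mismatched pair such as `v6e-16` with `2x4` before any manifest is generated.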
103
+
104
+ ---
105
+
106
+ ## Running Tests
107
+
108
+ To execute the unit test suite:
109
+
110
+ ```bash
111
+ uv run pytest
112
+ ```
@@ -0,0 +1,352 @@
1
+ # Implementation Plan: Standalone Pathways Cluster CLI Tool (`pwy`)
2
+
3
+ This document outlines the final design and detailed implementation specifications for a standalone Python CLI tool to generate, apply, and manage interactive Pathways GKE cluster manifests.
4
+
5
+ ---
6
+
7
+ ## 1. CLI Commands & Arguments
8
+
9
+ The CLI binary/entry point will be named `pwy`.
10
+
11
+ ### Commands
12
+
13
+ #### 1. `pwy up`
14
+ Starts the cluster or dry-runs the configuration.
15
+
16
+ ```bash
17
+ pwy up \
18
+ --tpu-type=v6e-4 \
19
+ --gcs-scratch-location=gs://my-bucket/staging \
20
+ [--num-slices=1] \
21
+ [--jax-client-image=python:3.12-slim] \
22
+ [--command="python my_script.py"] \
23
+ [--enable-spot] \
24
+ [--colocated-python] \
25
+ [--dry-run] \
26
+ [--name=pathways-interactive] \
27
+ [--namespace=default]
28
+ ```
29
+
30
+ * **Behavior**:
31
+ * Calculates cluster configuration based on `--tpu-type` and `--num-slices`.
32
+ * Generates the JobSet YAML.
33
+ * If `--dry-run` is set: Prints the generated YAML to stdout and exits.
34
+ * Otherwise: Pipes the YAML directly to `kubectl apply -f -`.
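The dry-run branch described above can be sketched as follows. This is a minimal illustration, not the shipped implementation: `apply_or_print` and its `run` parameter are hypothetical names, and `run` exists only so the `kubectl` call can be replaced in tests.

```python
import subprocess
from typing import Callable


def apply_or_print(yaml_content: str, dry_run: bool, run: Callable = subprocess.run) -> None:
    """Print the manifest on dry-run; otherwise pipe it to `kubectl apply -f -`."""
    if dry_run:
        print(yaml_content)
        return
    # check=True raises CalledProcessError if kubectl exits with a non-zero status.
    run(["kubectl", "apply", "-f", "-"], input=yaml_content.encode(), check=True)


# Dry-run only prints, so no cluster or kubectl binary is needed here.
apply_or_print("kind: JobSet", dry_run=True)
```

Passing the runner as an argument keeps unit tests free of real subprocess calls, which fits the mocked test scope the project describes.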
35
+
36
+ #### 2. `pwy down`
37
+ Tears down the cluster.
38
+
39
+ ```bash
40
+ pwy down [--name=pathways-interactive] [--namespace=default]
41
+ ```
42
+ * **Behavior**: Runs `kubectl delete jobset <name> --namespace=<namespace>`.
43
+
44
+ ---
45
+
46
+ ## 2. Automated TPU Mappings & Configuration Math
47
+
48
+ The tool automatically maps the user-provided `--tpu-type` to GKE resources, topologies, and arguments.
49
+
50
+ ### Mappings Database
51
+
52
+ | TPU Type | GKE Topology | VMs Per Slice (`vms_per_slice`) | RM Instance Type (`rm_instance_type`) |
53
+ | :--- | :--- | :--- | :--- |
54
+ | `v6e-4` | `2x2` | 1 | `tpuv6e:2x2` |
55
+ | `v6e-8` | `2x4` | 2 | `tpuv6e:2x4` |
56
+ | `v6e-16` | `4x4` | 4 | `tpuv6e:4x4` |
57
+ | `v6e-32` | `4x8` | 8 | `tpuv6e:4x8` |
58
+ | `v6e-64` | `8x8` | 16 | `tpuv6e:8x8` |
59
+
60
+ ### Configuration Math
61
+
62
+ For any given run with `tpu-type` and `num-slices`:
63
+
64
+ * **Pathways Head (`pwhd` replicated job)**:
65
+ * `replicas` is always `1`.
66
+ * **Pathways Worker (`pwwk` replicated job)**:
67
+ * `replicas` (number of slices) = `--num-slices`
68
+ * `spec.parallelism` (VMs per slice) = `vms_per_slice`
69
+ * `spec.completions` (VMs per slice) = `vms_per_slice`
70
+ * `resources.limits["google.com/tpu"]` = `4` (always 4 for v6e)
71
+ * `nodeSelector["cloud.google.com/gke-tpu-topology"]` = `GKE Topology`
72
+ * `nodeSelector["cloud.google.com/gke-tpu-accelerator"]` = `tpu-v6e-slice`
73
+ * **Resource Manager container (`pathways-rm` sidecar)**:
74
+ * `--instance_count` = `--num-slices`
75
+ * `--instance_type` = `rm_instance_type`
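The configuration math above can be sketched as a single lookup function. The mapping values are copied from the table in this section; the function name `cluster_settings` and the shape of the returned dictionary are assumptions made for illustration.

```python
# Values copied from the "Mappings Database" table.
TPU_MAPPINGS = {
    "v6e-4": {"topology": "2x2", "vms_per_slice": 1, "rm_type": "tpuv6e:2x2"},
    "v6e-8": {"topology": "2x4", "vms_per_slice": 2, "rm_type": "tpuv6e:2x4"},
    "v6e-16": {"topology": "4x4", "vms_per_slice": 4, "rm_type": "tpuv6e:4x4"},
    "v6e-32": {"topology": "4x8", "vms_per_slice": 8, "rm_type": "tpuv6e:4x8"},
    "v6e-64": {"topology": "8x8", "vms_per_slice": 16, "rm_type": "tpuv6e:8x8"},
}


def cluster_settings(tpu_type: str, num_slices: int = 1) -> dict:
    """Derive the per-job values listed under "Configuration Math" (hypothetical helper)."""
    if tpu_type not in TPU_MAPPINGS:
        raise ValueError(f"unsupported TPU type: {tpu_type}")
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    mapping = TPU_MAPPINGS[tpu_type]
    return {
        "pwhd": {"replicas": 1},  # the head job is always a single replica
        "pwwk": {
            "replicas": num_slices,
            "parallelism": mapping["vms_per_slice"],
            "completions": mapping["vms_per_slice"],
            "tpu_limit": 4,  # always 4 for v6e, per the plan
            "node_selector": {
                "cloud.google.com/gke-tpu-topology": mapping["topology"],
                "cloud.google.com/gke-tpu-accelerator": "tpu-v6e-slice",
            },
        },
        "pathways_rm_args": [
            f"--instance_count={num_slices}",
            f"--instance_type={mapping['rm_type']}",
        ],
    }


settings = cluster_settings("v6e-16", num_slices=2)
print(settings["pwwk"]["replicas"], settings["pwwk"]["parallelism"])
```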
76
+
77
+ ---
78
+
79
+ ## 3. Client Command & Image Execution Logic
80
+
81
+ The client container `command` field is generated dynamically based on the `--jax-client-image`, `--command`, and `--colocated-python` flags:
82
+
83
+ 1. **JAX Client Image**:
84
+ * Uses `--jax-client-image` (defaulting to `python:3.12-slim`).
85
+ 2. **Command Executed**:
86
+ * **If `--command` is NOT provided**:
87
+ ```bash
88
+ bash -c "sleep infinity"
89
+ ```
90
+ * **If `--command` IS provided** (e.g. `--command="python training.py"`):
91
+ The tool boots up the environment, initializes pathways, and then executes the custom command directly:
92
+ ```bash
93
+ bash -c "python training.py"
94
+ ```
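The two cases above reduce to one small function. This is a sketch under the plan's stated behavior; the name `client_container_command` is hypothetical.

```python
from typing import List, Optional


def client_container_command(command: Optional[str] = None) -> List[str]:
    """Return the client container's command list.

    With no custom command the container sleeps forever (interactive mode);
    otherwise the custom command is run through bash.
    """
    return ["bash", "-c", command if command else "sleep infinity"]


print(client_container_command())
print(client_container_command("python training.py"))
```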
95
+
96
+ ---
97
+
98
+ ## 4. Full Manifest Reference Template
99
+
100
+ Below is the complete reference JobSet YAML template. Names in `{BRACES}` are placeholders that the generator fills in (for example, `{GKE_TOPOLOGY}` becomes `2x2` for a `v6e-4` single-slice run); everything else is fixed. The CLI generator code should output a structure identical to this:
101
+
102
+ ```yaml
103
+ apiVersion: jobset.x-k8s.io/v1alpha2
104
+ kind: JobSet
105
+ metadata:
106
+   name: {NAME}
107
+   namespace: {NAMESPACE}
108
+ spec:
109
+   failurePolicy:
110
+     maxRestarts: 0
111
+     restartStrategy: BlockingRecreate
112
+   replicatedJobs:
113
+   # -------------------------------------------------------------------------
114
+   # 1. Pathways Head (Client Pod)
115
+   # -------------------------------------------------------------------------
116
+   - name: pwhd
117
+     replicas: 1
118
+     template:
119
+       spec:
120
+         parallelism: 1
121
+         completions: 1
122
+         backoffLimit: 32
123
+         template:
124
+           metadata:
125
+             annotations:
126
+               cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
127
+           spec:
128
+             terminationGracePeriodSeconds: 60
129
+             restartPolicy: Never
130
+             hostAliases:
131
+             - ip: 169.254.169.254
132
+               hostnames:
133
+               - metadata
134
+               - metadata.google.internal
135
+             tolerations:
136
+             - key: google.com/tpu
137
+               operator: Equal
138
+               value: "present"
139
+               effect: NoSchedule
140
+             # Spot toleration only added if --enable-spot is True
141
+             {SPOT_TOLERATION_HEAD}
142
+             containers:
143
+             - name: client
144
+               image: {CLIENT_IMAGE}
145
+               command:
146
+               - bash
147
+               - -c
148
+               - |
149
+                 {CLIENT_EXECUTION_COMMAND}
150
+               resources:
151
+                 requests:
152
+                   cpu: "1000m"
153
+                   memory: "16Gi"
154
+                 limits:
155
+                   cpu: "1000m"
156
+                   memory: "16Gi"
157
+               env:
158
+               - name: TPU_TYPE
159
+                 value: {TPU_TYPE}
160
+               - name: NUM_TPU_SLICES
161
+                 valueFrom:
162
+                   fieldRef:
163
+                     fieldPath: metadata.labels['jobset.sigs.k8s.io/replicatedjob-replicas']
164
+               - name: JAX_BACKEND_TARGET
165
+                 value: grpc://localhost:29000
166
+               - name: XCLOUD_ENVIRONMENT
167
+                 value: GCP
168
+               - name: JAX_PLATFORMS
169
+                 value: proxy
170
+               - name: ENABLE_PATHWAYS_PERSISTENCE
171
+                 value: "1"
172
+               - name: TPU_SKIP_MDS_QUERY
173
+                 value: "true"
174
+               - name: PYTHONUNBUFFERED
175
+                 value: "1"
176
+               - name: TEST_UNDECLARED_OUTPUTS_DIR
177
+                 value: "true"
178
+               - name: IFRT_PROXY_LARGE_TRANSFER_THRESHOLD
179
+                 value: "1"
180
+               - name: IFRT_PROXY_LARGE_TRANSFER_OPTIMIZATION_DIRECTORY
181
+                 value: /tmp/ifrt_proxy
182
+               volumeMounts:
183
+               - name: shared-memory
184
+                 mountPath: /tmp/ifrt_proxy
185
+               imagePullPolicy: Always
186
+             initContainers:
187
+             - name: pathways-proxy
188
+               image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/proxy_server:jax-0.9.2
189
+               restartPolicy: Always
190
+               ports:
191
+               - containerPort: 29000
192
+               env:
193
+               - name: IFRT_PROXY_USE_INSECURE_GRPC_CREDENTIALS
194
+                 value: "true"
195
+               - name: IFRT_PROXY_LARGE_TRANSFER_OPTIMIZATION_DIRECTORY
196
+                 value: /tmp/ifrt_proxy
197
+               args:
198
+               - --resource_manager_address=localhost:29001
199
+               - --server_port=29000
200
+               volumeMounts:
201
+               - name: shared-memory
202
+                 mountPath: /tmp/ifrt_proxy
203
+             - name: pathways-rm
204
+               image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server:jax-0.9.2
205
+               restartPolicy: Always
206
+               env:
207
+               - name: TPU_SKIP_MDS_QUERY
208
+                 value: "true"
209
+               args:
210
+               - --server_port=29001
211
+               - --node_type=resource_manager
212
+               - --instance_count={NUM_SLICES}
213
+               - --instance_type={RM_INSTANCE_TYPE}
214
+               - --gcs_scratch_location={GCS_SCRATCH_LOCATION}
215
+             volumes:
216
+             - name: shared-memory
217
+               emptyDir:
218
+                 medium: Memory
219
+             serviceAccountName: default
220
+             dnsPolicy: ClusterFirstWithHostNet
221
+
222
+   # -------------------------------------------------------------------------
223
+   # 2. Pathways Workers (TPU Pods)
224
+   # -------------------------------------------------------------------------
225
+   - name: pwwk
226
+     replicas: {NUM_SLICES}
227
+     template:
228
+       metadata:
229
+         annotations:
230
+           alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
231
+       spec:
232
+         parallelism: {VMS_PER_SLICE}
233
+         completions: {VMS_PER_SLICE}
234
+         backoffLimit: 32
235
+         template:
236
+           spec:
237
+             terminationGracePeriodSeconds: 60
238
+             hostAliases:
239
+             - ip: 169.254.169.254
240
+               hostnames:
241
+               - metadata
242
+               - metadata.google.internal
243
+             nodeSelector:
244
+               cloud.google.com/gke-tpu-accelerator: tpu-v6e-slice
245
+               cloud.google.com/gke-tpu-topology: {GKE_TOPOLOGY}
246
+               {SPOT_NODE_SELECTOR_WORKER}
247
+             tolerations:
248
+             # Spot toleration only added if --enable-spot is True
249
+             {SPOT_TOLERATION_WORKER}
250
+             containers:
251
+             - name: worker
252
+               image: us-docker.pkg.dev/cloud-tpu-v2-images/pathways/server:jax-0.9.2
253
+               imagePullPolicy: Always
254
+               ports:
255
+               - containerPort: 8471
256
+               - containerPort: 8080
257
+               - containerPort: 8431
258
+               - containerPort: 9000
259
+               - containerPort: 29001
260
+               securityContext:
261
+                 privileged: true
262
+               resources:
263
+                 limits:
264
+                   google.com/tpu: 4
265
+               env:
266
+               - name: TPU_TYPE
267
+                 value: {TPU_TYPE}
268
+               - name: NUM_TPU_SLICES
269
+                 valueFrom:
270
+                   fieldRef:
271
+                     fieldPath: metadata.labels['jobset.sigs.k8s.io/replicatedjob-replicas']
272
+               - name: MEGASCALE_COORDINATOR_ADDRESS
273
+                 value: {NAME}-pwhd-0-0.{NAME}
274
+               - name: MEGASCALE_NUM_SLICES
275
+                 valueFrom:
276
+                   fieldRef:
277
+                     fieldPath: metadata.labels['jobset.sigs.k8s.io/replicatedjob-replicas']
278
+               - name: MEGASCALE_SLICE_ID
279
+                 valueFrom:
280
+                   fieldRef:
281
+                     fieldPath: metadata.labels['jobset.sigs.k8s.io/job-index']
282
+               args:
283
+               - --server_port=29001
284
+               - --resource_manager_address={NAME}-pwhd-0-0.{NAME}:29001
285
+               - --gcs_scratch_location={GCS_SCRATCH_LOCATION}
286
+               - --tpu_pinned_host_allocation_recycle=true
287
+               - --tpu_premapped_buffer_size=274877906944
288
+             serviceAccountName: default
289
+             dnsPolicy: ClusterFirstWithHostNet
290
+   successPolicy:
291
+     operator: All
292
+     targetReplicatedJobs:
293
+     - pwhd
294
+ ```
295
+
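The `{SPOT_NODE_SELECTOR_WORKER}` and `{SPOT_TOLERATION_*}` placeholders in the template could be filled along these lines. This is a sketch, not the tool's confirmed output: the helper names are hypothetical, and it assumes the Spot node pool carries the `cloud.google.com/gke-spot` label and a matching `NoSchedule` taint, which is the convention GKE documents for Spot VMs.

```python
# Assumption: label/taint key that GKE documents for Spot VM node pools.
SPOT_KEY = "cloud.google.com/gke-spot"


def spot_node_selector(enable_spot: bool) -> str:
    """Text for {SPOT_NODE_SELECTOR_WORKER}; empty when Spot is disabled."""
    return f'{SPOT_KEY}: "true"' if enable_spot else ""


def spot_toleration(enable_spot: bool) -> str:
    """Text for the {SPOT_TOLERATION_*} placeholders; empty when Spot is disabled."""
    if not enable_spot:
        return ""
    return "\n".join(
        [
            f"- key: {SPOT_KEY}",
            "  operator: Equal",
            '  value: "true"',
            "  effect: NoSchedule",
        ]
    )


def align(fragment: str, column: int) -> str:
    """Indent every line after the first so the fragment lines up at `column`.

    The first line inherits the indentation already present before the placeholder.
    """
    return fragment.replace("\n", "\n" + " " * column)


print(align(spot_toleration(True), 12))
```

Because the template is plain text, a multi-line fragment has to be re-indented to the placeholder's column before interpolation, or the resulting YAML will not parse.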
296
+ ---
297
+
298
+ ## 5. Repository Layout & Target Files
299
+
300
+ A new standalone directory structure will be created under the workspace (mocking a new repository context):
301
+
302
+ ```
303
+ /Users/stoelinga/workspace/pathways-cli/
304
+ ├── pyproject.toml
305
+ ├── README.md
306
+ ├── pwy/
307
+ │   ├── __init__.py
308
+ │   ├── cli.py            # Entry point (Click CLI commands: up, down)
309
+ │   ├── generator.py      # Topology mapping and dictionary interpolation
310
+ │   ├── templates.py      # Text template holding the YAML manifest structure
311
+ │   └── kubernetes.py     # Subprocess module executing "kubectl apply -f" or "kubectl delete"
312
+ └── tests/
313
+     ├── __init__.py
314
+     ├── test_generator.py
315
+     └── test_cli.py
316
+ ```
317
+
318
+ ### File Implementation Details
319
+
320
+ #### `pwy/generator.py`
321
+ Contains the lookup dictionaries and mapping functions:
322
+ ```python
323
+ TPU_MAPPINGS = {
324
+     "v6e-4": {"topology": "2x2", "vms_per_slice": 1, "rm_type": "tpuv6e:2x2"},
325
+     "v6e-8": {"topology": "2x4", "vms_per_slice": 2, "rm_type": "tpuv6e:2x4"},
326
+     "v6e-16": {"topology": "4x4", "vms_per_slice": 4, "rm_type": "tpuv6e:4x4"},
327
+     "v6e-32": {"topology": "4x8", "vms_per_slice": 8, "rm_type": "tpuv6e:4x8"},
328
+     "v6e-64": {"topology": "8x8", "vms_per_slice": 16, "rm_type": "tpuv6e:8x8"},
329
+ }
330
+
331
+ def generate_yaml(
332
+     name: str,
333
+     namespace: str,
334
+     tpu_type: str,
335
+     gcs_scratch_location: str,
336
+     num_slices: int = 1,
337
+     jax_client_image: str = "python:3.12-slim",
338
+     command: str | None = None,
339
+     enable_spot: bool = False,
340
+ ) -> str:
341
+     # 1. Look up TPU type mappings
342
+     # 2. Format client container commands
343
+     # 3. Handle spot nodeSelector and tolerations formatting
344
+     # 4. Interpolate templates.YAML_TEMPLATE with final string variables
345
+     ...
346
+ ```
347
+
348
+ #### `pwy/cli.py`
349
+ Handles options parsing and commands:
350
+ * Imports `generate_yaml`.
351
+ * Runs `kubectl apply` or `kubectl delete` using Python's `subprocess.run(..., input=yaml_content.encode())`.
352
+
@@ -0,0 +1,33 @@
1
+ [project]
2
+ name = "pathways-cli"
3
+ version = "0.1.0"
4
+ description = "Pathways CLI to easily bring up pathways clusters."
5
+ readme = "README.md"
6
+ authors = [
7
+ { name = "Sam Stoelinga", email = "sammiestoel@gmail.com" }
8
+ ]
9
+ requires-python = ">=3.12"
10
+ dependencies = [
11
+ "click>=8.4.1",
12
+ "python-dotenv>=1.2.2",
13
+ ]
14
+
15
+ [project.scripts]
16
+ pwy = "pwy:main"
17
+
18
+ [build-system]
19
+ requires = ["hatchling"]
20
+ build-backend = "hatchling.build"
21
+
22
+ [tool.hatch.build.targets.wheel]
23
+ packages = ["src/pwy"]
24
+
25
+ [dependency-groups]
26
+ dev = [
27
+ "pytest>=9.0.3",
28
+ ]
29
+
30
+ [tool.pytest.ini_options]
31
+ markers = [
32
+ "e2e: end-to-end integration tests",
33
+ ]
@@ -0,0 +1,3 @@
1
+ from pwy.cli import main
2
+
3
+ __all__ = ["main"]