PyPI - gpu-dev - Versions diffs - 0.5.31__tar.gz → 0.5.32__tar.gz - Mend

Files changed (131)

{gpu_dev-0.5.31 → gpu_dev-0.5.32}/CLAUDE.md RENAMED

@@ -183,6 +183,55 @@ kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:909
 kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana
 ```
+## Multi-Region Single-State Refactor (Research Notes, May 2026)
+**Goal:** One `tf apply` manages all regions. No more `tf-all`, no double Docker builds, no double AMI bakes.
+**Approach:** Module-per-region pattern.
+```hcl
+# root main.tf
+module "us_east_2" {
+  source    = "./modules/region"
+  region    = "us-east-2"
+  gpu_types = { h100 = {...}, b200 = {...}, ... }
+  spot_types = []
+  providers = { aws = aws.us_east_2 }
+}
+module "us_east_1" {
+  source    = "./modules/region"
+  region    = "us-east-1"
+  gpu_types = { b300 = {...}, t4 = {...}, ... }
+  spot_types = ["b300", "b200", "h100", ...]
+  providers = { aws = aws.us_east_1 }
+}
+```
+**What goes in the module:** VPC, subnets, EKS cluster, ASGs, launch templates, Lambda functions, DDB tables, EFS, monitoring, DNS. Basically everything in the current root except provider config and shared resources.
+**What stays at root:** Provider blocks with aliases, ECR replication config, AMI copy (`aws_ami_copy` from primary to secondary regions), global IAM roles if any, CLI config.
+**AMI sharing:** Build baked AMI in us-east-2 (primary), `aws_ami_copy` to other regions. One build, replicated. The `ami_baker` stays in root, outputs AMI ID, each module receives it as a variable.
+**Docker sharing:** ECR replication already set up. Docker builds once in primary region, auto-replicates.
+**Migration plan (since nobody uses east1 yet):**
+1. `tofu workspace select prod-east1 && tofu destroy` — clean slate
+2. Move all resources into `modules/region/`
+3. Create provider aliases in root
+4. Import prod (us-east-2) resources into new module state: `tofu import module.us_east_2.aws_vpc.gpu_dev_vpc vpc-xxx`
+5. Add us-east-1 module — fresh create, no import needed
+6. Delete workspace: `tofu workspace delete prod-east1`
+**Risks:**
+- Import step for prod is tedious (~50+ resources) but mechanical
+- Lambda zip paths need to be relative to module, not root
+- EKS auth (aws-auth ConfigMap) is per-cluster — each module manages its own
+- CLI needs to know which region to query — already handled by config
+**Estimated effort:** 1 dedicated session (~4-6 hours). Most time on the module extraction + prod import.
+**Prerequisite for:** Adding us-west-1, us-west-2, or any future region (becomes one module block each).
 ## Recent Fixes (Oct 27, 2025)
 **NVIDIA Profiling Bootstrap Configuration (Oct 27, 2025):**
@@ -232,6 +281,9 @@ kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana
 ### 📋 Remaining Tasks
+- **Merge multi-region into single tf state** - HIGH PRIORITY. Kill prod-east1 workspace, refactor into module-per-region in one state. See research notes below. Enables: one `tf apply`, shared AMI (aws_ami_copy), shared Docker (ECR replication already set up), no double builds. Prerequisite for adding west regions.
+- **Add us-west-1 and us-west-2 spot regions** - BLOCKED on single-state refactor. After refactor, adding a region = adding one module block.
+- **Spot UX improvements** - Queue position should be #1 for each type (not cross-type FIFO). Status should show "queued (waiting for capacity)" not just "queued". Interactive picker should show spot GPU counts from east1 not prod.
 - **FQDN for devservers** - Set up proper domain names for development server access
 - **Automated SSH config per reservation** - ✅ DONE - Each reservation now gets `~/.devgpu/<reservation_id>-sshconfig` file, use with `ssh -F ~/.devgpu/<reservation_id>-sshconfig <pod_name>`
 - **Custom Docker image scaffold** - Create Dockerfile with pre-installed packages (Jupyter, etc.)
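
The migration plan above calls the prod import step tedious but mechanical. A minimal sketch of scripting it, assuming a hand-maintained mapping from resource addresses to AWS IDs. Only `aws_vpc.gpu_dev_vpc` and its `vpc-xxx` placeholder come from the notes; the helper name and the second mapping entry's ID are hypothetical.

```python
import shlex


def build_import_commands(module_name: str, resources: dict[str, str]) -> list[str]:
    """Return one `tofu import` command per resource, addressed inside the given module."""
    commands = []
    for address, resource_id in sorted(resources.items()):
        target = f"module.{module_name}.{address}"
        # shlex.quote leaves plain identifiers untouched and quotes anything unsafe.
        commands.append(f"tofu import {shlex.quote(target)} {shlex.quote(resource_id)}")
    return commands


# Placeholder mapping: the IDs are illustrative, not real resources.
resources = {
    "aws_vpc.gpu_dev_vpc": "vpc-xxx",
    "aws_eks_cluster.gpu_dev_cluster": "cluster-name-placeholder",
}
for command in build_import_commands("us_east_2", resources):
    print(command)
```

Generating the commands as text first makes it possible to review the full list before running any of them against the new state.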

{gpu_dev-0.5.31 → gpu_dev-0.5.32}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: gpu-dev
-Version: 0.5.31
+Version: 0.5.32
 Summary: CLI tool for PyTorch GPU developer server reservations
 Author: PyTorch Team
 Requires-Python: >=3.10

{gpu_dev-0.5.31 → gpu_dev-0.5.32}/cli-tools/gpu-dev-cli/gpu_dev.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: gpu-dev
-Version: 0.5.31
+Version: 0.5.32
 Summary: CLI tool for PyTorch GPU developer server reservations
 Author: PyTorch Team
 Requires-Python: >=3.10

{gpu_dev-0.5.31 → gpu_dev-0.5.32}/cli-tools/gpu-dev-cli/gpu_dev.egg-info/SOURCES.txt RENAMED

@@ -1,10 +1,6 @@
 .gitignore
 CLAUDE.md
-PROGRESS.md
-PR_DESCRIPTION.md
 README.md
-TODO.md
-post.md
 pyproject.toml
 .github/workflows/no-gitlinks.yml
 .github/workflows/publish.yml

{gpu_dev-0.5.31 → gpu_dev-0.5.32}/cli-tools/gpu-dev-cli/gpu_dev_cli/cli.py RENAMED

@@ -526,7 +526,7 @@ def main(ctx: click.Context) -> None:
     "--gpu-type",
     "-t",
     type=click.Choice(
-        ["b300", "b200", "b200-mig-1g", "b200-mig-2g", "b200-mig-3g", "h200", "h100", "h100-mig-1g", "h100-mig-2g", "h100-mig-3g", "a100", "rtxpro6000", "a10g", "t4", "l4", "t4-small", "cpu-arm", "cpu-x86"], case_sensitive=False
+        ["b300", "b200", "b200-mig-1g", "b200-mig-2g", "b200-mig-3g", "h200", "h100", "h100-mig-1g", "h100-mig-2g", "h100-mig-3g", "a100", "rtxpro6000", "a10g", "t4", "l4", "t4-small", "cpu-arm", "cpu-x86", "cpu-spot"], case_sensitive=False
     ),
     help="GPU type to reserve. Full GPUs: b200, h200, h100, a100, rtxpro6000, a10g, t4, l4, t4-small. H100 MIG slices: h100-mig-1g (10 GB), h100-mig-2g (20 GB), h100-mig-3g (40 GB). B200 MIG slices (on the mixed B200 node): b200-mig-1g (23 GB), b200-mig-2g (45 GB), b200-mig-3g (90 GB). CPU: cpu-arm, cpu-x86.",
 )
@@ -698,6 +698,7 @@ def reserve(
             "b300": {"max_gpus": 8, "instance_type": "p6-b300.48xlarge"},
             "cpu-arm": {"max_gpus": 0, "instance_type": "c7g.4xlarge"},
             "cpu-x86": {"max_gpus": 0, "instance_type": "c7i.4xlarge"},
+            "cpu-spot": {"max_gpus": 0, "instance_type": "c7i.2xlarge"},
         }
         # Early validation of GPU type to extract max_gpus (needed for disk selection)
@@ -1418,7 +1419,7 @@ def reserve(
 _SUBMIT_GPU_TYPES = ["b300", "b200", "b200-mig-1g", "b200-mig-2g", "b200-mig-3g", "h200", "h100",
                      "h100-mig-1g", "h100-mig-2g", "h100-mig-3g", "a100", "rtxpro6000",
-                     "a10g", "t4", "l4", "t4-small", "cpu-arm", "cpu-x86"]
+                     "a10g", "t4", "l4", "t4-small", "cpu-arm", "cpu-x86", "cpu-spot"]
 @main.command(context_settings={"ignore_unknown_options": True})
@@ -1837,7 +1838,7 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                                         ended = item.get("reservation_ended") or item.get("expired_at") or item.get("created_at", "")
                                         if ended and ended < one_hour_ago:
                                             continue
-                                    item["_region"] = "us-east-1"
+                                    item["_region"] = "east1"
                                     results.append(item)
                             return results
                         except Exception:
@@ -1847,11 +1848,45 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                         active_future = executor.submit(fetch_active)
                         failures_future = executor.submit(fetch_recent_failures)
                         east1_future = executor.submit(fetch_east1)
-                        reservations = active_future.result() + failures_future.result() + east1_future.result()
+                        prod_results = active_future.result() + failures_future.result()
+                        for r in prod_results:
+                            if "_region" not in r:
+                                r["_region"] = "prod"
+                        east1_results = east1_future.result()
+                        for r in east1_results:
+                            if "_region" not in r:
+                                r["_region"] = "east1"
+                        reservations = prod_results + east1_results
                 else:
-                    reservations = reservation_mgr.list_reservations(
+                    prod_res = reservation_mgr.list_reservations(
                         user_filter=user_filter, statuses_to_include=statuses_to_include
                     )
+                    for r in prod_res:
+                        if "_region" not in r:
+                            r["_region"] = "prod"
+                    east1_res = fetch_east1() if not status else []
+                    if not east1_res:
+                        try:
+                            east1_env = Config.ENVIRONMENTS.get("prod-east1", {})
+                            if east1_env and config.user_config.get("environment") == "prod":
+                                import boto3 as _b3
+                                east1_ddb = _b3.resource("dynamodb", region_name=east1_env["region"])
+                                east1_table = east1_ddb.Table("pytorch-gpu-dev-reservations")
+                                for s in (statuses_to_include or ["active", "preparing", "queued", "pending"]):
+                                    resp = east1_table.query(
+                                        IndexName="StatusIndex",
+                                        KeyConditionExpression="#s = :status",
+                                        ExpressionAttributeNames={"#s": "status"},
+                                        ExpressionAttributeValues={":status": s},
+                                    )
+                                    for item in resp.get("Items", []):
+                                        if user_filter and item.get("user_id") != user_filter:
+                                            continue
+                                        item["_region"] = "east1"
+                                        east1_res.append(item)
+                        except Exception:
+                            pass
+                    reservations = prod_res + east1_res
             except RuntimeError as e:
                 rprint(f"[red]❌ {str(e)}[/red]")
                 return False
@@ -1883,7 +1918,8 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
             # Create table with enhanced columns for queue info
             # Check if we have cross-region reservations
-            _has_east1 = any(r.get("_region") == "us-east-1" for r in reservations)
+            _regions = frozenset(r.get("_region", "") for r in reservations if r.get("_region"))
+            _has_multi_region = len(_regions) > 1 or "east1" in _regions
             table = Table(title="GPU Reservations")
             table.add_column("ID", style="cyan", no_wrap=True)
@@ -1894,7 +1930,7 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
             table.add_column("Queue Info", style="cyan")
             table.add_column("Created", style="blue")
             table.add_column("Expires/ETA", style="red")
-            if _has_east1:
+            if _has_multi_region:
                 table.add_column("Region", style="dim")
             if details:
                 table.add_column("CLI Ver", style="dim", no_wrap=True)
@@ -1935,13 +1971,12 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                         # Use the new helper that shows time + remaining
                         expires_formatted = _format_expires_with_remaining(expires_at)
                     elif res_status in ["queued", "pending"]:
-                        # Show estimated wait time if available
                         estimated_wait = reservation.get(
                             "estimated_wait_minutes", "?")
-                        if estimated_wait != "?" and estimated_wait is not None:
+                        if estimated_wait and estimated_wait not in ("?", "None", None):
                             expires_formatted = f"~{estimated_wait}min"
                         else:
-                            expires_formatted = "Calculating..."
+                            expires_formatted = "Waiting..."
                     elif res_status in ("expired", "failed", "cancelled"):
                         reason = reservation.get("failure_reason", "")
                         ended = reservation.get("reservation_ended") or reservation.get("expired_at", "")
@@ -1968,15 +2003,11 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                     # Format queue info for queued reservations
                     queue_info = ""
                     if res_status in ["queued", "pending"]:
-                        queue_position = reservation.get("queue_position", "?")
-                        estimated_wait = reservation.get(
-                            "estimated_wait_minutes", "?")
-                        if queue_position != "?" and queue_position is not None:
-                            queue_info = f"#{queue_position}"
-                            if estimated_wait != "?" and estimated_wait is not None:
-                                queue_info += f" (~{estimated_wait}min)"
+                        detail = reservation.get("current_detailed_status") or reservation.get("detailed_status") or ""
+                        if "capacity" in detail.lower() or "spot" in detail.lower():
+                            queue_info = "Waiting for spot"
                         else:
-                            queue_info = "Calculating..."
+                            queue_info = "Spot pending"
                     elif res_status == "active":
                         # Show pod IP for multinode, SSH hint for single-node
                         pod_ip = reservation.get("pod_ip", "")
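
The two hunks above change how queued reservations are rendered: the wait column now treats the string `"None"` and other falsy values as unknown, and the queue column is derived from the detailed status text instead of a queue position. A minimal standalone sketch of both rules; the function names are hypothetical, while the field names and strings are taken from the diff.

```python
def format_wait(estimated_wait) -> str:
    """Render the estimated wait, falling back to 'Waiting...' when it is unknown."""
    if estimated_wait and estimated_wait not in ("?", "None", None):
        return f"~{estimated_wait}min"
    return "Waiting..."


def queue_label(reservation: dict) -> str:
    """Derive the queue column from whichever detailed status field is populated."""
    detail = reservation.get("current_detailed_status") or reservation.get("detailed_status") or ""
    lowered = detail.lower()
    if "capacity" in lowered or "spot" in lowered:
        return "Waiting for spot"
    return "Spot pending"


print(format_wait(12))
print(format_wait("None"))
print(queue_label({"detailed_status": "Insufficient capacity"}))
```

Because the first test is plain truthiness, an estimate of `0` minutes is also shown as `Waiting...` under this rule.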
@@ -2099,9 +2130,12 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                         row_data.append(
                             f"[dim]{lambda_version_display}[/dim]" if dim_row else lambda_version_display)
-                    if _has_east1:
-                        region = reservation.get("_region", "us-east-2")
-                        row_data.append("[yellow]east1[/yellow]" if region == "us-east-1" else "prod")
+                    if _has_multi_region:
+                        region = reservation.get("_region", "prod")
+                        if region in ("us-east-1", "east1"):
+                            row_data.append("[yellow]east1[/yellow]")
+                        else:
+                            row_data.append("prod")
                     table.add_row(*row_data)
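
Taken together, the region changes in this file tag every reservation with a short `_region` label (`prod` or `east1`), show the Region column when more than one label is present or any row is `east1`, and still accept the older `us-east-1` spelling when rendering. A minimal sketch of that logic outside the CLI; the helper names are hypothetical.

```python
def tag_region(items: list[dict], region: str) -> list[dict]:
    """Set `_region` on items that do not already carry one; returns the same list."""
    for item in items:
        item.setdefault("_region", region)
    return items


def has_multi_region(reservations: list[dict]) -> bool:
    """Mirror the diff's rule: several distinct regions, or any east1 row."""
    regions = frozenset(r["_region"] for r in reservations if r.get("_region"))
    return len(regions) > 1 or "east1" in regions


def region_label(region: str) -> str:
    """Accept both spellings of the east1 region; anything else renders as prod."""
    return "east1" if region in ("us-east-1", "east1") else "prod"


prod = tag_region([{"reservation_id": "a"}], "prod")
east1 = tag_region([{"reservation_id": "b"}], "east1")
print(has_multi_region(prod + east1), [region_label(r["_region"]) for r in prod + east1])
```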
@@ -2279,8 +2313,11 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                                     queue_info = ""
                                     if res_status in ["queued", "pending"]:
-                                        queue_position = reservation.get("queue_position", "?")
-                                        queue_info = f"#{queue_position}" if queue_position != "?" else "Calculating..."
+                                        detail = reservation.get("current_detailed_status") or reservation.get("detailed_status") or ""
+                                        if "capacity" in detail.lower() or "spot" in detail.lower():
+                                            queue_info = "Waiting for spot"
+                                        else:
+                                            queue_info = "Spot pending"
                                     elif res_status == "active":
                                         queue_info = "Ready"
@@ -2313,10 +2350,10 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                                         expires_formatted = _format_expires_with_remaining(expires_at)
                                     elif res_status in ["queued", "pending"]:
                                         estimated_wait = reservation.get("estimated_wait_minutes", "?")
-                                        if estimated_wait != "?" and estimated_wait is not None:
+                                        if estimated_wait and estimated_wait not in ("?", "None", None):
                                             expires_formatted = f"~{estimated_wait}min"
                                         else:
-                                            expires_formatted = "Calculating..."
+                                            expires_formatted = "Waiting..."
                                     else:
                                         expires_formatted = "N/A"
@@ -2971,7 +3008,7 @@ def _show_availability() -> None:
                 spot_table.add_column("Avail\nNow", style="green")
                 spot_table.add_column("Per\nNode", style="bright_green")
                 spot_table.add_column("Status", style="magenta")
-                spot_table.add_column("Availability", style="dim")
+                spot_table.add_column("Spot Discount", style="dim")
                 _on_demand = {"b300": 95, "b200": 95, "h200": 55, "h100": 98, "a100": 32, "t4": 4.5, "l4": 7}
                 for gt, info in sorted(spot_region_info.items()):
                     avail = info.get("available", 0)
@@ -2981,14 +3018,12 @@ def _show_availability() -> None:
                     si = info.get("spot_info", {}) or {}
                     sp = si.get("spot_price", "") if isinstance(si, dict) else ""
                     if not sp or (isinstance(si, dict) and "No spot data" in str(si.get("spot_signal", ""))):
-                        avail_signal = "[red]Not offered[/red]"
+                        avail_signal = "[green]Available[/green]" if avail > 0 else "[dim]No price data[/dim]"
                     else:
                         try:
                             ratio = float(sp) / _on_demand.get(gt, 50)
                             pct = int((1 - ratio) * 100)
-                            if ratio < 0.4: avail_signal = f"[green]High ({pct}% off)[/green]"
-                            elif ratio < 0.7: avail_signal = f"[yellow]Medium ({pct}% off)[/yellow]"
-                            else: avail_signal = f"[red]Low ({pct}% off)[/red]"
+                            avail_signal = f"[green]{pct}% off on-demand[/green]" if pct > 0 else "[dim]At on-demand price[/dim]"
                         except (ValueError, TypeError):
                             avail_signal = "[yellow]Unknown[/yellow]"
                     spot_table.add_row(f"{gt.upper()} *", avail_display, str(per_node), status, avail_signal)
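
The renamed "Spot Discount" column reports the spot price as a percentage off an on-demand reference price rather than as a High/Medium/Low availability signal. A minimal sketch of the new rule with the Rich markup stripped; the function name is hypothetical and the prices in the usage lines are chosen only for illustration.

```python
def spot_discount_label(spot_price, on_demand_price: float, available: int = 0) -> str:
    """Return the plain-text discount label used for one spot GPU row."""
    if not spot_price:
        # No price data: fall back to whether any capacity is currently free.
        return "Available" if available > 0 else "No price data"
    try:
        ratio = float(spot_price) / on_demand_price
    except (ValueError, TypeError):
        return "Unknown"
    pct = int((1 - ratio) * 100)  # int() truncates toward zero
    return f"{pct}% off on-demand" if pct > 0 else "At on-demand price"


print(spot_discount_label("24.5", 98))
print(spot_discount_label("", 98, available=2))
print(spot_discount_label("n/a", 98))
```

Since `int()` truncates, floating-point rounding can occasionally make the displayed percentage one point lower than the exact value; `round()` would be an alternative if that matters.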

{gpu_dev-0.5.31 → gpu_dev-0.5.32}/cli-tools/gpu-dev-cli/gpu_dev_cli/config.py RENAMED

@@ -26,7 +26,7 @@ class Config:
             "region": "us-east-1",
             "workspace": "prod-east1",
             "description": "Spot-only us-east-1 environment (T4/L4/CPU)",
-            "spot_types": ["b300", "b200", "h200", "h100", "a100"],
+            "spot_types": ["b300", "b200", "h200", "h100", "a100", "t4", "l4", "rtxpro6000"],
         },
     }
     DEFAULT_ENVIRONMENT = "prod"

{gpu_dev-0.5.31 → gpu_dev-0.5.32}/cli-tools/gpu-dev-cli/gpu_dev_cli/interactive.py RENAMED

@@ -52,11 +52,19 @@ def check_interactive_support() -> bool:
 def select_gpu_type_interactive(
     availability_info: Dict[str, Dict[str, Any]],
+    _refresh: bool = False,
 ) -> Optional[str]:
     """Interactive GPU type selection with availability table"""
     if not check_interactive_support():
         return None
+    if _refresh:
+        from .reservations import ReservationManager
+        from .config import load_config
+        _cfg = load_config()
+        _mgr = ReservationManager(_cfg)
+        availability_info = _mgr.get_gpu_availability_by_type() or availability_info
     # Hide MIG slice SKUs from the top-level selector — reached via the h100 submenu.
     # Direct `--gpu-type h100-mig-1g` still works for non-interactive scripts.
     visible_info = {
@@ -194,7 +202,7 @@ def select_gpu_type_interactive(
         st.add_column("Avail\nNow", style="green")
         st.add_column("Per\nNode", style="bright_green")
         st.add_column("Status", style="magenta")
-        st.add_column("Availability", style="dim")
+        st.add_column("Spot Discount", style="dim")
         _on_demand = {"b300": 95, "b200": 95, "h200": 55, "h100": 98, "a100": 32, "t4": 4.5, "l4": 7}
         for gt, info in spot_gpus.items():
             avail = info.get("available", 0)
@@ -205,7 +213,7 @@ def select_gpu_type_interactive(
             # Availability signal from spot price vs on-demand
             sp = si.get("spot_price", "") if isinstance(si, dict) else ""
             if not sp or (isinstance(si, dict) and "No spot data" in str(si.get("spot_signal", ""))):
-                avail_signal = "[red]Not offered[/red]"
+                avail_signal = "[green]Available[/green]" if avail > 0 else "[dim]No price data[/dim]"
             else:
                 try:
                     ratio = float(sp) / _on_demand.get(gt, 50)
@@ -266,37 +274,46 @@ def select_gpu_type_interactive(
             si_data = info.get("spot_info", {}) or {}
             sp = si_data.get("spot_price", "") if isinstance(si_data, dict) else ""
             # Derive availability signal
+            avail_now = int(info.get("available", 0))
             if not sp or "No spot data" in str(si_data.get("spot_signal", "")):
-                # Not offered — skip from choices
-                continue
-            try:
-                ratio = float(sp) / _on_demand.get(gt, 50)
-                pct = int((1 - ratio) * 100)
-                if ratio < 0.4: signal = f"🟢 High avail ({pct}% off)"
-                elif ratio < 0.7: signal = f"🟡 Medium ({pct}% off)"
-                else: signal = f"🔴 Low ({pct}% off)"
-            except (ValueError, TypeError):
-                signal = "availability unknown"
+                if avail_now > 0:
+                    signal = f"🟢 {avail_now} available now"
+                else:
+                    continue
+            else:
+                try:
+                    ratio = float(sp) / _on_demand.get(gt, 50)
+                    pct = int((1 - ratio) * 100)
+                    if ratio < 0.4: signal = f"🟢 High avail ({pct}% off)"
+                    elif ratio < 0.7: signal = f"🟡 Medium ({pct}% off)"
+                    else: signal = f"🔴 Low ({pct}% off)"
+                except (ValueError, TypeError):
+                    signal = "availability unknown"
             if avail > 0:
                 label = f"✅ {gt.upper()} * ({avail} free, {pn}/node, {signal})"
             else:
                 label = f"⚡ {gt.upper()} * ({pn} GPUs/node, {signal})"
             choices.append(questionary.Choice(title=label, value=f"spot:{gt}"))
-    console.print()
+    choices.append(questionary.Separator("───"))
+    choices.append(questionary.Choice(title="🔄 Refresh availability", value="_refresh"))
+    console.print()
-    # Interactive selection
-    try:
-        answer = questionary.select(
-            "Select GPU type:", choices=choices, style=custom_style
-        ).ask()
+    # Interactive selection — loop on refresh
+    while True:
+        try:
+            answer = questionary.select(
+                "Select GPU type:", choices=choices, style=custom_style
+            ).ask()
-        return answer
-    except (KeyboardInterrupt, EOFError):
-        console.print("\n[yellow]Selection cancelled.[/yellow]")
-        return None
+            if answer == "_refresh":
+                console.print("[dim]Refreshing...[/dim]")
+                return select_gpu_type_interactive(availability_info, _refresh=True)
+            return answer
+        except (KeyboardInterrupt, EOFError):
+            console.print("\n[yellow]Selection cancelled.[/yellow]")
+            return None
 def _format_eta_seconds(delta_seconds: int) -> str:
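
The picker gains a "Refresh availability" choice. As the hunk above reads, choosing it returns a recursive call with `_refresh=True`, so the `while True` body runs at most once per call. A minimal sketch of the same behaviour written as a plain loop, with the prompt and the availability fetch injected as callables; `pick`, `fetch`, and the function name are hypothetical stand-ins for the questionary prompt and the `ReservationManager` lookup.

```python
from typing import Callable, Dict, Optional

REFRESH = "_refresh"  # sentinel value used by the refresh choice in the diff


def select_with_refresh(
    availability: Dict[str, dict],
    pick: Callable[[Dict[str, dict]], Optional[str]],
    fetch: Callable[[], Optional[Dict[str, dict]]],
) -> Optional[str]:
    """Prompt until the user picks something other than the refresh entry."""
    while True:
        answer = pick(availability)
        if answer != REFRESH:
            return answer
        # Keep the previous data if the fetch returns nothing, as the diff does.
        availability = fetch() or availability


# Scripted stand-ins for the interactive prompt and the availability lookup.
scripted = iter([REFRESH, "h100"])
choice = select_with_refresh(
    {"h100": {"available": 0}},
    pick=lambda info: next(scripted),
    fetch=lambda: {"h100": {"available": 8}},
)
print(choice)
```

A loop avoids growing the call stack on repeated refreshes; the recursive form in the diff is equivalent for the handful of refreshes a user is likely to request.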

{gpu_dev-0.5.31 → gpu_dev-0.5.32}/cli-tools/gpu-dev-cli/gpu_dev_cli/reservations.py RENAMED

@@ -826,8 +826,20 @@ class ReservationManager:
             ]
             if len(matching_reservations) == 0:
-                return None
-            elif len(matching_reservations) > 1:
+                # Not found by user_id — try direct lookup (for added users viewing other's reservations)
+                try:
+                    from boto3.dynamodb.conditions import Key
+                    scan_resp = self.reservations_table.scan(
+                        FilterExpression="begins_with(reservation_id, :rid)",
+                        ExpressionAttributeValues={":rid": reservation_id},
+                        Limit=10,
+                    )
+                    matching_reservations = scan_resp.get("Items", [])
+                except Exception:
+                    pass
+                if not matching_reservations:
+                    return None
+            if len(matching_reservations) > 1:
                 return None  # Ambiguous - need longer prefix
             reservation = matching_reservations[0]
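
The lookup above resolves a reservation by ID prefix: no match returns `None`, more than one match is treated as ambiguous and also returns `None`, and only a unique match is used. A minimal in-memory sketch of that resolution rule, leaving DynamoDB out entirely; the function name and sample IDs are hypothetical.

```python
from typing import Optional


def resolve_by_prefix(items: list[dict], prefix: str) -> Optional[dict]:
    """Return the single item whose reservation_id starts with prefix, else None."""
    matches = [i for i in items if str(i.get("reservation_id", "")).startswith(prefix)]
    if len(matches) != 1:
        return None  # not found, or ambiguous and needing a longer prefix
    return matches[0]


items = [
    {"reservation_id": "abc12345"},
    {"reservation_id": "abc98765"},
    {"reservation_id": "def00000"},
]
print(resolve_by_prefix(items, "def"))
print(resolve_by_prefix(items, "abc"))
```

One point to keep in mind for the fallback in the diff: in DynamoDB, `Limit` on a scan caps the number of items evaluated before the filter expression is applied, so a scan with `Limit=10` and no pagination can come back empty even when a matching item exists further into the table.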

{gpu_dev-0.5.31 → gpu_dev-0.5.32}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "gpu-dev"
-version = "0.5.31"
+version = "0.5.32"
 description = "CLI tool for PyTorch GPU developer server reservations"
 authors = [{name = "PyTorch Team"}]
 readme = "cli-tools/gpu-dev-cli/README.md"

{gpu_dev-0.5.31 → gpu_dev-0.5.32}/terraform-gpu-devservers/ami-baker.tf RENAMED

@@ -11,6 +11,7 @@ locals {
   ami_baker_trigger = sha256(join("\n", [
     data.aws_ami.eks_gpu_ami_x86_64.id,
     filesha256("${path.module}/templates/al2023-user-data.sh"),
+    filesha256("${path.module}/templates/ami-baker-user-data.sh"),
     local.latest_image_uri,
   ]))
   ami_baker_name = "gpu-dev-baked-${substr(local.ami_baker_trigger, 0, 8)}"
@@ -19,11 +20,11 @@ locals {
     image_uri = local.latest_image_uri
   }))
-  # Use baked AMI when available, fall back to standard.
-  gpu_ami_id = length(data.aws_ami_ids.gpu_baked.ids) > 0 ? data.aws_ami_ids.gpu_baked.ids[0] : data.aws_ami.eks_gpu_ami_x86_64.id
+  # Use baked AMI when available (checked AFTER baker runs), fall back to standard.
+  gpu_ami_id = length(data.aws_ami_ids.gpu_baked_resolved.ids) > 0 ? data.aws_ami_ids.gpu_baked_resolved.ids[0] : data.aws_ami.eks_gpu_ami_x86_64.id
 }
-# Look up existing baked AMI — uses aws_ami_ids which returns [] instead of erroring
+# Pre-build check: does the baked AMI already exist? Controls whether baker runs.
 data "aws_ami_ids" "gpu_baked" {
   owners = ["self"]
@@ -39,6 +40,24 @@ data "aws_ami_ids" "gpu_baked" {
   sort_ascending = false
 }
+# Post-build lookup: re-reads AFTER the baker finishes, so a freshly built AMI
+# is picked up in the same apply (no second apply needed).
+data "aws_ami_ids" "gpu_baked_resolved" {
+  depends_on = [null_resource.ami_baker]
+  owners     = ["self"]
+  filter {
+    name   = "name"
+    values = [local.ami_baker_name]
+  }
+  filter {
+    name   = "state"
+    values = ["available"]
+  }
+  sort_ascending = false
+}
 # Build the baked AMI when inputs change
 resource "null_resource" "ami_baker" {
   # Only run when the target AMI doesn't exist yet
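
The baked AMI name is content-addressed: `ami_baker_trigger` is a SHA-256 over the newline-joined inputs, and the first 8 hex characters become the name suffix, so adding `ami-baker-user-data.sh` to the inputs means an edit to that script now produces a new name and a rebuild. A minimal Python sketch of the same derivation; the input values are placeholders, not real AMI IDs or file hashes.

```python
import hashlib


def ami_baker_name(inputs: list[str], prefix: str = "gpu-dev-baked-") -> str:
    """Mirror sha256(join("\\n", inputs)) followed by substr(trigger, 0, 8)."""
    trigger = hashlib.sha256("\n".join(inputs).encode("utf-8")).hexdigest()
    return f"{prefix}{trigger[:8]}"


# Placeholder inputs: base AMI ID, two template hashes, image URI.
before = ami_baker_name(["ami-placeholder", "hash-user-data", "hash-baker-v1", "repo/image:tag"])
after = ami_baker_name(["ami-placeholder", "hash-user-data", "hash-baker-v2", "repo/image:tag"])
print(before, after, before != after)
```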

{gpu_dev-0.5.31 → gpu_dev-0.5.32}/terraform-gpu-devservers/availability.tf RENAMED

@@ -48,7 +48,7 @@ resource "aws_lambda_function" "availability_updater" {
       EKS_CLUSTER_NAME    = aws_eks_cluster.gpu_dev_cluster.name
       REGION              = local.current_config.aws_region
       SPOT_GPU_TYPES      = lookup({
-        "prod-east1" = "b300,b200,h200,h100,a100"
+        "prod-east1" = "b300,b200,h200,h100,a100,t4,l4,rtxpro6000,cpu-spot"
       }, terraform.workspace, "")
       ASG_NAME_PREFIX     = "${var.prefix}-gpu-nodes"
     }
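
`SPOT_GPU_TYPES` is a comma-separated string selected by workspace, with an empty string for any workspace not in the map. A minimal sketch of the lookup and of one way a consumer might parse it; how the Lambda actually parses the variable is not shown in this diff, so the parsing and the function name are assumptions. The CLI list is copied from the `config.py` hunk for comparison.

```python
# Values copied from the hunks in this diff.
TERRAFORM_SPOT_TYPES = {"prod-east1": "b300,b200,h200,h100,a100,t4,l4,rtxpro6000,cpu-spot"}
CLI_SPOT_TYPES = ["b300", "b200", "h200", "h100", "a100", "t4", "l4", "rtxpro6000"]


def spot_gpu_types_for(workspace: str) -> list[str]:
    """Emulate lookup(map, workspace, "") and split the result into a clean list."""
    raw = TERRAFORM_SPOT_TYPES.get(workspace, "")
    return [t.strip() for t in raw.split(",") if t.strip()]


print(spot_gpu_types_for("prod"))
print(sorted(set(spot_gpu_types_for("prod-east1")) - set(CLI_SPOT_TYPES)))
```

Comparing the two lists as they appear in this diff shows `cpu-spot` only on the Terraform side; whether that difference is intentional is not stated.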

{gpu_dev-0.5.31 → gpu_dev-0.5.32}/terraform-gpu-devservers/docker/Dockerfile RENAMED

@@ -1,6 +1,6 @@
 # Custom PyTorch GPU Development Server Image
-# Based on pytorch/pytorch:2.11.0-cuda12.8-cudnn9-devel
-FROM pytorch/pytorch:2.11.0-cuda12.8-cudnn9-devel
+# Based on pytorch/pytorch:2.12.0-cuda13.2-cudnn9-devel
+FROM pytorch/pytorch:2.12.0-cuda13.2-cudnn9-devel
 # Set environment variables for non-interactive installation
 ENV DEBIAN_FRONTEND=noninteractive
@@ -42,22 +42,22 @@ RUN for attempt in 1 2 3; do \
 RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - && \
     apt-get install -y nodejs
-# Install CUDA 12.9, 13.0, 13.1, 13.2 alongside base CUDA 12.8
+# Install older CUDA toolkits alongside base CUDA 13.2
 # Base image already has NVIDIA repo configured, no need for cuda-keyring
 RUN apt-get update && apt-get install -y --no-install-recommends \
+        cuda-toolkit-12-8 \
         cuda-toolkit-12-9 \
         cuda-toolkit-13-0 \
         cuda-toolkit-13-1 \
-        cuda-toolkit-13-2 \
     && apt-get clean \
     && rm -rf /var/lib/apt/lists/*
-# CUDA 12.8 is the default (PyTorch compiled against it)
+# CUDA 13.2 is the default (PyTorch 2.12 compiled against it)
 # All versions available at /usr/local/cuda-{12.8,12.9,13.0,13.1,13.2}/
-# Switch with: export CUDA_HOME=/usr/local/cuda-13.2
-ENV CUDA_HOME=/usr/local/cuda-12.8
-ENV PATH=/usr/local/cuda-12.8/bin:${PATH}
-ENV LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:${LD_LIBRARY_PATH}
+# Switch with: export CUDA_HOME=/usr/local/cuda-12.8
+ENV CUDA_HOME=/usr/local/cuda-13.2
+ENV PATH=/usr/local/cuda-13.2/bin:${PATH}
+ENV LD_LIBRARY_PATH=/usr/local/cuda-13.2/lib64:${LD_LIBRARY_PATH}
 # Install EFA stack (prebuilt libfabric + OpenMPI + aws-ofi-nccl with GPU/RDMA support)
 # Uses AWS EFA installer which bundles tested, compatible versions of all components
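
The image now defaults to CUDA 13.2 and keeps the older toolkits under `/usr/local/cuda-{version}/`, switched by pointing `CUDA_HOME` at another directory. A minimal sketch of building an environment for a subprocess that should see a different toolkit, assuming the same three variables the Dockerfile sets; the function name is hypothetical.

```python
import os
from typing import Optional


def cuda_env(version: str, base_env: Optional[dict] = None) -> dict:
    """Return a copy of the environment with CUDA_HOME, PATH and LD_LIBRARY_PATH switched."""
    env = dict(os.environ if base_env is None else base_env)
    home = f"/usr/local/cuda-{version}"
    env["CUDA_HOME"] = home
    # Prepend the toolkit directories; rstrip avoids a trailing ':' when the variable was unset.
    env["PATH"] = f"{home}/bin:{env.get('PATH', '')}".rstrip(":")
    env["LD_LIBRARY_PATH"] = f"{home}/lib64:{env.get('LD_LIBRARY_PATH', '')}".rstrip(":")
    return env


env = cuda_env("12.8", {"PATH": "/usr/bin"})
print(env["CUDA_HOME"], env["PATH"], env["LD_LIBRARY_PATH"])
```

The returned dictionary can be passed as `env=` to `subprocess.run` so that only the child process sees the alternate toolkit.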
