PyPI - gpu-dev - Versions diffs - 0.5.31__tar.gz → 0.6.0__tar.gz - Mend

gpu-dev 0.5.31__tar.gz → 0.6.0__tar.gz


Files changed (131)

{gpu_dev-0.5.31 → gpu_dev-0.6.0}/CLAUDE.md RENAMED

@@ -183,6 +183,55 @@ kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:909
 kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana
 ```
+## Multi-Region Single-State Refactor (Research Notes, May 2026)
+**Goal:** One `tf apply` manages all regions. No more `tf-all`, no double Docker builds, no double AMI bakes.
+**Approach:** Module-per-region pattern.
+```hcl
+# root main.tf
+module "us_east_2" {
+  source    = "./modules/region"
+  region    = "us-east-2"
+  gpu_types = { h100 = {...}, b200 = {...}, ... }
+  spot_types = []
+  providers = { aws = aws.us_east_2 }
+}
+module "us_east_1" {
+  source    = "./modules/region"
+  region    = "us-east-1"
+  gpu_types = { b300 = {...}, t4 = {...}, ... }
+  spot_types = ["b300", "b200", "h100", ...]
+  providers = { aws = aws.us_east_1 }
+}
+```
+**What goes in the module:** VPC, subnets, EKS cluster, ASGs, launch templates, Lambda functions, DDB tables, EFS, monitoring, DNS. Basically everything in the current root except provider config and shared resources.
+**What stays at root:** Provider blocks with aliases, ECR replication config, AMI copy (`aws_ami_copy` from primary to secondary regions), global IAM roles if any, CLI config.
+**AMI sharing:** Build baked AMI in us-east-2 (primary), `aws_ami_copy` to other regions. One build, replicated. The `ami_baker` stays in root, outputs AMI ID, each module receives it as a variable.
+**Docker sharing:** ECR replication already set up. Docker builds once in primary region, auto-replicates.
+**Migration plan (since nobody uses east1 yet):**
+1. `tofu workspace select prod-east1 && tofu destroy` — clean slate
+2. Move all resources into `modules/region/`
+3. Create provider aliases in root
+4. Import prod (us-east-2) resources into new module state: `tofu import module.us_east_2.aws_vpc.gpu_dev_vpc vpc-xxx`
+5. Add us-east-1 module — fresh create, no import needed
+6. Delete workspace: `tofu workspace delete prod-east1`
+**Risks:**
+- Import step for prod is tedious (~50+ resources) but mechanical
+- Lambda zip paths need to be relative to module, not root
+- EKS auth (aws-auth ConfigMap) is per-cluster — each module manages its own
+- CLI needs to know which region to query — already handled by config
+**Estimated effort:** 1 dedicated session (~4-6 hours). Most time on the module extraction + prod import.
+**Prerequisite for:** Adding us-west-1, us-west-2, or any future region (becomes one module block each).
 ## Recent Fixes (Oct 27, 2025)
 **NVIDIA Profiling Bootstrap Configuration (Oct 27, 2025):**
@@ -232,6 +281,9 @@ kubectl get pods -n monitoring -l app.kubernetes.io/name=grafana
 ### 📋 Remaining Tasks
+- **Merge multi-region into single tf state** - HIGH PRIORITY. Kill prod-east1 workspace, refactor into module-per-region in one state. See research notes below. Enables: one `tf apply`, shared AMI (aws_ami_copy), shared Docker (ECR replication already set up), no double builds. Prerequisite for adding west regions.
+- **Add us-west-1 and us-west-2 spot regions** - BLOCKED on single-state refactor. After refactor, adding a region = adding one module block.
+- **Spot UX improvements** - Queue position should be #1 for each type (not cross-type FIFO). Status should show "queued (waiting for capacity)" not just "queued". Interactive picker should show spot GPU counts from east1 not prod.
 - **FQDN for devservers** - Set up proper domain names for development server access
 - **Automated SSH config per reservation** - ✅ DONE - Each reservation now gets `~/.devgpu/<reservation_id>-sshconfig` file, use with `ssh -F ~/.devgpu/<reservation_id>-sshconfig <pod_name>`
 - **Custom Docker image scaffold** - Create Dockerfile with pre-installed packages (Jupyter, etc.)
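
The migration plan in the CLAUDE.md notes calls the prod import step tedious but mechanical. A minimal sketch of how those `tofu import` commands could be generated is shown here. The helper name `build_import_commands` and the address-to-ID mapping are hypothetical; only the command shape `tofu import module.us_east_2.aws_vpc.gpu_dev_vpc vpc-xxx` comes from the notes.

```python
import shlex


def build_import_commands(module_name, resources):
    """Return one `tofu import` command per (resource address, cloud ID) pair.

    Each address is prefixed with `module.<module_name>.` so the resource
    lands in the per-region module rather than in the root state.
    """
    commands = []
    for address, resource_id in resources.items():
        module_address = f"module.{module_name}.{address}"
        commands.append(shlex.join(["tofu", "import", module_address, resource_id]))
    return commands


# Hypothetical mapping: "vpc-xxx" is the placeholder from the notes, not a real ID.
for command in build_import_commands("us_east_2", {"aws_vpc.gpu_dev_vpc": "vpc-xxx"}):
    print(command)
```

Printing the commands instead of running them leaves room to review the roughly 50 addresses before any state is touched.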

{gpu_dev-0.5.31 → gpu_dev-0.6.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: gpu-dev
-Version: 0.5.31
+Version: 0.6.0
 Summary: CLI tool for PyTorch GPU developer server reservations
 Author: PyTorch Team
 Requires-Python: >=3.10

{gpu_dev-0.5.31 → gpu_dev-0.6.0}/cli-tools/gpu-dev-cli/gpu_dev.egg-info/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: gpu-dev
-Version: 0.5.31
+Version: 0.6.0
 Summary: CLI tool for PyTorch GPU developer server reservations
 Author: PyTorch Team
 Requires-Python: >=3.10

{gpu_dev-0.5.31 → gpu_dev-0.6.0}/cli-tools/gpu-dev-cli/gpu_dev.egg-info/SOURCES.txt RENAMED

@@ -1,10 +1,6 @@
 .gitignore
 CLAUDE.md
-PROGRESS.md
-PR_DESCRIPTION.md
 README.md
-TODO.md
-post.md
 pyproject.toml
 .github/workflows/no-gitlinks.yml
 .github/workflows/publish.yml

{gpu_dev-0.5.31 → gpu_dev-0.6.0}/cli-tools/gpu-dev-cli/gpu_dev_cli/cli.py RENAMED

@@ -526,7 +526,7 @@ def main(ctx: click.Context) -> None:
     "--gpu-type",
     "-t",
     type=click.Choice(
-        ["b300", "b200", "b200-mig-1g", "b200-mig-2g", "b200-mig-3g", "h200", "h100", "h100-mig-1g", "h100-mig-2g", "h100-mig-3g", "a100", "rtxpro6000", "a10g", "t4", "l4", "t4-small", "cpu-arm", "cpu-x86"], case_sensitive=False
+        ["b300", "b200", "b200-mig-1g", "b200-mig-2g", "b200-mig-3g", "h200", "h100", "h100-mig-1g", "h100-mig-2g", "h100-mig-3g", "a100", "rtxpro6000", "a10g", "t4", "l4", "t4-small", "cpu-arm", "cpu-x86", "cpu-spot"], case_sensitive=False
     ),
     help="GPU type to reserve. Full GPUs: b200, h200, h100, a100, rtxpro6000, a10g, t4, l4, t4-small. H100 MIG slices: h100-mig-1g (10 GB), h100-mig-2g (20 GB), h100-mig-3g (40 GB). B200 MIG slices (on the mixed B200 node): b200-mig-1g (23 GB), b200-mig-2g (45 GB), b200-mig-3g (90 GB). CPU: cpu-arm, cpu-x86.",
 )
@@ -698,6 +698,7 @@ def reserve(
             "b300": {"max_gpus": 8, "instance_type": "p6-b300.48xlarge"},
             "cpu-arm": {"max_gpus": 0, "instance_type": "c7g.4xlarge"},
             "cpu-x86": {"max_gpus": 0, "instance_type": "c7i.4xlarge"},
+            "cpu-spot": {"max_gpus": 0, "instance_type": "c7i.2xlarge"},
         }
         # Early validation of GPU type to extract max_gpus (needed for disk selection)
@@ -896,6 +897,13 @@ def reserve(
         else:
             # Non-interactive mode - use defaults and validate
+            # Route --spot to east1 when on prod (env vars override config region)
+            if spot and load_config().user_config.get("environment") == "prod":
+                east1_cfg = Config.ENVIRONMENTS.get("prod-east1", {})
+                if east1_cfg:
+                    import os as _os
+                    _os.environ["AWS_REGION"] = east1_cfg["region"]
             if gpu_type is None:
                 gpu_type = "a100"
             if hours is None:
@@ -1418,7 +1426,7 @@ def reserve(
 _SUBMIT_GPU_TYPES = ["b300", "b200", "b200-mig-1g", "b200-mig-2g", "b200-mig-3g", "h200", "h100",
                      "h100-mig-1g", "h100-mig-2g", "h100-mig-3g", "a100", "rtxpro6000",
-                     "a10g", "t4", "l4", "t4-small", "cpu-arm", "cpu-x86"]
+                     "a10g", "t4", "l4", "t4-small", "cpu-arm", "cpu-x86", "cpu-spot"]
 @main.command(context_settings={"ignore_unknown_options": True})
@@ -1837,7 +1845,7 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                                         ended = item.get("reservation_ended") or item.get("expired_at") or item.get("created_at", "")
                                         if ended and ended < one_hour_ago:
                                             continue
-                                    item["_region"] = "us-east-1"
+                                    item["_region"] = "east1"
                                     results.append(item)
                             return results
                         except Exception:
@@ -1847,11 +1855,45 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                         active_future = executor.submit(fetch_active)
                         failures_future = executor.submit(fetch_recent_failures)
                         east1_future = executor.submit(fetch_east1)
-                        reservations = active_future.result() + failures_future.result() + east1_future.result()
+                        prod_results = active_future.result() + failures_future.result()
+                        for r in prod_results:
+                            if "_region" not in r:
+                                r["_region"] = "prod"
+                        east1_results = east1_future.result()
+                        for r in east1_results:
+                            if "_region" not in r:
+                                r["_region"] = "east1"
+                        reservations = prod_results + east1_results
                 else:
-                    reservations = reservation_mgr.list_reservations(
+                    prod_res = reservation_mgr.list_reservations(
                         user_filter=user_filter, statuses_to_include=statuses_to_include
                     )
+                    for r in prod_res:
+                        if "_region" not in r:
+                            r["_region"] = "prod"
+                    east1_res = fetch_east1() if not status else []
+                    if not east1_res:
+                        try:
+                            east1_env = Config.ENVIRONMENTS.get("prod-east1", {})
+                            if east1_env and config.user_config.get("environment") == "prod":
+                                import boto3 as _b3
+                                east1_ddb = _b3.resource("dynamodb", region_name=east1_env["region"])
+                                east1_table = east1_ddb.Table("pytorch-gpu-dev-reservations")
+                                for s in (statuses_to_include or ["active", "preparing", "queued", "pending"]):
+                                    resp = east1_table.query(
+                                        IndexName="StatusIndex",
+                                        KeyConditionExpression="#s = :status",
+                                        ExpressionAttributeNames={"#s": "status"},
+                                        ExpressionAttributeValues={":status": s},
+                                    )
+                                    for item in resp.get("Items", []):
+                                        if user_filter and item.get("user_id") != user_filter:
+                                            continue
+                                        item["_region"] = "east1"
+                                        east1_res.append(item)
+                        except Exception:
+                            pass
+                    reservations = prod_res + east1_res
             except RuntimeError as e:
                 rprint(f"[red]❌ {str(e)}[/red]")
                 return False
@@ -1883,7 +1925,8 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
             # Create table with enhanced columns for queue info
             # Check if we have cross-region reservations
-            _has_east1 = any(r.get("_region") == "us-east-1" for r in reservations)
+            _regions = frozenset(r.get("_region", "") for r in reservations if r.get("_region"))
+            _has_multi_region = len(_regions) > 1 or "east1" in _regions
             table = Table(title="GPU Reservations")
             table.add_column("ID", style="cyan", no_wrap=True)
@@ -1894,7 +1937,7 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
             table.add_column("Queue Info", style="cyan")
             table.add_column("Created", style="blue")
             table.add_column("Expires/ETA", style="red")
-            if _has_east1:
+            if _has_multi_region:
                 table.add_column("Region", style="dim")
             if details:
                 table.add_column("CLI Ver", style="dim", no_wrap=True)
@@ -1935,13 +1978,12 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                         # Use the new helper that shows time + remaining
                         expires_formatted = _format_expires_with_remaining(expires_at)
                     elif res_status in ["queued", "pending"]:
-                        # Show estimated wait time if available
                         estimated_wait = reservation.get(
                             "estimated_wait_minutes", "?")
-                        if estimated_wait != "?" and estimated_wait is not None:
+                        if estimated_wait and estimated_wait not in ("?", "None", None):
                             expires_formatted = f"~{estimated_wait}min"
                         else:
-                            expires_formatted = "Calculating..."
+                            expires_formatted = "Waiting..."
                     elif res_status in ("expired", "failed", "cancelled"):
                         reason = reservation.get("failure_reason", "")
                         ended = reservation.get("reservation_ended") or reservation.get("expired_at", "")
@@ -1968,15 +2010,11 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                     # Format queue info for queued reservations
                     queue_info = ""
                     if res_status in ["queued", "pending"]:
-                        queue_position = reservation.get("queue_position", "?")
-                        estimated_wait = reservation.get(
-                            "estimated_wait_minutes", "?")
-                        if queue_position != "?" and queue_position is not None:
-                            queue_info = f"#{queue_position}"
-                            if estimated_wait != "?" and estimated_wait is not None:
-                                queue_info += f" (~{estimated_wait}min)"
+                        detail = reservation.get("current_detailed_status") or reservation.get("detailed_status") or ""
+                        if "capacity" in detail.lower() or "spot" in detail.lower():
+                            queue_info = "Waiting for spot"
                         else:
-                            queue_info = "Calculating..."
+                            queue_info = "Spot pending"
                     elif res_status == "active":
                         # Show pod IP for multinode, SSH hint for single-node
                         pod_ip = reservation.get("pod_ip", "")
@@ -2099,9 +2137,12 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                         row_data.append(
                             f"[dim]{lambda_version_display}[/dim]" if dim_row else lambda_version_display)
-                    if _has_east1:
-                        region = reservation.get("_region", "us-east-2")
-                        row_data.append("[yellow]east1[/yellow]" if region == "us-east-1" else "prod")
+                    if _has_multi_region:
+                        region = reservation.get("_region", "prod")
+                        if region in ("us-east-1", "east1"):
+                            row_data.append("[yellow]east1[/yellow]")
+                        else:
+                            row_data.append("prod")
                     table.add_row(*row_data)
@@ -2279,8 +2320,11 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                                     queue_info = ""
                                     if res_status in ["queued", "pending"]:
-                                        queue_position = reservation.get("queue_position", "?")
-                                        queue_info = f"#{queue_position}" if queue_position != "?" else "Calculating..."
+                                        detail = reservation.get("current_detailed_status") or reservation.get("detailed_status") or ""
+                                        if "capacity" in detail.lower() or "spot" in detail.lower():
+                                            queue_info = "Waiting for spot"
+                                        else:
+                                            queue_info = "Spot pending"
                                     elif res_status == "active":
                                         queue_info = "Ready"
@@ -2313,10 +2357,10 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                                         expires_formatted = _format_expires_with_remaining(expires_at)
                                     elif res_status in ["queued", "pending"]:
                                         estimated_wait = reservation.get("estimated_wait_minutes", "?")
-                                        if estimated_wait != "?" and estimated_wait is not None:
+                                        if estimated_wait and estimated_wait not in ("?", "None", None):
                                             expires_formatted = f"~{estimated_wait}min"
                                         else:
-                                            expires_formatted = "Calculating..."
+                                            expires_formatted = "Waiting..."
                                     else:
                                         expires_formatted = "N/A"
@@ -2531,10 +2575,21 @@ def cancel(
             with Live(
                 Spinner("dots", text="📡 Cancelling reservations..."), console=console
             ) as live:
+                # Build east1 reservation manager for cross-region cancellations
+                east1_mgr = None
+                east1_env = Config.ENVIRONMENTS.get("prod-east1", {})
+                if east1_env:
+                    import os as _os
+                    _east1_config = Config()
+                    _east1_config.aws_region = east1_env["region"]
+                    east1_mgr = ReservationManager(_east1_config)
                 for reservation in reservations:
                     res_id = reservation.get("reservation_id", "")
                     if res_id:
-                        success = reservation_mgr.cancel_reservation(
+                        # Use east1 manager for east1 reservations
+                        mgr = east1_mgr if reservation.get("_region") in ("east1", "us-east-1") and east1_mgr else reservation_mgr
+                        success = mgr.cancel_reservation(
                             res_id, user_info["user_id"]
                         )
                         if success:
@@ -2971,7 +3026,7 @@ def _show_availability() -> None:
                 spot_table.add_column("Avail\nNow", style="green")
                 spot_table.add_column("Per\nNode", style="bright_green")
                 spot_table.add_column("Status", style="magenta")
-                spot_table.add_column("Availability", style="dim")
+                spot_table.add_column("Spot Discount", style="dim")
                 _on_demand = {"b300": 95, "b200": 95, "h200": 55, "h100": 98, "a100": 32, "t4": 4.5, "l4": 7}
                 for gt, info in sorted(spot_region_info.items()):
                     avail = info.get("available", 0)
@@ -2981,14 +3036,12 @@ def _show_availability() -> None:
                     si = info.get("spot_info", {}) or {}
                     sp = si.get("spot_price", "") if isinstance(si, dict) else ""
                     if not sp or (isinstance(si, dict) and "No spot data" in str(si.get("spot_signal", ""))):
-                        avail_signal = "[red]Not offered[/red]"
+                        avail_signal = "[green]Available[/green]" if avail > 0 else "[dim]No price data[/dim]"
                     else:
                         try:
                             ratio = float(sp) / _on_demand.get(gt, 50)
                             pct = int((1 - ratio) * 100)
-                            if ratio < 0.4: avail_signal = f"[green]High ({pct}% off)[/green]"
-                            elif ratio < 0.7: avail_signal = f"[yellow]Medium ({pct}% off)[/yellow]"
-                            else: avail_signal = f"[red]Low ({pct}% off)[/red]"
+                            avail_signal = f"[green]{pct}% off on-demand[/green]" if pct > 0 else "[dim]At on-demand price[/dim]"
                         except (ValueError, TypeError):
                             avail_signal = "[yellow]Unknown[/yellow]"
                     spot_table.add_row(f"{gt.upper()} *", avail_display, str(per_node), status, avail_signal)
@@ -3266,21 +3319,30 @@ def connect(ctx: click.Context, reservation_id: Optional[str]) -> None:
                 live.start()
-            # If the selected reservation is from east1, switch to east1 reservation_mgr
-            _sel = next((r for r in (locals().get("reservations") or []) if r.get("reservation_id", "").startswith(reservation_id)), None)
-            if _sel and _sel.get("_region") == "us-east-1":
-                import os as _os
-                east1_cfg = Config.ENVIRONMENTS.get("prod-east1", {})
-                _os.environ["AWS_DEFAULT_REGION"] = east1_cfg["region"]
-                _east1_config = Config()
-                _east1_config.aws_region = east1_cfg["region"]
-                reservation_mgr = ReservationManager(_east1_config)
-            # Get connection info
+            # Try current region first, then cross-region if not found
             connection_info = reservation_mgr.get_connection_info(
                 reservation_id, user_info["user_id"]
             )
+            # If not found, try the other region
+            if not connection_info:
+                import os as _os
+                current_env = config.user_config.get("environment", "prod")
+                other_envs = {"prod": "prod-east1", "prod-east1": "prod"}
+                other_env_name = other_envs.get(current_env)
+                if other_env_name:
+                    other_env = Config.ENVIRONMENTS.get(other_env_name, {})
+                    if other_env:
+                        _os.environ["AWS_DEFAULT_REGION"] = other_env["region"]
+                        _other_config = Config()
+                        _other_config.aws_region = other_env["region"]
+                        other_mgr = ReservationManager(_other_config)
+                        connection_info = other_mgr.get_connection_info(
+                            reservation_id, user_info["user_id"]
+                        )
+                        if connection_info:
+                            reservation_mgr = other_mgr
         live.stop()
         if not connection_info:
@@ -3829,7 +3891,7 @@ def set(key: str, value: str) -> None:
 @config.command()
-@click.argument("env_name", type=click.Choice(["test", "prod", "prod-east1"]))
+@click.argument("env_name", type=click.Choice(["test", "prod"]))
 def environment(env_name: str) -> None:
     """Set the environment
@@ -3841,7 +3903,7 @@ def environment(env_name: str) -> None:
     \b
     Examples:
         gpu-dev config environment prod        # Production (us-east-2)
-        gpu-dev config environment prod-east1  # Spot-only us-east-1
+        gpu-dev config environment prod         # Production (spot accessible via interactive picker)
         gpu-dev config environment test        # Test (us-west-1)
     Environment configurations:
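
The `_show_availability` hunk above replaces the High/Medium/Low tiers with a plain discount figure in the renamed "Spot Discount" column. The following is a simplified sketch of that calculation, not the shipped code: `spot_discount_label` is a hypothetical name, the Rich colour markup is dropped, the "No spot data" signal check is omitted, and the on-demand price is assumed to be non-zero.

```python
def spot_discount_label(spot_price, on_demand_price, available):
    """Return the text for the "Spot Discount" cell, without Rich markup."""
    if not spot_price:
        # No price reported: fall back to whether any capacity is free right now.
        return "Available" if available > 0 else "No price data"
    try:
        ratio = float(spot_price) / on_demand_price
    except (ValueError, TypeError):
        return "Unknown"
    pct = int((1 - ratio) * 100)  # truncates toward zero, as in the diff
    return f"{pct}% off on-demand" if pct > 0 else "At on-demand price"


print(spot_discount_label("47.5", 95, 0))  # spot price at half the on-demand price
print(spot_discount_label("", 95, 2))      # no price data, but capacity is free
```

Because `int()` truncates, a spot price slightly above on-demand also falls into the "At on-demand price" branch.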

{gpu_dev-0.5.31 → gpu_dev-0.6.0}/cli-tools/gpu-dev-cli/gpu_dev_cli/config.py RENAMED

@@ -26,7 +26,7 @@ class Config:
             "region": "us-east-1",
             "workspace": "prod-east1",
             "description": "Spot-only us-east-1 environment (T4/L4/CPU)",
-            "spot_types": ["b300", "b200", "h200", "h100", "a100"],
+            "spot_types": ["b300", "b200", "h200", "h100", "a100", "t4", "l4", "rtxpro6000"],
         },
     }
     DEFAULT_ENVIRONMENT = "prod"
@@ -42,13 +42,14 @@ class Config:
         # Load unified config (handles migration from legacy files)
         self.user_config = self._load_config()
-        # Get region from config, then AWS env vars, or default
-        if self.user_config.get("region"):
+        # Get region: env vars take priority (for spot routing), then config, then default
+        env_region = os.getenv("AWS_REGION") or os.getenv("AWS_DEFAULT_REGION")
+        if env_region and env_region != self.user_config.get("region"):
+            self.aws_region = env_region
+        elif self.user_config.get("region"):
             self.aws_region = self.user_config["region"]
         else:
-            self.aws_region = os.getenv(
-                "AWS_REGION", os.getenv("AWS_DEFAULT_REGION", "us-east-2")
-            )
+            self.aws_region = "us-east-2"
         os.environ["AWS_DEFAULT_REGION"] = self.aws_region
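
The hunk above inverts the region precedence: an `AWS_REGION` or `AWS_DEFAULT_REGION` value that differs from the configured region now wins, which is what lets `--spot` reroute a prod user to us-east-1. The sketch below restates that ordering as a pure function so it can be read in isolation. `resolve_region` is a hypothetical helper; the real logic lives inline in `Config.__init__` and reads `os.environ` directly.

```python
def resolve_region(environ, user_config, default="us-east-2"):
    """Return the region using the 0.6.0 ordering: env override, config, default."""
    env_region = environ.get("AWS_REGION") or environ.get("AWS_DEFAULT_REGION")
    if env_region and env_region != user_config.get("region"):
        return env_region
    if user_config.get("region"):
        return user_config["region"]
    return default


# An environment override beats the configured region.
print(resolve_region({"AWS_REGION": "us-east-1"}, {"region": "us-east-2"}))
# With nothing set, the default applies.
print(resolve_region({}, {}))
```

In 0.5.31 the configured region was checked first, so the same override would have been ignored whenever a region was set in the user config.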

{gpu_dev-0.5.31 → gpu_dev-0.6.0}/cli-tools/gpu-dev-cli/gpu_dev_cli/disks.py RENAMED

@@ -355,8 +355,21 @@ def unlock_disk(disk_name: str, user_id: str, config: Config) -> bool:
         return False
     if not disk['in_use']:
-        print(f"Disk '{disk_name}' is not locked")
-        return False
+        # DDB says not locked — but check if EBS volume is still physically attached
+        try:
+            ec2 = config.session.client('ec2', region_name=config.aws_region)
+            vols = ec2.describe_volumes(Filters=[
+                {"Name": "tag:gpu-dev-user", "Values": [user_id]},
+                {"Name": "tag:disk_name", "Values": [disk_name]},
+                {"Name": "status", "Values": ["in-use"]},
+            ]).get("Volumes", [])
+            if not vols:
+                print(f"Disk '{disk_name}' is not locked")
+                return False
+            print(f"Disk '{disk_name}' DDB lock is clear but EBS volume is still attached — sending force-detach request")
+        except Exception:
+            print(f"Disk '{disk_name}' is not locked")
+            return False
     operation_id = str(uuid.uuid4())
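
The `unlock_disk` change above adds a second source of truth: even when DynamoDB reports the disk as unlocked, an EBS volume that is still `in-use` triggers a force-detach request. The sketch below isolates that check. `has_attached_volume` and `FakeEC2` are hypothetical names introduced for illustration; the filter names are copied from the diff, and the stub stands in for a real boto3 EC2 client so the example runs without AWS credentials.

```python
def has_attached_volume(ec2_client, user_id, disk_name):
    """Return True if an in-use EBS volume is tagged for this user and disk."""
    response = ec2_client.describe_volumes(Filters=[
        {"Name": "tag:gpu-dev-user", "Values": [user_id]},
        {"Name": "tag:disk_name", "Values": [disk_name]},
        {"Name": "status", "Values": ["in-use"]},
    ])
    return bool(response.get("Volumes", []))


class FakeEC2:
    """Hypothetical offline stand-in for a boto3 EC2 client."""

    def __init__(self, volumes):
        self._volumes = volumes

    def describe_volumes(self, Filters):
        # A real client filters server-side; the stub returns its canned list.
        return {"Volumes": self._volumes}


print(has_attached_volume(FakeEC2([{"VolumeId": "vol-123"}]), "alice", "scratch"))
print(has_attached_volume(FakeEC2([]), "alice", "scratch"))
```

With a real client, `ec2_client` would be created the way the diff does it, via `config.session.client('ec2', region_name=config.aws_region)`.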

{gpu_dev-0.5.31 → gpu_dev-0.6.0}/cli-tools/gpu-dev-cli/gpu_dev_cli/interactive.py RENAMED

@@ -52,11 +52,19 @@ def check_interactive_support() -> bool:
 def select_gpu_type_interactive(
     availability_info: Dict[str, Dict[str, Any]],
+    _refresh: bool = False,
 ) -> Optional[str]:
     """Interactive GPU type selection with availability table"""
     if not check_interactive_support():
         return None
+    if _refresh:
+        from .reservations import ReservationManager
+        from .config import load_config
+        _cfg = load_config()
+        _mgr = ReservationManager(_cfg)
+        availability_info = _mgr.get_gpu_availability_by_type() or availability_info
     # Hide MIG slice SKUs from the top-level selector — reached via the h100 submenu.
     # Direct `--gpu-type h100-mig-1g` still works for non-interactive scripts.
     visible_info = {
@@ -194,7 +202,7 @@ def select_gpu_type_interactive(
         st.add_column("Avail\nNow", style="green")
         st.add_column("Per\nNode", style="bright_green")
         st.add_column("Status", style="magenta")
-        st.add_column("Availability", style="dim")
+        st.add_column("Spot Discount", style="dim")
         _on_demand = {"b300": 95, "b200": 95, "h200": 55, "h100": 98, "a100": 32, "t4": 4.5, "l4": 7}
         for gt, info in spot_gpus.items():
             avail = info.get("available", 0)
@@ -205,7 +213,7 @@ def select_gpu_type_interactive(
             # Availability signal from spot price vs on-demand
             sp = si.get("spot_price", "") if isinstance(si, dict) else ""
             if not sp or (isinstance(si, dict) and "No spot data" in str(si.get("spot_signal", ""))):
-                avail_signal = "[red]Not offered[/red]"
+                avail_signal = "[green]Available[/green]" if avail > 0 else "[dim]No price data[/dim]"
             else:
                 try:
                     ratio = float(sp) / _on_demand.get(gt, 50)
@@ -266,37 +274,46 @@ def select_gpu_type_interactive(
             si_data = info.get("spot_info", {}) or {}
             sp = si_data.get("spot_price", "") if isinstance(si_data, dict) else ""
             # Derive availability signal
+            avail_now = int(info.get("available", 0))
             if not sp or "No spot data" in str(si_data.get("spot_signal", "")):
-                # Not offered — skip from choices
-                continue
-            try:
-                ratio = float(sp) / _on_demand.get(gt, 50)
-                pct = int((1 - ratio) * 100)
-                if ratio < 0.4: signal = f"🟢 High avail ({pct}% off)"
-                elif ratio < 0.7: signal = f"🟡 Medium ({pct}% off)"
-                else: signal = f"🔴 Low ({pct}% off)"
-            except (ValueError, TypeError):
-                signal = "availability unknown"
+                if avail_now > 0:
+                    signal = f"🟢 {avail_now} available now"
+                else:
+                    continue
+            else:
+                try:
+                    ratio = float(sp) / _on_demand.get(gt, 50)
+                    pct = int((1 - ratio) * 100)
+                    if ratio < 0.4: signal = f"🟢 High avail ({pct}% off)"
+                    elif ratio < 0.7: signal = f"🟡 Medium ({pct}% off)"
+                    else: signal = f"🔴 Low ({pct}% off)"
+                except (ValueError, TypeError):
+                    signal = "availability unknown"
             if avail > 0:
                 label = f"✅ {gt.upper()} * ({avail} free, {pn}/node, {signal})"
             else:
                 label = f"⚡ {gt.upper()} * ({pn} GPUs/node, {signal})"
             choices.append(questionary.Choice(title=label, value=f"spot:{gt}"))
-    console.print()
+    choices.append(questionary.Separator("───"))
+    choices.append(questionary.Choice(title="🔄 Refresh availability", value="_refresh"))
+    console.print()
-    # Interactive selection
-    try:
-        answer = questionary.select(
-            "Select GPU type:", choices=choices, style=custom_style
-        ).ask()
+    # Interactive selection — loop on refresh
+    while True:
+        try:
+            answer = questionary.select(
+                "Select GPU type:", choices=choices, style=custom_style
+            ).ask()
-        return answer
-    except (KeyboardInterrupt, EOFError):
-        console.print("\n[yellow]Selection cancelled.[/yellow]")
-        return None
+            if answer == "_refresh":
+                console.print("[dim]Refreshing...[/dim]")
+                return select_gpu_type_interactive(availability_info, _refresh=True)
+            return answer
+        except (KeyboardInterrupt, EOFError):
+            console.print("\n[yellow]Selection cancelled.[/yellow]")
+            return None
 def _format_eta_seconds(delta_seconds: int) -> str:
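
In the hunk above, every path through the `while True:` body returns, so the refresh is actually carried out by the recursive call to `select_gpu_type_interactive(..., _refresh=True)`. An iterative version of the same idea might look like the sketch below. All names here (`select_with_refresh`, `ask`, `fetch_availability`, `build_choices`) are hypothetical, and the prompt is injected as a callable so the example does not depend on `questionary` or a terminal.

```python
def select_with_refresh(ask, fetch_availability, build_choices, availability):
    """Prompt until the user picks something other than the refresh sentinel."""
    while True:
        answer = ask(build_choices(availability))
        if answer != "_refresh":
            return answer
        # Keep the previous snapshot if the refresh comes back empty,
        # matching the `... or availability_info` fallback in the diff.
        availability = fetch_availability() or availability


# Scripted demo: the user refreshes once, then picks "h100".
scripted_answers = iter(["_refresh", "h100"])
picked = select_with_refresh(
    ask=lambda choices: next(scripted_answers),
    fetch_availability=lambda: {"h100": 2},
    build_choices=lambda availability: sorted(availability),
    availability={"t4": 1},
)
print(picked)
```

The loop form rebuilds the choice list on each pass and avoids growing the call stack when a user refreshes repeatedly.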

{gpu_dev-0.5.31 → gpu_dev-0.6.0}/cli-tools/gpu-dev-cli/gpu_dev_cli/reservations.py RENAMED

@@ -826,8 +826,20 @@ class ReservationManager:
             ]
             if len(matching_reservations) == 0:
-                return None
-            elif len(matching_reservations) > 1:
+                # Not found by user_id — try direct lookup (for added users viewing other's reservations)
+                try:
+                    from boto3.dynamodb.conditions import Key
+                    scan_resp = self.reservations_table.scan(
+                        FilterExpression="begins_with(reservation_id, :rid)",
+                        ExpressionAttributeValues={":rid": reservation_id},
+                        Limit=10,
+                    )
+                    matching_reservations = scan_resp.get("Items", [])
+                except Exception:
+                    pass
+                if not matching_reservations:
+                    return None
+            if len(matching_reservations) > 1:
                 return None  # Ambiguous - need longer prefix
             reservation = matching_reservations[0]
@@ -1689,6 +1701,7 @@ class ReservationManager:
                 initial_text = f"📡 Starting multinode reservation..." if is_multinode else "🔄 Sending reservation request..."
                 spinner = Spinner("dots", text=initial_text)
                 live.update(spinner)
+                poll_delay = 0.5  # start fast, back off over time
                 while (
                     (timeout_seconds is None or time.time() -
@@ -1749,7 +1762,7 @@ class ReservationManager:
                                     if not is_multinode:
                                         spinner.text = "📡 Waiting for reservation status update..."
                                         live.update(spinner)
-                                        time.sleep(2)
+                                        time.sleep(0.5)
                                         continue
                                     else:
                                         node_details.append({
@@ -2281,8 +2294,9 @@ class ReservationManager:
                             return None
-                        # Continue polling
-                        time.sleep(3)
+                        # Poll with backoff: 0.5s → 1s → 1.5s → 2s → 3s (cap)
+                        time.sleep(poll_delay)
+                        poll_delay = min(poll_delay + 0.5, 3.0)
                     except Exception as e:
                         console.print(
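
The polling change above swaps a fixed 3-second sleep for an additive backoff: `poll_delay` starts at 0.5 and grows by 0.5 per iteration up to a 3.0 cap. The generator below reproduces that arithmetic in isolation. `poll_delays` is a hypothetical name, and the sketch covers only the delay sequence, not the surrounding status loop.

```python
import itertools


def poll_delays(start=0.5, step=0.5, cap=3.0):
    """Yield sleep intervals that grow by `step` until they reach `cap`."""
    delay = start
    while True:
        yield delay
        delay = min(delay + step, cap)


print(list(itertools.islice(poll_delays(), 8)))
# [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.0, 3.0]
```

As written, `min(poll_delay + 0.5, 3.0)` also passes through 2.5 s, a step the inline comment in the diff leaves out.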

{gpu_dev-0.5.31 → gpu_dev-0.6.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "gpu-dev"
-version = "0.5.31"
+version = "0.6.0"
 description = "CLI tool for PyTorch GPU developer server reservations"
 authors = [{name = "PyTorch Team"}]
 readme = "cli-tools/gpu-dev-cli/README.md"
