PyPI - gpu-dev - Versions diffs - 0.3.8__tar.gz → 0.4.0__tar.gz - Mend

Files changed (116)

gpu_dev-0.4.0/.github/workflows/no-gitlinks.yml ADDED

@@ -0,0 +1,22 @@
+name: Validate repository structure
+on:
+  pull_request:
+  push:
+    branches:
+      - main
+jobs:
+  no-gitlinks:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+      - name: Ensure no gitlinks are tracked
+        run: |
+          gitlinks=$(git ls-files -s | awk '$1 == 160000 {print}')
+          if [ -n "$gitlinks" ]; then
+            echo "Unexpected gitlinks found:"
+            echo "$gitlinks"
+            exit 1
+          fi
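
The workflow above rejects tracked gitlinks, which are index entries with mode 160000 (submodule pointers). As a minimal sketch of the same check outside CI, the Python below filters text in the `git ls-files -s` format, where each line is `<mode> <object> <stage>`, a tab, and the path. The function name `find_gitlinks` and the sample entries are hypothetical and are not part of the package.

```python
GITLINK_MODE = "160000"  # index mode git uses for gitlink (submodule) entries


def find_gitlinks(ls_files_output: str) -> list[str]:
    """Return the paths of index entries whose mode marks them as gitlinks."""
    paths = []
    for line in ls_files_output.splitlines():
        if not line.strip():
            continue
        # Metadata and path are separated by a tab; metadata fields by spaces.
        meta, _, path = line.partition("\t")
        fields = meta.split()
        if fields and fields[0] == GITLINK_MODE:
            paths.append(path)
    return paths


# Hypothetical sample in the `git ls-files -s` layout (object ids are fake).
sample = "\n".join([
    f"100644 {'a' * 40} 0\tREADME.md",
    f"160000 {'b' * 40} 0\tvendor/submodule",
])
gitlinks = find_gitlinks(sample)
```

In real use the input would come from running `git ls-files -s` in the repository, and a non-empty result would correspond to the `exit 1` branch of the workflow step.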

{gpu_dev-0.3.8 → gpu_dev-0.4.0}/.gitignore RENAMED

@@ -71,3 +71,5 @@ lambda/*/package/
 # Admin output files
 admin/output/
+.claude/worktrees/

{gpu_dev-0.3.8 → gpu_dev-0.4.0}/CLAUDE.md RENAMED

@@ -8,6 +8,8 @@ This will help both you, the agent, but also other agents down the road that sha
 - NEVER run `terraform apply` or any destructive terraform commands
 - You can run read-only terraform commands like `terraform plan`, `terraform state show`, etc.
 - You can run AWS CLI commands for read-only resource fetching and analysis
+- NEVER run destructive AWS CLI commands: `aws ec2 terminate-instances`, `aws ec2 stop-instances`, `aws autoscaling set-desired-capacity` (to 0), `aws ec2 delete-*`, `aws dynamodb delete-table`, etc. On 2026-03-09 an agent accidentally terminated 10 EC2 instances including 6 pet H100 instances from another team's capacity reservations. This must never happen again.
+- NEVER run `kubectl delete node`, `kubectl drain`, `kubectl cordon`, or any command that removes/disrupts running workloads
 - User will handle all infrastructure deployments themselves
 - Note: We use OpenTofu, so user runs `opentofu apply` or `tf apply` locally (tf is aliased to opentofu)
 - we use k for kubectl and have kubens configured to namespace gpu-dev
@@ -73,6 +75,31 @@ Currently we're working on a developer servers with GPUs in AWS. This means we'l
 **K8s Decision:** EKS with GPU-optimized EC2 node groups (Fargate has no GPU support)
+## Multi-Node NCCL Communication (Mar 2026)
+**Working Configuration (SENDRECV protocol):**
+- Protocol: `OFI_NCCL_PROTOCOL=SENDRECV` (host-staged EFA, avoids RDMA mr_regattr failures)
+- GDR disabled: `FI_EFA_USE_DEVICE_RDMA=0`, `NCCL_NET_GDR_LEVEL=0`
+- Socket interface: `NCCL_SOCKET_IFNAME=^lo,docker` (H100 nodes use enp71s0/enp72s0, NOT eth0)
+- Algorithm: `NCCL_ALGO=ring,tree` (NCCL auto-selects tree for large messages, ~2x faster)
+- Exclude Mellanox: `NCCL_IB_HCA=^mlx`
+- OpenMPI lib path: `/opt/amazon/openmpi/lib` (NOT lib64 — EFA installer puts it in lib)
+**Benchmark Results (2x p5.48xlarge, 16 GPUs):**
+- Ring algorithm: ~9.5 GB/s avg bus bandwidth, ~13.4 GB/s peak
+- Tree algorithm: ~21.4 GB/s avg bus bandwidth, ~33.6 GB/s peak
+- Ring+tree combined: ~21.0 GB/s avg (NCCL auto-selects tree for large msgs)
+- Single-node NVLink: ~34 GB/s (for reference)
+**GDR Status (NOT working — future optimization):**
+- EFA RDMA protocol fails: `fi_mr_regattr` returns EFAULT for flush buffer (even host memory)
+- EFA device version: 6 (above aws-ofi-nccl blocklist threshold of 1-3)
+- EFA kernel driver: 2.17.2a (need 2.17.3+ which has "Support P2P with NVIDIA 580 drivers")
+- nvidia-peermem: NOT available (module not found for kernel 6.12.68)
+- efa-nv-peermem: NOT installed (available in amzn-drivers repo, works with open NVIDIA drivers)
+- To enable GDR in future: install efa-nv-peermem module on host nodes, or update EFA kernel driver
+- Expected GDR improvement: ~300-370 GB/s bus bandwidth (vs ~33 GB/s current)
 ## Implementation Status (Jan 11, 2025)
 ### ✅ Completed and Working
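
The "Working Configuration" notes added to CLAUDE.md in the hunk above list NCCL and EFA settings as environment variables. One way to apply them to a launched process, sketched under assumptions, is to merge them into an environment mapping. The variable names and values are copied from the notes; the helper name `build_nccl_env` is hypothetical. The OpenMPI library path is left out because the notes do not say which variable should carry it.

```python
import os

# Values copied from the "Working Configuration (SENDRECV protocol)" notes.
NCCL_SENDRECV_ENV = {
    "OFI_NCCL_PROTOCOL": "SENDRECV",   # host-staged EFA
    "FI_EFA_USE_DEVICE_RDMA": "0",     # GDR disabled
    "NCCL_NET_GDR_LEVEL": "0",         # GDR disabled
    "NCCL_SOCKET_IFNAME": "^lo,docker",
    "NCCL_ALGO": "ring,tree",
    "NCCL_IB_HCA": "^mlx",             # exclude Mellanox devices
}


def build_nccl_env(base=None):
    """Return a copy of `base` (default: os.environ) with the NCCL settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(NCCL_SENDRECV_ENV)
    return env


env = build_nccl_env({"PATH": "/usr/bin"})
```

The resulting mapping could be passed as `env=` to `subprocess.run` when starting a benchmark or training job.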

{gpu_dev-0.3.8/cli-tools/gpu-dev-cli/gpu_dev.egg-info → gpu_dev-0.4.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: gpu-dev
-Version: 0.3.8
+Version: 0.4.0
 Summary: CLI tool for PyTorch GPU developer server reservations
 Author: PyTorch Team
 Requires-Python: >=3.10
@@ -12,7 +12,7 @@ Requires-Dist: pydantic>=2.5.0
 Requires-Dist: rich>=13.7.0
 Requires-Dist: pyyaml>=6.0.1
 Requires-Dist: questionary>=2.1.1
-Requires-Dist: websockets<13.0,>=12.0
+Requires-Dist: websockets>=12.0
 Requires-Dist: certifi>=2023.7.22
 Requires-Dist: mcp>=1.0.0

{gpu_dev-0.3.8 → gpu_dev-0.4.0/cli-tools/gpu-dev-cli/gpu_dev.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: gpu-dev
-Version: 0.3.8
+Version: 0.4.0
 Summary: CLI tool for PyTorch GPU developer server reservations
 Author: PyTorch Team
 Requires-Python: >=3.10
@@ -12,7 +12,7 @@ Requires-Dist: pydantic>=2.5.0
 Requires-Dist: rich>=13.7.0
 Requires-Dist: pyyaml>=6.0.1
 Requires-Dist: questionary>=2.1.1
-Requires-Dist: websockets<13.0,>=12.0
+Requires-Dist: websockets>=12.0
 Requires-Dist: certifi>=2023.7.22
 Requires-Dist: mcp>=1.0.0

{gpu_dev-0.3.8 → gpu_dev-0.4.0}/cli-tools/gpu-dev-cli/gpu_dev.egg-info/SOURCES.txt RENAMED

@@ -5,6 +5,7 @@ PR_DESCRIPTION.md
 TODO.md
 post.md
 pyproject.toml
+.github/workflows/no-gitlinks.yml
 .github/workflows/publish.yml
 admin/README.md
 admin/generate_stats.py

{gpu_dev-0.3.8 → gpu_dev-0.4.0}/cli-tools/gpu-dev-cli/gpu_dev.egg-info/requires.txt RENAMED

@@ -5,6 +5,6 @@ pydantic>=2.5.0
 rich>=13.7.0
 pyyaml>=6.0.1
 questionary>=2.1.1
-websockets<13.0,>=12.0
+websockets>=12.0
 certifi>=2023.7.22
 mcp>=1.0.0

{gpu_dev-0.3.8 → gpu_dev-0.4.0}/cli-tools/gpu-dev-cli/gpu_dev_cli/auth.py RENAMED

@@ -95,7 +95,6 @@ def validate_ssh_key_matches_github_user(config: Config, live=None) -> Dict[str,
             # Restart the spinner
             if live:
                 live.start()
-                live.update(Spinner("dots", text="🔐 Validating SSH key..."))
             # Check if we got the expected GitHub response
             if "Hi " in ssh_output and "You've successfully authenticated" in ssh_output:

{gpu_dev-0.3.8 → gpu_dev-0.4.0}/cli-tools/gpu-dev-cli/gpu_dev_cli/cli.py RENAMED

@@ -310,6 +310,12 @@ def _show_single_reservation(connection_info: dict) -> None:
             oom_time_display = format_timestamp(last_oom_at) if last_oom_at else "Unknown"
             oom_section = f"\n[red]⚠️  OOM Events:[/red] [red]{oom_count} OOM(s) detected (last: {oom_time_display})[/red]"
+        # Show pod internal IP for multinode reservations
+        pod_ip_info = ""
+        pod_ip = connection_info.get("pod_ip")
+        if pod_ip and connection_info.get("is_multinode"):
+            pod_ip_info = f"[blue]Internal IP:[/blue] {pod_ip}\n"
         panel_content = (
             f"[green]Reservation Details[/green]\n\n"
             f"[blue]Quick Connect:[/blue] {connect_command}\n"
@@ -317,7 +323,8 @@ def _show_single_reservation(connection_info: dict) -> None:
             + vscode_info
             + jupyter_info
             + f"[blue]Pod Name:[/blue] {connection_info['pod_name']}\n"
-            f"[blue]GPUs:[/blue] {gpu_info}\n"
+            + pod_ip_info
+            + f"[blue]GPUs:[/blue] {gpu_info}\n"
             f"[blue]Instance Type:[/blue] {instance_type}\n"
             + secondary_users_info
             + f"[blue]Storage:[/blue] {disk_status}\n"
@@ -1408,59 +1415,42 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                         statuses_to_include = requested_statuses
                 else:
-                    # Default: in-progress + recent failures (last hour)
+                    # Default: active statuses only (fast path)
+                    # failed/cancelled are fetched separately and filtered to last hour
                     statuses_to_include = [
-                        "active", "preparing", "queued", "pending", "failed", "cancelled"]
-                reservations = reservation_mgr.list_reservations(
-                    user_filter=user_filter, statuses_to_include=statuses_to_include
-                )
+                        "active", "preparing", "queued", "pending"]
+                # For default view, fetch active statuses + recent failures in parallel
+                if not status:
+                    from datetime import datetime, timezone, timedelta
+                    from concurrent.futures import ThreadPoolExecutor
+                    one_hour_ago = (datetime.now(timezone.utc) - timedelta(hours=1)).isoformat()
+                    def fetch_active():
+                        return reservation_mgr.list_reservations(
+                            user_filter=user_filter, statuses_to_include=statuses_to_include)
+                    def fetch_recent_failures():
+                        return reservation_mgr.list_reservations(
+                            user_filter=user_filter,
+                            statuses_to_include=["failed", "cancelled"],
+                            created_after=one_hour_ago)
+                    with ThreadPoolExecutor(max_workers=2) as executor:
+                        active_future = executor.submit(fetch_active)
+                        failures_future = executor.submit(fetch_recent_failures)
+                        reservations = active_future.result() + failures_future.result()
+                else:
+                    reservations = reservation_mgr.list_reservations(
+                        user_filter=user_filter, statuses_to_include=statuses_to_include
+                    )
             except RuntimeError as e:
                 rprint(f"[red]❌ {str(e)}[/red]")
                 return False
             # Filter failed/cancelled reservations to only show recent ones (last hour)
             if not status or "all" not in (status.split(",") if status else []):
-                # Only apply time filtering when using default filters (not when user specifies --status)
-                from datetime import datetime, timezone, timedelta
-                now = datetime.now(timezone.utc)
-                one_hour_ago = now - timedelta(hours=1)
-                filtered_reservations = []
-                for reservation in reservations:
-                    reservation_status = reservation.get("status", "unknown")
-                    if reservation_status in ["active", "preparing", "queued", "pending"]:
-                        # Always show active/pending reservations
-                        filtered_reservations.append(reservation)
-                    elif reservation_status in ["failed", "cancelled"]:
-                        # Only show failed/cancelled from last hour
-                        created_at = reservation.get("created_at")
-                        if created_at:
-                            try:
-                                if isinstance(created_at, str):
-                                    if created_at.endswith("Z"):
-                                        created_dt = datetime.fromisoformat(
-                                            created_at.replace("Z", "+00:00"))
-                                    elif "+" in created_at or created_at.endswith("00:00"):
-                                        created_dt = datetime.fromisoformat(
-                                            created_at)
-                                    else:
-                                        naive_dt = datetime.fromisoformat(
-                                            created_at)
-                                        created_dt = naive_dt.replace(
-                                            tzinfo=timezone.utc)
-                                else:
-                                    created_dt = datetime.fromtimestamp(
-                                        created_at, tz=timezone.utc)
-                                if created_dt >= one_hour_ago:
-                                    filtered_reservations.append(reservation)
-                            except (ValueError, TypeError):
-                                # If timestamp parsing fails, include it to be safe
-                                filtered_reservations.append(reservation)
-                    else:
-                        # Include other statuses as-is
-                        filtered_reservations.append(reservation)
+                filtered_reservations = reservations
                 reservations = filtered_reservations
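
The hunk above replaces client-side timestamp filtering with two queries run in parallel: one for active statuses, and one for failed or cancelled reservations restricted by `created_after`. The following is a minimal, self-contained sketch of that pattern. `fake_list_reservations` is a hypothetical stand-in for `reservation_mgr.list_reservations`, whose real behavior is not shown in this diff.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta, timezone

ACTIVE_STATUSES = ["active", "preparing", "queued", "pending"]


def list_default_view(list_reservations, user_filter=None):
    """Fetch active reservations and last-hour failures concurrently."""
    one_hour_ago = (datetime.now(timezone.utc) - timedelta(hours=1)).isoformat()
    with ThreadPoolExecutor(max_workers=2) as executor:
        active = executor.submit(
            list_reservations,
            user_filter=user_filter,
            statuses_to_include=ACTIVE_STATUSES,
        )
        recent_failures = executor.submit(
            list_reservations,
            user_filter=user_filter,
            statuses_to_include=["failed", "cancelled"],
            created_after=one_hour_ago,
        )
        # result() blocks until each query finishes and re-raises its exception.
        return active.result() + recent_failures.result()


def fake_list_reservations(user_filter=None, statuses_to_include=None,
                           created_after=None):
    """Hypothetical stand-in: one record per requested status."""
    return [
        {"status": s, "time_filtered": created_after is not None}
        for s in statuses_to_include
    ]


reservations = list_default_view(fake_list_reservations, user_filter="alice")
```

Because `result()` is called on the active query first, the combined list keeps active reservations ahead of recent failures regardless of which query returns first.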
@@ -1556,20 +1546,25 @@ def list(ctx: click.Context, user: Optional[str], status: Optional[str], details
                         else:
                             queue_info = "Calculating..."
                     elif res_status == "active":
-                        # Show SSH connection hint for active reservations
-                        ssh_command = reservation.get("ssh_command", "")
-                        if ssh_command and "dev@" in ssh_command:
-                            try:
-                                node_info = (
-                                    ssh_command.split("dev@")[1].split()[0]
-                                    if "dev@" in ssh_command
-                                    else "Ready"
-                                )
-                                queue_info = f"Ready: {node_info}"
-                            except (IndexError, AttributeError):
-                                queue_info = "Ready"
+                        # Show pod IP for multinode, SSH hint for single-node
+                        pod_ip = reservation.get("pod_ip", "")
+                        is_multinode = reservation.get("is_multinode", False)
+                        if is_multinode and pod_ip:
+                            queue_info = f"IP: {pod_ip}"
                         else:
-                            queue_info = "Ready"
+                            ssh_command = reservation.get("ssh_command", "")
+                            if ssh_command and "dev@" in ssh_command:
+                                try:
+                                    node_info = (
+                                        ssh_command.split("dev@")[1].split()[0]
+                                        if "dev@" in ssh_command
+                                        else "Ready"
+                                    )
+                                    queue_info = f"Ready: {node_info}"
+                                except (IndexError, AttributeError):
+                                    queue_info = "Ready"
+                            else:
+                                queue_info = "Ready"
                     # Format storage indicator - show disk name if available
                     disk_name = reservation.get("disk_name")
@@ -2471,11 +2466,14 @@ def _show_availability() -> None:
                     else:
                         wait_display = f"{hours}h {minutes}min"
-                # Color code availability based on full nodes available
-                # Red: 0 GPUs available
-                # Yellow: Some GPUs available but no full node
-                # Green: At least one full node available
-                if available == 0:
+                # Check maintenance mode
+                is_maintenance = info.get("maintenance", False)
+                maintenance_reason = info.get("maintenance_reason", "")
+                if is_maintenance:
+                    available_display = f"[red]MAINTENANCE[/red]"
+                    wait_display = maintenance_reason or "Under maintenance"
+                elif available == 0:
                     available_display = f"[red]{available}[/red]"
                 elif full_nodes_available > 0:
                     available_display = f"[green]{available}[/green]"
@@ -2485,9 +2483,9 @@ def _show_availability() -> None:
                 table.add_row(
                     gpu_type.upper(),
                     available_display,
-                    str(max_reservable),
+                    str(max_reservable) if not is_maintenance else "-",
                     str(total),
-                    str(queue_length),
+                    str(queue_length) if not is_maintenance else "-",
                     arch,
                     wait_display,
                 )
@@ -2576,6 +2574,7 @@ def _show_availability_watch(interval: int) -> None:
                             title="GPU Availability by Type (numbers are GPUs, not nodes)")
                         table.add_column("GPU Type", style="cyan")
                         table.add_column("Available", style="green")
+                        table.add_column("Max Reservable", style="blue")
                         table.add_column("Total", style="blue")
                         table.add_column("Queue Length", style="yellow")
                         table.add_column("Architecture", style="dim")
@@ -2588,11 +2587,13 @@ def _show_availability_watch(interval: int) -> None:
                             # Add separator before CPU section
                             if last_arch and not last_arch.startswith("CPU") and arch.startswith("CPU"):
                                 table.add_row("---", "---", "---",
-                                              "---", "---", "---")
+                                              "---", "---", "---", "---")
                             last_arch = arch
                             available = info.get("available", 0)
+                            max_reservable = info.get("max_reservable", 0)
                             total = info.get("total", 0)
+                            full_nodes_available = info.get("full_nodes_available", 0)
                             queue_length = info.get("queue_length", 0)
                             est_wait = info.get("estimated_wait_minutes", 0)
@@ -2611,17 +2612,26 @@ def _show_availability_watch(interval: int) -> None:
                                 else:
                                     wait_display = f"{hours}h {minutes}min"
-                            # Color code availability
-                            if available > 0:
+                            # Check maintenance mode
+                            is_maintenance = info.get("maintenance", False)
+                            maintenance_reason = info.get("maintenance_reason", "")
+                            if is_maintenance:
+                                available_display = f"[red]MAINTENANCE[/red]"
+                                wait_display = maintenance_reason or "Under maintenance"
+                            elif available == 0:
+                                available_display = f"[red]{available}[/red]"
+                            elif full_nodes_available > 0:
                                 available_display = f"[green]{available}[/green]"
                             else:
-                                available_display = f"[red]{available}[/red]"
+                                available_display = f"[yellow]{available}[/yellow]"
                             table.add_row(
                                 gpu_type.upper(),
                                 available_display,
+                                str(max_reservable) if not is_maintenance else "-",
                                 str(total),
-                                str(queue_length),
+                                str(queue_length) if not is_maintenance else "-",
                                 arch,
                                 wait_display,
                             )
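
After this release, both availability tables apply the same display rules: maintenance overrides everything, zero GPUs is red, at least one full node is green, and partial availability is yellow. A sketch of those rules as one pure function follows. The function name `format_availability` is hypothetical; the dictionary keys match those read via `info.get(...)` in the hunks above.

```python
def format_availability(info: dict) -> tuple[str, str, str]:
    """Return (available, max_reservable, queue_length) cell text for one GPU type."""
    available = info.get("available", 0)
    if info.get("maintenance", False):
        # Counts are hidden while a GPU type is under maintenance.
        return "[red]MAINTENANCE[/red]", "-", "-"
    if available == 0:
        markup = f"[red]{available}[/red]"
    elif info.get("full_nodes_available", 0) > 0:
        markup = f"[green]{available}[/green]"
    else:
        # Some GPUs are free, but no complete node.
        markup = f"[yellow]{available}[/yellow]"
    return (
        markup,
        str(info.get("max_reservable", 0)),
        str(info.get("queue_length", 0)),
    )


partial = format_availability(
    {"available": 3, "full_nodes_available": 0, "max_reservable": 3, "queue_length": 2}
)
```

The strings use Rich markup tags, so they can be passed directly to `table.add_row` as in the diff.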
@@ -3505,6 +3515,9 @@ def edit(
         # Stop spinner before validation and operations
         live.stop()
+        # Use the full reservation_id from connection_info (not the user-provided prefix)
+        reservation_id = connection_info["reservation_id"]
         if connection_info["status"] != "active":
             rprint(
                 f"[red]❌ Can only edit active reservations (current status: {connection_info['status']})[/red]"

{gpu_dev-0.3.8 → gpu_dev-0.4.0}/cli-tools/gpu-dev-cli/gpu_dev_cli/config.py RENAMED

@@ -67,15 +67,15 @@ class Config:
     def _create_aws_session(self):
         """Create AWS session with profile support"""
-        try:
-            # Try to use 'gpu-dev' profile if it exists
-            session = boto3.Session(profile_name="gpu-dev")
-            # Test if profile works by checking credentials
-            session.get_credentials()
-            return session
-        except Exception:
-            # Fall back to default credentials (environment, default profile, IAM role, etc.)
-            return boto3.Session()
+        available_profiles = boto3.Session().available_profiles
+        if "gpu-dev" in available_profiles:
+            try:
+                session = boto3.Session(profile_name="gpu-dev")
+                session.get_credentials()
+                return session
+            except Exception:
+                pass
+        return boto3.Session()
     @property
     def sts_client(self):
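
The new `_create_aws_session` checks whether the `gpu-dev` profile exists before trying to use it, and falls back to default credentials otherwise. The control flow can be sketched without boto3 by injecting the session factory. `FakeSession` and `create_session` are hypothetical names used only for illustration; in the package the factory is `boto3.Session` and the profile list comes from `boto3.Session().available_profiles`.

```python
class FakeSession:
    """Hypothetical stand-in for boto3.Session, recording the chosen profile."""

    def __init__(self, profile_name=None):
        self.profile_name = profile_name

    def get_credentials(self):
        return object()


def create_session(session_factory, available_profiles, preferred="gpu-dev"):
    """Use the preferred profile only if it is configured and usable."""
    if preferred in available_profiles:
        try:
            session = session_factory(profile_name=preferred)
            session.get_credentials()  # fail early if the profile is unusable
            return session
        except Exception:
            pass  # fall through to default credentials
    return session_factory()


with_profile = create_session(FakeSession, ["default", "gpu-dev"])
without_profile = create_session(FakeSession, ["default"])
```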

{gpu_dev-0.3.8 → gpu_dev-0.4.0}/cli-tools/gpu-dev-cli/gpu_dev_cli/interactive.py RENAMED

@@ -92,8 +92,14 @@ def select_gpu_type_interactive(
                 wait_display = f"{hours}h {minutes}min"
             status_indicator = "⏳"
-        # Color code availability
-        if available > 0:
+        # Check maintenance mode
+        is_maintenance = info.get("maintenance", False)
+        maintenance_reason = info.get("maintenance_reason", "")
+        if is_maintenance:
+            available_display = f"[red]MAINTENANCE[/red]"
+            wait_display = maintenance_reason or "Under maintenance"
+        elif available > 0:
             available_display = f"[green]{available}[/green]"
         else:
             available_display = f"[red]{available}[/red]"
@@ -102,18 +108,25 @@ def select_gpu_type_interactive(
             gpu_type.upper(),
             available_display,
             str(total),
-            str(queue_length),
+            str(queue_length) if not is_maintenance else "-",
             wait_display,
         )
-        # Create choice label with status
-        choice_label = (
-            f"{status_indicator} {gpu_type.upper()} ({available}/{total} available)"
-        )
-        if queue_length > 0:
-            choice_label += f" - {queue_length} in queue"
+        if is_maintenance:
+            choices.append(questionary.Choice(
+                title=f"🔧 {gpu_type.upper()} - MAINTENANCE: {maintenance_reason}",
+                value=gpu_type,
+                disabled="Under maintenance",
+            ))
+        else:
+            # Create choice label with status
+            choice_label = (
+                f"{status_indicator} {gpu_type.upper()} ({available}/{total} available)"
+            )
+            if queue_length > 0:
+                choice_label += f" - {queue_length} in queue"
-        choices.append(questionary.Choice(title=choice_label, value=gpu_type))
+            choices.append(questionary.Choice(title=choice_label, value=gpu_type))
     console.print(table)
     console.print()
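
In the interactive picker, GPU types under maintenance are still listed but cannot be selected. The sketch below reproduces the label logic from the hunk above using a plain dataclass, so it runs without questionary installed. `ChoiceSpec` and `build_choice` are hypothetical names; the real code passes the same three fields (`title`, `value`, `disabled`) to `questionary.Choice`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChoiceSpec:
    """Hypothetical mirror of the fields passed to questionary.Choice."""
    title: str
    value: str
    disabled: Optional[str] = None


def build_choice(gpu_type: str, info: dict, status_indicator: str) -> ChoiceSpec:
    if info.get("maintenance", False):
        reason = info.get("maintenance_reason", "")
        return ChoiceSpec(
            title=f"🔧 {gpu_type.upper()} - MAINTENANCE: {reason}",
            value=gpu_type,
            disabled="Under maintenance",  # shown as the reason it is unselectable
        )
    available = info.get("available", 0)
    total = info.get("total", 0)
    label = f"{status_indicator} {gpu_type.upper()} ({available}/{total} available)"
    queue_length = info.get("queue_length", 0)
    if queue_length > 0:
        label += f" - {queue_length} in queue"
    return ChoiceSpec(title=label, value=gpu_type)


queued = build_choice("h100", {"available": 0, "total": 16, "queue_length": 3}, "⏳")
down = build_choice("b200", {"maintenance": True, "maintenance_reason": "driver upgrade"}, "⏳")
```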
