gpu-dev 0.3.5__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- gpu_dev-0.3.5.dist-info/METADATA +687 -0
- gpu_dev-0.3.5.dist-info/RECORD +14 -0
- gpu_dev-0.3.5.dist-info/WHEEL +5 -0
- gpu_dev-0.3.5.dist-info/entry_points.txt +4 -0
- gpu_dev-0.3.5.dist-info/top_level.txt +1 -0
- gpu_dev_cli/__init__.py +9 -0
- gpu_dev_cli/auth.py +158 -0
- gpu_dev_cli/cli.py +3754 -0
- gpu_dev_cli/config.py +248 -0
- gpu_dev_cli/disks.py +523 -0
- gpu_dev_cli/interactive.py +702 -0
- gpu_dev_cli/name_generator.py +117 -0
- gpu_dev_cli/reservations.py +2231 -0
- gpu_dev_cli/ssh_proxy.py +106 -0
@@ -0,0 +1,687 @@
Metadata-Version: 2.4
Name: gpu-dev
Version: 0.3.5
Summary: CLI tool for PyTorch GPU developer server reservations
Author: PyTorch Team
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: click>=8.1.0
Requires-Dist: boto3>=1.34.0
Requires-Dist: requests>=2.31.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: rich>=13.7.0
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: questionary>=2.1.1
Requires-Dist: websockets>=12.0
Requires-Dist: certifi>=2023.7.22
Requires-Dist: mcp>=1.0.0

# GPU Developer CLI

A command-line tool for reserving and managing GPU development servers on AWS EKS.

## Table of Contents

- [Installation](#installation)
- [Configuration](#configuration)
- [Quick Start](#quick-start)
- [Commands Reference](#commands-reference)
- [GPU Types](#gpu-types)
- [Storage](#storage)
- [Multinode Reservations](#multinode-reservations)
- [Custom Docker Images](#custom-docker-images)
- [Nsight Profiling](#nsight-profiling)
- [Default Container Image](#default-container-image)
- [SSH & IDE Integration](#ssh--ide-integration)
- [Reservation Limits](#reservation-limits)
- [Architecture](#architecture)
- [Troubleshooting](#troubleshooting)
- [Development](#development)

---

## Installation

```bash
# Install directly from GitHub (recommended)
python3 -m pip install --upgrade "git+https://github.com/wdvr/osdc.git"

# Or install from local clone
git clone https://github.com/wdvr/osdc.git
cd osdc
pip install -e .
```

## Configuration

### Initial Setup

```bash
# Set your GitHub username (required for SSH key authentication)
gpu-dev config set github_user your-github-username

# View current configuration
gpu-dev config show
```

Configuration is stored at `~/.config/gpu-dev/config.json`.

### SSH Config Integration

Enable automatic SSH config for seamless VS Code/Cursor integration:

```bash
# Enable SSH config auto-include (recommended)
gpu-dev config ssh-include enable

# Disable if needed
gpu-dev config ssh-include disable
```

When enabled, this adds `Include ~/.gpu-dev/*-sshconfig` to:
- `~/.ssh/config`
- `~/.cursor/ssh_config`

### AWS Authentication

The CLI uses your AWS credentials. Configure via:
- `aws configure` command
- Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`)
- IAM roles (for EC2/Lambda)
- SSO: `aws sso login --profile your-profile`

---

## Quick Start

```bash
# Interactive reservation (guided setup)
gpu-dev reserve

# Reserve 4 H100 GPUs for 8 hours
gpu-dev reserve --gpu-type h100 --gpus 4 --hours 8

# Check your reservations
gpu-dev list

# Connect to your active reservation
gpu-dev connect

# Check GPU availability
gpu-dev avail
```

---

## Commands Reference

### `gpu-dev reserve`

Create a GPU reservation.

**Interactive Mode** (default when parameters are omitted):
```bash
gpu-dev reserve
```
Guides you through GPU type, count, duration, disk, and Jupyter selection.

**Command-line Mode**:
```bash
gpu-dev reserve [OPTIONS]
```

| Option | Short | Description |
|--------|-------|-------------|
| `--gpus` | `-g` | Number of GPUs (1, 2, 4, 8, 12, 16, 20, 24, 32, 40, 48) |
| `--gpu-type` | `-t` | GPU type: `b200`, `h200`, `h100`, `a100`, `a10g`, `t4`, `l4`, `t4-small`, `cpu-arm`, `cpu-x86` |
| `--hours` | `-h` | Duration in hours (0.0833 to 24, supports decimals) |
| `--name` | `-n` | Optional reservation name |
| `--jupyter` | | Enable Jupyter Lab access |
| `--disk` | | Named persistent disk to use, or `none` for temporary storage |
| `--no-persist` | | Create without persistent disk (ephemeral `/home/dev`) |
| `--ignore-no-persist` | | Skip warning when disk is in use |
| `--recreate-env` | | Recreate shell environment on existing disk |
| `--distributed` | `-d` | Required for multinode reservations (>8 GPUs) |
| `--dockerfile` | | Path to custom Dockerfile (max 512KB) |
| `--dockerimage` | | Custom Docker image URL |
| `--preserve-entrypoint` | | Keep original container ENTRYPOINT/CMD |
| `--node-label` | `-l` | Node selector labels (e.g., `--node-label nsight=true`) |
| `--verbose` | `-v` | Enable debug output |
| `--no-interactive` | | Force non-interactive mode |

**Examples**:
```bash
# 2 H100 GPUs for 4 hours with Jupyter
gpu-dev reserve -t h100 -g 2 -h 4 --jupyter

# Use a specific persistent disk
gpu-dev reserve -t a100 -g 4 -h 8 --disk pytorch-dev

# Temporary storage only
gpu-dev reserve -t t4 -g 1 -h 2 --disk none

# 16 GPUs across 2 nodes (multinode)
gpu-dev reserve -t h100 -g 16 -h 12 --distributed

# Custom Docker image
gpu-dev reserve -t h100 -g 4 --dockerimage pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel

# Request an Nsight profiling node
gpu-dev reserve -t h100 -g 8 --node-label nsight=true
```

### `gpu-dev list`

List your reservations.

```bash
gpu-dev list [OPTIONS]
```

| Option | Short | Description |
|--------|-------|-------------|
| `--user` | `-u` | Filter by user (`all` for all users) |
| `--status` | `-s` | Filter by status: `active`, `queued`, `pending`, `preparing`, `expired`, `cancelled`, `failed` |
| `--all` | `-a` | Show all reservations (including expired/cancelled) |
| `--watch` | | Continuously refresh every 2 seconds |

### `gpu-dev show`

Show detailed information for a specific reservation.

```bash
gpu-dev show [RESERVATION_ID]
```

If no ID is provided, shows details for your active/pending reservation.

### `gpu-dev connect`

SSH to your active reservation.

```bash
gpu-dev connect [RESERVATION_ID]
```

If no ID is provided, connects to your active reservation.

### `gpu-dev cancel`

Cancel a reservation.

```bash
gpu-dev cancel [RESERVATION_ID]
```

**Interactive Mode**: If no ID is provided, shows a selection menu.

| Option | Short | Description |
|--------|-------|-------------|
| `--all` | `-a` | Cancel all your active reservations |

### `gpu-dev edit`

Modify an active reservation.

```bash
gpu-dev edit [RESERVATION_ID] [OPTIONS]
```

| Option | Description |
|--------|-------------|
| `--enable-jupyter` | Enable Jupyter Lab |
| `--disable-jupyter` | Disable Jupyter Lab |
| `--extend` | Extend reservation duration |
| `--add-user` | Add secondary user (GitHub username) |

**Examples**:
```bash
# Enable Jupyter on an existing reservation
gpu-dev edit abc12345 --enable-jupyter

# Extend reservation
gpu-dev edit abc12345 --extend

# Add collaborator
gpu-dev edit abc12345 --add-user colleague-github-name
```

### `gpu-dev avail`

Check GPU availability by type.

```bash
gpu-dev avail [OPTIONS]
```

| Option | Description |
|--------|-------------|
| `--watch` | Continuously refresh every 5 seconds |

### `gpu-dev status`

Show overall cluster status and capacity.

```bash
gpu-dev status
```

### `gpu-dev disk`

Manage persistent disks.

#### `gpu-dev disk list`
```bash
gpu-dev disk list [OPTIONS]
```

| Option | Description |
|--------|-------------|
| `--watch` | Continuously refresh every 2 seconds |
| `--user` | Impersonate another user |

Shows: disk name, size, created date, last used, snapshot count, status (available/in-use/backing-up/deleted).

#### `gpu-dev disk create`
```bash
gpu-dev disk create <DISK_NAME>
```
Creates a new named persistent disk. Disk names can contain letters, numbers, hyphens, and underscores.
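
That naming rule can be sketched as a quick local check (illustrative only; `is_valid_disk_name` is a hypothetical helper mirroring the rule above, and the CLI performs its own validation):

```bash
# Hypothetical check mirroring the documented naming rule:
# letters, numbers, hyphens, and underscores only.
is_valid_disk_name() {
  [[ "$1" =~ ^[A-Za-z0-9_-]+$ ]]
}

is_valid_disk_name "pytorch-dev" && echo "ok"        # valid name
is_valid_disk_name "bad name"    || echo "rejected"  # spaces are not allowed
```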

#### `gpu-dev disk delete`
```bash
gpu-dev disk delete <DISK_NAME> [--yes/-y]
```
Soft-deletes a disk. Snapshots are permanently deleted after 30 days.

#### `gpu-dev disk list-content`
```bash
gpu-dev disk list-content <DISK_NAME>
```
Shows a file listing from the latest snapshot of a disk.

#### `gpu-dev disk rename`
```bash
gpu-dev disk rename <OLD_NAME> <NEW_NAME>
```
Renames an existing disk.

### `gpu-dev help`

Show help information.

---

## GPU Types

| GPU Type | Instance Type | GPUs/Node | Memory/GPU | Best For |
|----------|--------------|-----------|------------|----------|
| `b200` | p6-b200.48xlarge | 8 | 192GB | Latest NVIDIA Blackwell, highest performance |
| `h200` | p5e.48xlarge | 8 | 141GB | Large models, high memory workloads |
| `h100` | p5.48xlarge | 8 | 80GB | Production training, large-scale inference |
| `a100` | p4d.24xlarge | 8 | 40GB | General ML training |
| `a10g` | g5.12xlarge | 4 | 24GB | Inference, smaller training |
| `l4` | g6.12xlarge | 4 | 24GB | Inference, cost-effective |
| `t4` | g4dn.12xlarge | 4 | 16GB | Development, testing |
| `t4-small` | g4dn.xlarge | 1 | 16GB | Single GPU development |
| `cpu-arm` | c7g.4xlarge | 0 | N/A | ARM CPU-only workloads |
| `cpu-x86` | c7i.4xlarge | 0 | N/A | x86 CPU-only workloads |

---

## Storage

### Persistent Disk (EBS) - `/home/dev`

Each user can have **named persistent disks** that preserve data between sessions:

- **Mount point**: `/home/dev` (your home directory)
- **Size**: 100GB per disk
- **Backed up**: Automatic snapshots when the reservation ends
- **Content tracking**: View contents via `gpu-dev disk list-content`

**Workflow**:
```bash
# Create a new disk
gpu-dev disk create my-project

# Use it in a reservation
gpu-dev reserve --disk my-project

# List your disks
gpu-dev disk list

# View disk contents (from snapshot)
gpu-dev disk list-content my-project
```

**Multiple Disks**: You can have multiple named disks for different projects (e.g., `pytorch-dev`, `llm-training`, `experiments`).

**Disk Selection**: During interactive reservation, you'll be prompted to select a disk or create a new one.

### Shared Personal Storage (EFS) - `/shared-personal`

Per-user EFS filesystem for larger files that persist across all your reservations:

- **Mount point**: `/shared-personal`
- **Size**: Elastic (pay for what you use)
- **Use case**: Datasets, model checkpoints, large files

### Shared ccache (EFS) - `/ccache`

Shared compiler cache across ALL users:

- **Mount point**: `/ccache`
- **Environment**: `CCACHE_DIR=/ccache`
- **Benefit**: Faster compilation for PyTorch and other C++ projects
- **Shared**: Cache hits from any user benefit everyone
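
For builds driven by CMake, you can route compilations through the shared cache with standard launcher variables (a sketch; these are plain CMake/ccache settings, not gpu-dev specific, and `CCACHE_DIR` is already exported in the default image):

```bash
export CCACHE_DIR=/ccache                  # already set in the default image
export CMAKE_C_COMPILER_LAUNCHER=ccache    # wrap C compiles in ccache
export CMAKE_CXX_COMPILER_LAUNCHER=ccache  # wrap C++ compiles in ccache
```

Run `ccache -s` afterwards to see hit/miss statistics and confirm the shared cache is being used.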

### Temporary Storage

Use `--disk none` or `--no-persist` for reservations without a persistent disk:
- `/home/dev` uses ephemeral storage
- Data is lost when the reservation ends
- Useful for quick experiments or CI-like workflows

---

## Multinode Reservations

For distributed training across multiple GPU nodes:

```bash
# 16 H100 GPUs (2 nodes x 8 GPUs)
gpu-dev reserve -t h100 -g 16 --distributed

# 24 H100 GPUs (3 nodes x 8 GPUs)
gpu-dev reserve -t h100 -g 24 --distributed
```

**Requirements**:
- GPU count must be a multiple of GPUs-per-node (e.g., 16, 24, 32 for H100)
- The `--distributed` flag is required

**What you get**:
- Multiple pods with hostname resolution: `<podname>-headless.gpu-dev.svc.cluster.local`
- A shared network drive between nodes
- Network connectivity between all pods
- Master port 29500 available on all nodes
- EFA (Elastic Fabric Adapter) for high-bandwidth inter-node communication

**Node naming**: Nodes are numbered 0 to N-1. Use `$RANK` or the node index to set `MASTER_ADDR`.
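
Putting that together, a minimal rendezvous setup might look like this (a sketch: `POD_BASE` is a placeholder for your node-0 pod name, not a variable exported by gpu-dev; the hostname format and port 29500 come from the list above):

```bash
# Derive the rendezvous endpoint from node 0's pod name.
POD_BASE="${POD_BASE:-mypod-node-0}"   # placeholder; substitute your pod name
MASTER_ADDR="${POD_BASE}-headless.gpu-dev.svc.cluster.local"
MASTER_PORT=29500
echo "MASTER_ADDR=${MASTER_ADDR} MASTER_PORT=${MASTER_PORT}"
```

With these set, a typical per-node launch is `torchrun --nnodes=2 --node_rank=$NODE_RANK --nproc_per_node=8 --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT train.py`, where `NODE_RANK` runs from 0 to N-1.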

---

## Custom Docker Images

### Using a Pre-built Image

```bash
gpu-dev reserve --dockerimage pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel
```

Note: The image must have SSH server capabilities for remote access.

### Using a Custom Dockerfile

```bash
gpu-dev reserve --dockerfile ./my-project/Dockerfile
```

**Limitations**:
- Dockerfile max size: 512KB
- Build context (directory) max size: ~700KB compressed
- Build happens at reservation time (adds startup time)

**Example Dockerfile**:
```dockerfile
FROM pytorch/pytorch:2.3.0-cuda12.1-cudnn8-devel

# Install additional packages
RUN pip install transformers datasets accelerate

# Your customizations...
```

### Preserving Entrypoint

To keep the original container's ENTRYPOINT/CMD instead of the SSH server:

```bash
gpu-dev reserve --dockerimage myimage:latest --preserve-entrypoint
```

---

## Nsight Profiling

For GPU profiling with NVIDIA Nsight Compute (ncu) and Nsight Systems (nsys):

```bash
# Request a profiling-dedicated node
gpu-dev reserve -t h100 -g 8 --node-label nsight=true
```

**Why dedicated nodes?**
- DCGM (GPU monitoring) conflicts with Nsight profiling
- Profiling-dedicated nodes have DCGM disabled
- One H100, one B200, and one T4 node are reserved for profiling

**Profiling capabilities enabled**:
- `CAP_SYS_ADMIN` Linux capability on pods
- `NVreg_RestrictProfilingToAdminUsers=0` on nodes
- `NVIDIA_DRIVER_CAPABILITIES=compute,utility`

**Available profiling tools**:
- `ncu` - Nsight Compute for kernel profiling
- `nsys` - Nsight Systems for system-wide profiling

---

## Default Container Image

The default image (based on `pytorch/pytorch:2.9.1-cuda12.8-cudnn9-devel`) includes:

### Pre-installed Software

**Deep Learning**:
- PyTorch 2.9.1 with CUDA 12.8
- cuDNN 9
- CUDA Toolkit 12.8 + 13.0

**Python Packages**:
- JupyterLab, ipywidgets
- matplotlib, seaborn, plotly
- pandas, numpy, scikit-learn
- tensorboard

**System Tools**:
- zsh with oh-my-zsh (default shell)
- bash with bash-completion
- vim, nano, neovim
- tmux, htop, tree
- git, curl, wget
- ccache

**Development**:
- Claude Code CLI (`claude`)
- Node.js 20
- SSH server

### Shell Environment

- **Default shell**: zsh with oh-my-zsh
- **Plugins**: zsh-autosuggestions, zsh-syntax-highlighting
- **User**: `dev` with passwordless sudo
- **Home**: `/home/dev` (persistent or temporary, depending on disk settings)

### Environment Variables

```bash
CUDA_12_PATH=/usr/local/cuda-12.8
CUDA_13_PATH=/usr/local/cuda-13.0
CCACHE_DIR=/ccache
```
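
For example, to compile against the CUDA 13.0 toolkit instead of the default 12.8 (a sketch using the variables above; adjust to your build system):

```bash
# Point the toolchain at CUDA 13.0; falls back to the documented
# path if CUDA_13_PATH is not set in the current shell.
export CUDA_HOME="${CUDA_13_PATH:-/usr/local/cuda-13.0}"
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
```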

---

## SSH & IDE Integration

### SSH Access

Once your reservation is active:

```bash
# Quick connect
gpu-dev connect

# Or use the SSH command shown in reservation details
ssh dev@<node-ip> -p <nodeport>

# With SSH config enabled (recommended)
ssh <pod-name>
```

### VS Code Remote

With SSH config enabled:
```bash
code --remote ssh-remote+<pod-name> /home/dev
```

Or click the VS Code link shown in `gpu-dev show` output.

### Cursor IDE

Works the same as VS Code when SSH config is enabled:
1. Open Remote SSH in Cursor
2. Select your pod from the list

### SSH Agent Forwarding

To use your local SSH keys on the server (e.g., for git):
```bash
ssh -A <pod-name>
```

Or add to your SSH config:
```
Host gpu-dev-*
    ForwardAgent yes
```

---

## Reservation Limits

| Limit | Value |
|-------|-------|
| Maximum duration | 24 hours |
| Minimum duration | 5 minutes (0.0833 hours) |
| Extension | Once, up to 24 additional hours |
| Total max time | 48 hours (24h initial + 24h extension) |

**Expiry Warnings**:
- 30 minutes before expiry
- 15 minutes before expiry
- 5 minutes before expiry

Warnings appear as files in your home directory and via `wall` messages.

---

## Architecture

### System Components

```
┌─────────────┐     ┌──────────────┐     ┌─────────────────────┐
│  GPU Dev    │────▶│  SQS Queue   │────▶│  Lambda Processor   │
│  CLI        │     │              │     │                     │
└─────────────┘     └──────────────┘     └──────────┬──────────┘
       │                                            │
       │                                            ▼
       │            ┌──────────────┐     ┌─────────────────────┐
       └───────────▶│  DynamoDB    │◀────│  EKS Cluster        │
                    │ Reservations │     │  (GPU Nodes)        │
                    └──────────────┘     └─────────────────────┘
```

### Infrastructure

- **EKS Cluster**: Kubernetes cluster with GPU-enabled nodes
- **Node Groups**: Auto-scaling groups per GPU type
- **NVIDIA GPU Operator**: Manages GPU drivers and the device plugin
- **EBS CSI Driver**: Handles persistent volume attachments
- **EFS**: Shared storage for personal files and ccache

### Networking

- **SSH Access**: Via NodePort services (30000-32767)
- **Inter-node**: EFA (Elastic Fabric Adapter) for multinode
- **DNS**: Pod hostname resolution via headless services
- **Internet**: Full outbound access from pods

---

## Troubleshooting

### Common Issues

**"Disk is in use"**:
- Your disk is attached to another reservation
- Cancel the other reservation or use `--disk none`
- Check: `gpu-dev disk list`

**"Queued" status**:
- No GPU capacity is available
- Wait for your queue position to advance
- Check availability: `gpu-dev avail`

**SSH connection refused**:
- The pod may still be starting
- Wait for the status to become "active"
- Check: `gpu-dev show <id>`

**Pod stuck in "preparing"**:
- Image pull may be slow (especially for custom images)
- Disk attachment may take time
- Check detailed status: `gpu-dev show <id>`

### Debugging Commands

```bash
# Show detailed reservation info
gpu-dev show <reservation-id>

# Watch reservation status
gpu-dev list --watch

# Check cluster status
gpu-dev status

# View disk contents
gpu-dev disk list-content <disk-name>
```

### Getting Help

- Use `gpu-dev help` or `gpu-dev <command> --help`
- Report issues: https://github.com/anthropics/claude-code/issues

---

## Development

```bash
# Install development dependencies
poetry install --with dev

# Run tests
poetry run pytest

# Format code
poetry run black .
poetry run isort .

# Type checking
poetry run mypy .
```

@@ -0,0 +1,14 @@
gpu_dev_cli/__init__.py,sha256=uReOa-UoRdJ34LyyTK0erA32lhJ7FW4B0_mWbE32ams,248
gpu_dev_cli/auth.py,sha256=cmL6Riu8Xu3REZ-KqAUm43dbjRDJpIvMzd-GxTkiqf4,5768
gpu_dev_cli/cli.py,sha256=EPKl4AV0Hf4YZReB34jo4XA6nCiJ091Q7i6OpbIF190,161134
gpu_dev_cli/config.py,sha256=9ji-fx4_1eyG6zNayEmDE6XFefOPwO-dK7rkeXXfBFY,8660
gpu_dev_cli/disks.py,sha256=xq0Llxx3Zj0S3INi63WaLytvxvWejaPbWFJHYoqSLWs,19436
gpu_dev_cli/interactive.py,sha256=be_XTvmHtd3PWYuyWxQEfztfB4JisQDxeM1zCEVps-w,24530
gpu_dev_cli/name_generator.py,sha256=b6mXImM0Hl7BxsYnIjwOfSPxRustDNoQZKubcSR5API,3150
gpu_dev_cli/reservations.py,sha256=jNCUpTdVqcPV3sPMrrrJNR4UumofFhsvtbRMwrODcN0,106732
gpu_dev_cli/ssh_proxy.py,sha256=_lRMaf6AaydDqL8nyY5lpQGchQGdvU5by-0_23NqrkU,3548
gpu_dev-0.3.5.dist-info/METADATA,sha256=szTvK-5U6V7rS7HCVUbfgs79na0YLguBZXCgoOuC1-I,17359
gpu_dev-0.3.5.dist-info/WHEEL,sha256=YLJXdYXQ2FQ0Uqn2J-6iEIC-3iOey8lH3xCtvFLkd8Q,91
gpu_dev-0.3.5.dist-info/entry_points.txt,sha256=Am0TH9VdmGD35rTwilbMRZ4RCmo2mE0ETeysEpxAeb4,138
gpu_dev-0.3.5.dist-info/top_level.txt,sha256=fMLD98XhbRH_8A2V_vwPfWFt8yagyZysSNBbeEgM2Ks,12
gpu_dev-0.3.5.dist-info/RECORD,,
@@ -0,0 +1 @@
gpu_dev_cli