dayhoff-tools 1.1.10__py3-none-any.whl → 1.13.12__py3-none-any.whl
- dayhoff_tools/__init__.py +10 -0
- dayhoff_tools/cli/cloud_commands.py +179 -43
- dayhoff_tools/cli/engine1/__init__.py +323 -0
- dayhoff_tools/cli/engine1/engine_core.py +703 -0
- dayhoff_tools/cli/engine1/engine_lifecycle.py +136 -0
- dayhoff_tools/cli/engine1/engine_maintenance.py +431 -0
- dayhoff_tools/cli/engine1/engine_management.py +505 -0
- dayhoff_tools/cli/engine1/shared.py +501 -0
- dayhoff_tools/cli/engine1/studio_commands.py +825 -0
- dayhoff_tools/cli/engines_studios/__init__.py +6 -0
- dayhoff_tools/cli/engines_studios/api_client.py +351 -0
- dayhoff_tools/cli/engines_studios/auth.py +144 -0
- dayhoff_tools/cli/engines_studios/engine-studio-cli.md +1230 -0
- dayhoff_tools/cli/engines_studios/engine_commands.py +1151 -0
- dayhoff_tools/cli/engines_studios/progress.py +260 -0
- dayhoff_tools/cli/engines_studios/simulators/cli-simulators.md +151 -0
- dayhoff_tools/cli/engines_studios/simulators/demo.sh +75 -0
- dayhoff_tools/cli/engines_studios/simulators/engine_list_simulator.py +319 -0
- dayhoff_tools/cli/engines_studios/simulators/engine_status_simulator.py +369 -0
- dayhoff_tools/cli/engines_studios/simulators/idle_status_simulator.py +476 -0
- dayhoff_tools/cli/engines_studios/simulators/simulator_utils.py +180 -0
- dayhoff_tools/cli/engines_studios/simulators/studio_list_simulator.py +374 -0
- dayhoff_tools/cli/engines_studios/simulators/studio_status_simulator.py +164 -0
- dayhoff_tools/cli/engines_studios/studio_commands.py +755 -0
- dayhoff_tools/cli/main.py +106 -7
- dayhoff_tools/cli/utility_commands.py +896 -179
- dayhoff_tools/deployment/base.py +70 -6
- dayhoff_tools/deployment/deploy_aws.py +165 -25
- dayhoff_tools/deployment/deploy_gcp.py +78 -5
- dayhoff_tools/deployment/deploy_utils.py +20 -7
- dayhoff_tools/deployment/job_runner.py +9 -4
- dayhoff_tools/deployment/processors.py +230 -418
- dayhoff_tools/deployment/swarm.py +47 -12
- dayhoff_tools/embedders.py +28 -26
- dayhoff_tools/fasta.py +181 -64
- dayhoff_tools/warehouse.py +268 -1
- {dayhoff_tools-1.1.10.dist-info → dayhoff_tools-1.13.12.dist-info}/METADATA +20 -5
- dayhoff_tools-1.13.12.dist-info/RECORD +54 -0
- {dayhoff_tools-1.1.10.dist-info → dayhoff_tools-1.13.12.dist-info}/WHEEL +1 -1
- dayhoff_tools-1.1.10.dist-info/RECORD +0 -32
- {dayhoff_tools-1.1.10.dist-info → dayhoff_tools-1.13.12.dist-info}/entry_points.txt +0 -0
@@ -0,0 +1,1230 @@

# Engine & Studio CLI Commands (v2)

Comprehensive CLI for managing ephemeral compute engines and persistent studio volumes with **real-time progress tracking and enhanced observability**.

## Overview

This is the **new implementation** of the engines/studios CLI, currently accessed via `dh engine2` and `dh studio2` during the migration period.

### Key Improvements Over v1
- ✅ **Real-time progress tracking** for launch and attach operations
- ✅ **Detailed idle detector visibility** with sensor-level information
- ✅ **Click-based architecture** for better composability
- ✅ **Comprehensive error messages** with actionable guidance
- ✅ **Environment flag support** across all commands

### Command Migration

**Current (during transition):**
- `dh engine` / `dh studio` → Legacy Typer-based commands (v1)
- `dh engine2` / `dh studio2` → New Click-based commands with progress (v2)

**After production deployment:**
- `dh engine2` will become `dh engine`
- `dh studio2` will become `dh studio`
- v1 commands will be deprecated

### System Components
- **Engines**: Ephemeral EC2 instances for compute (CPU, GPU types)
- **Studios**: Persistent EBS volumes that attach/detach from engines
- **Auto-shutdown**: Modular idle detection prevents runaway costs
- **Progress APIs**: Real-time status updates during async operations

## Global Options

All commands support:
- `--env <dev|sand|prod>` - Target environment (default: dev)
- `--help` - Show command help

## Engine Commands

### Lifecycle Management

#### `dh engine2 launch`

Launch a new engine and wait for it to be ready with real-time progress tracking.

**Usage:**
```bash
dh engine2 launch <name> --type <type> [options]
```

**Arguments:**
- `name` - Unique name for the engine (used for SSH, identification)

**Options:**
- `--type <type>` - **Required.** Engine type:
  - `cpu` - r6i.2xlarge (8 vCPU, 64GB RAM)
  - `cpumax` - r7i.8xlarge (32 vCPU, 256GB RAM)
  - `t4` - g4dn.2xlarge (T4 GPU, 16GB VRAM)
  - `a10g` - g5.2xlarge (A10G GPU, 24GB VRAM)
  - `a100` - p4d.24xlarge (8x A100, 40GB VRAM each)
  - `4_t4`, `8_t4` - Multi-GPU T4 instances
  - `4_a10g`, `8_a10g` - Multi-GPU A10G instances
- `--size <GB>` - Boot disk size in GB (optional)
- `--user <username>` - User to launch engine for (defaults to current user, use for testing/admin)
- `--no-wait` - Return immediately without waiting for readiness
- `-y, --yes` - Skip confirmation for non-dev environments
- `--env <env>` - Target environment (default: dev)
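
The single-GPU and CPU aliases listed under `--type` can be pictured as a small lookup table. The sketch below is illustrative only: `ENGINE_TYPES` and `describe_engine_type` are hypothetical names that do not come from the dayhoff-tools source, and the multi-GPU aliases are left out because this document does not name their instance types.

```python
# Hypothetical lookup table assembled from the documented --type values.
# Each alias maps to (EC2 instance type, short hardware summary).
ENGINE_TYPES = {
    "cpu": ("r6i.2xlarge", "8 vCPU, 64GB RAM"),
    "cpumax": ("r7i.8xlarge", "32 vCPU, 256GB RAM"),
    "t4": ("g4dn.2xlarge", "T4 GPU, 16GB VRAM"),
    "a10g": ("g5.2xlarge", "A10G GPU, 24GB VRAM"),
    "a100": ("p4d.24xlarge", "8x A100, 40GB VRAM each"),
}


def describe_engine_type(alias: str) -> str:
    """Return a one-line description for a documented engine type alias."""
    try:
        instance_type, specs = ENGINE_TYPES[alias]
    except KeyError:
        known = ", ".join(sorted(ENGINE_TYPES))
        raise ValueError(
            f"Unknown engine type {alias!r}; expected one of: {known}"
        ) from None
    return f"{alias} -> {instance_type} ({specs})"
```

Failing fast on an unknown alias, with the valid choices in the message, mirrors the "actionable guidance" goal stated in the overview.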

**Examples:**
```bash
# Launch CPU engine for development
dh engine2 launch dev-work --type cpu

# Launch GPU engine with custom disk size
dh engine2 launch training-job --type a10g --size 200

# Launch without waiting (check status later)
dh engine2 launch batch-worker --type cpumax --no-wait

# Launch engine for test user (testing/admin)
dh engine2 launch e2e-engine --type cpu --user testuser1
```

**Output with progress tracking:**
```
🚀 Launching cpu engine 'my-engine'...
✓ EC2 instance launched: i-1234567890abcdef0

⏳ Waiting for engine to be ready (typically 2-3 minutes)...

Progress ████████████░░░░░░░░░░ 60%
[5s] Instance Running
[8s] Downloading Scripts
[15s] Installing Packages
[22s] Mounting Primordial Drive
[45s] Configuring Idle Detector
[52s] Finalizing

✓ Engine ready!

Connect with:
dh engine2 config-ssh # Add to SSH config
ssh my-engine # Then use native SSH
```

**Bootstrap stages (9 total):**
1. Instance running
2. Downloading scripts
3. Installing packages
4. Mounting Primordial Drive
5. Installing GPU drivers (if applicable)
6. Creating environment
7. Configuring idle detector
8. Configuring SSH (passwordless access for IDE connections)
9. Ready

**Bootstrap time:**
- CPU: 1-2 minutes
- GPU (first boot): 3-5 minutes (driver installation + reboot)
- GPU (from GAMI): 1-2 minutes

---

#### `dh engine2 start`

Start a stopped engine.

**Usage:**
```bash
dh engine2 start <name-or-id> [options]
```

**Arguments:**
- `name-or-id` - Engine name or EC2 instance ID

**Options:**
- `--no-wait` - Return immediately without waiting for readiness
- `--skip-ssh-config` - Don't automatically update SSH config
- `-y, --yes` - Skip confirmation for non-dev environments
- `--env <env>` - Target environment

**Examples:**
```bash
dh engine2 start my-engine
dh engine2 start i-1234567890abcdef0
```

**Output:**
```
Starting engine 'my-engine'...
✓ Engine 'my-engine' is starting
```

**Note:** Starting an engine does not re-run bootstrap. The engine resumes from its previous state.

---

#### `dh engine2 stop`

Stop a running engine (keeps EBS boot disk).

**Usage:**
```bash
dh engine2 stop <name-or-id> [options]
```

**Arguments:**
- `name-or-id` - Engine name or EC2 instance ID

**Options:**
- `-y, --yes` - Skip confirmation for non-dev environments
- `--env <env>` - Target environment

**Examples:**
```bash
dh engine2 stop my-engine
```

**Output:**
```
Stopping engine 'my-engine'...
✓ Engine 'my-engine' is stopping
```

**Note:** Stopped engines still incur EBS storage costs (~$0.08/GB-month for boot disk). Studios must be detached before stopping.

---

#### `dh engine2 terminate`

Permanently terminate an engine (deletes EBS boot disk).

**Usage:**
```bash
dh engine2 terminate <name-or-id> [options]
```

**Arguments:**
- `name-or-id` - Engine name or EC2 instance ID

**Options:**
- `-y, --yes` - Skip confirmation prompt
- `--env <env>` - Target environment

**Examples:**
```bash
# With confirmation
dh engine2 terminate my-engine

# Skip confirmation
dh engine2 terminate my-engine -y
```

**Output:**
```
Terminate engine 'my-engine' (i-1234567890abcdef0)? [y/N]: y
✓ Engine 'my-engine' is terminating
```

**Warning:** This permanently deletes the engine's boot disk. Any data not in studios or Primordial Drive will be lost.

---

### Status and Information

#### `dh engine2 status`

Show comprehensive engine status including idle detector state with real-time sensor data.

**Usage:**
```bash
dh engine2 status <name-or-id> [options]
```

**Arguments:**
- `name-or-id` - Engine name or EC2 instance ID

**Options:**
- `--detailed` - Show detailed sensor information with confidence levels
- `--env <env>` - Target environment

**Examples:**
```bash
# Basic status
dh engine2 status my-engine

# Detailed status with sensor breakdown
dh engine2 status my-engine --detailed
```

**Output (basic):**
```
Engine: my-engine
Instance ID: i-1234567890abcdef0
Type: cpu
State: running
Public IP: 54.123.45.67
Launched: 2 hours ago

Idle Status: 🟢 ACTIVE
Reason: ssh: 1 active SSH session(s)
```

**Output (detailed):**
```
Engine: my-engine
Instance ID: i-1234567890abcdef0
Type: cpu
State: running

Idle Status: 🟢 ACTIVE
Reason: ssh: 1 active SSH session(s)

============================================================
Activity Sensors:
============================================================

✓ SSH (HIGH confidence)
1 active SSH session(s)
sessions:
- alice pts/0 2025-11-10 14:30 old 12345

✗ IDE (MEDIUM confidence)
No IDE connections found

✗ DOCKER (MEDIUM confidence)
No workload containers
ignored:
- devcontainer-1 (dev-container)

✗ COFFEE (HIGH confidence)
No coffee lock
```

**Idle detector sensors:**
- **SSH** (HIGH confidence) - Detects active SSH sessions via `who -u`
- **IDE** (MEDIUM confidence) - Detects VS Code/Cursor remote connections
- **Docker** (MEDIUM confidence) - Detects non-dev workload containers
- **Coffee** (HIGH confidence) - Explicit user keep-alive lock
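
The status output above suggests that an engine counts as active as soon as any one sensor reports activity, with that sensor quoted as the reason. The following is a minimal sketch of that reading, not the detector's actual logic: `SensorReading`, `summarize_idle_state`, and the `IDLE` state label are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class SensorReading:
    """One sensor result, mirroring the fields shown by `status --detailed`."""
    name: str        # e.g. "ssh", "ide", "docker", "coffee"
    active: bool     # True when the sensor sees activity
    confidence: str  # "HIGH" or "MEDIUM", as listed above
    reason: str      # human-readable detail


def summarize_idle_state(readings: Iterable[SensorReading]) -> Tuple[str, str]:
    """Return (state, reason); ACTIVE if any sensor reports activity.

    Assumption: the first active sensor supplies the reason string.
    """
    for reading in readings:
        if reading.active:
            return "ACTIVE", f"{reading.name}: {reading.reason}"
    return "IDLE", "no sensor reported activity"
```

With an active SSH reading and inactive IDE, Docker, and Coffee readings, this returns `("ACTIVE", "ssh: 1 active SSH session(s)")`, matching the `Reason:` line in the sample output.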

---

#### `dh engine2 list`

List all engines in the environment.

**Usage:**
```bash
dh engine2 list [--env <env>]
```

**Examples:**
```bash
# List engines in dev
dh engine2 list

# List engines in sand
dh engine2 list --env sand
```

**Output:**
```

Engines for AWS Account dev
╭──────────────┬─────────────┬─────────────┬─────────────┬─────────────────────╮
│ Name         │ State       │ User        │ Type        │ Instance ID         │
├──────────────┼─────────────┼─────────────┼─────────────┼─────────────────────┤
│ alice-work   │ running     │ alice       │ cpu         │ i-0123456789abcdef0 │
│ bob-training │ running     │ bob         │ a10g        │ i-0fedcba987654321  │
│ batch-worker │ stopped     │ charlie     │ cpumax      │ i-0abc123def456789  │
╰──────────────┴─────────────┴─────────────┴─────────────┴─────────────────────╯
Total: 3

```

**Formatting:**
- Full table borders with Unicode box-drawing characters
- Engine names are displayed in blue
- State is color-coded: green for "running", yellow for "starting/stopping", grey for "stopped"
- Instance IDs are displayed in grey
- Name column width adjusts dynamically to fit the longest engine name

---

### Access

#### `dh engine2 config-ssh`

Update `~/.ssh/config` with entries for all running engines.

**Usage:**
```bash
dh engine2 config-ssh [options]
```

**Options:**
- `--clean` - Remove all managed entries (doesn't add new ones)
- `--all` - Include engines from all users (default: only your engines)
- `--admin` - Use `ec2-user` instead of owner username
- `--env <env>` - Target environment

**Examples:**
```bash
# Add your running engines
dh engine2 config-ssh

# Add all engines (all users)
dh engine2 config-ssh --all

# Remove managed entries
dh engine2 config-ssh --clean

# Add engines as admin user
dh engine2 config-ssh --admin
```

**Output:**
```
✓ Updated SSH config with 3 engine(s)
```

**Managed section in ~/.ssh/config:**
```
# BEGIN DAYHOFF ENGINES

Host my-engine
    HostName i-1234567890abcdef0
    User alice
    ProxyCommand aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p' --profile dev-devaccess

# END DAYHOFF ENGINES
```

**Note**: The `--profile` flag is automatically added based on `--env`:
- `--env dev` → `--profile dev-devaccess`
- `--env sand` → `--profile sand-devaccess`
- `--env prod` → `--profile prod-devaccess`

This ensures GUI applications like VS Code and Cursor can connect without inheriting shell environment variables.
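
To make the managed section concrete, here is a rough sketch of how one host entry could be rendered and how the block between the `BEGIN`/`END` markers could be swapped without touching the rest of the file. The function names are hypothetical and the real `config-ssh` implementation may differ. Only the marker text, the `ProxyCommand` line, and the `<env>-devaccess` profile pattern are taken from this document.

```python
BEGIN_MARKER = "# BEGIN DAYHOFF ENGINES"
END_MARKER = "# END DAYHOFF ENGINES"


def render_host_entry(name: str, instance_id: str, user: str, env: str = "dev") -> str:
    """Render one Host block; the AWS profile follows the <env>-devaccess pattern."""
    profile = f"{env}-devaccess"
    proxy = (
        "aws ssm start-session --target %h "
        "--document-name AWS-StartSSHSession "
        f"--parameters 'portNumber=%p' --profile {profile}"
    )
    return "\n".join([
        f"Host {name}",
        f"    HostName {instance_id}",
        f"    User {user}",
        f"    ProxyCommand {proxy}",
    ])


def replace_managed_section(config_text: str, entries: list) -> str:
    """Replace the marker-delimited block with `entries`, keeping all other text."""
    block = "\n".join([BEGIN_MARKER, "", *entries, "", END_MARKER])
    start = config_text.find(BEGIN_MARKER)
    end = config_text.find(END_MARKER)
    if start == -1 or end == -1 or end < start:
        # No managed section yet: append one after the existing content.
        prefix = config_text.rstrip("\n")
        return (prefix + "\n\n" if prefix else "") + block + "\n"
    return config_text[:start] + block + config_text[end + len(END_MARKER):]
```

Because only the text between the markers is rewritten, running the update repeatedly with the same engines leaves the file unchanged, and hand-written `Host` entries outside the markers are preserved.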

**Usage after config - Standard SSH:**
```bash
# Interactive SSH
ssh my-engine

# Execute remote commands
ssh my-engine "ls /studios"

# File transfer
scp local-file my-engine:/studios/alice/
rsync -avz project/ my-engine:/studios/alice/project/

# Port forwarding
ssh -L 8080:localhost:8080 my-engine

# VS Code Remote SSH
code --remote ssh-remote+my-engine /studios/alice/project

# VS Code Remote - Tunnels
code tunnel --name my-engine
```

**All standard SSH features work:**
- Command execution: `ssh <engine> "<command>"`
- File transfer: `scp`, `rsync`, `sftp`
- Port forwarding: `-L`, `-R`, `-D` flags
- IDE remote development: VS Code, Cursor, PyCharm
- SSH agent forwarding: `-A` flag
- Config file options: ControlMaster, compression, etc.

**Recommended workflow:**
1. Run `dh engine2 config-ssh` once after launching engines
2. Use native `ssh <engine-name>` for all access
3. Rerun `config-ssh` if you launch new engines

---

### Idle Detection Control

#### `dh engine2 coffee`

Set or cancel a "coffee lock" to prevent idle shutdown.

**Usage:**
```bash
dh engine2 coffee <name-or-id> <duration> [options]
dh engine2 coffee <name-or-id> --cancel [options]
```

**Arguments:**
- `name-or-id` - Engine name or EC2 instance ID
- `duration` - How long to keep alive (e.g., `4h`, `2h30m`, `45m`)

**Options:**
- `--cancel` - Cancel existing coffee lock
- `--env <env>` - Target environment

**Examples:**
```bash
# Keep alive for 4 hours
dh engine2 coffee my-engine 4h

# Keep alive for 2.5 hours
dh engine2 coffee my-engine 2h30m

# Cancel coffee lock
dh engine2 coffee my-engine --cancel
```

**Output:**
```
✓ Coffee lock set for 'my-engine': 4h
```

**Use cases:**
- Long-running training jobs without active SSH
- Batch processing where idle detector might trigger
- Overnight jobs that don't show activity

**Note:** Coffee lock is HIGH confidence - overrides all other sensors.
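
The duration strings accepted by `coffee` (`4h`, `2h30m`, `45m`) follow a simple hours-then-minutes pattern. A minimal parser for that pattern might look like the sketch below. `parse_duration` is a hypothetical helper, and whether the real CLI accepts other forms (seconds, days, bare numbers) is not stated here, so this sketch rejects them.

```python
import re

# Optional hours part followed by an optional minutes part, e.g. "2h30m".
_DURATION_RE = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?$")


def parse_duration(text: str) -> int:
    """Convert strings such as '4h', '2h30m' or '45m' into seconds."""
    match = _DURATION_RE.match(text.strip().lower())
    if not match or not any(match.groups()):
        raise ValueError(
            f"Invalid duration {text!r}; expected forms like 4h, 2h30m, 45m"
        )
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 3600 + minutes * 60
```

Under this sketch, `parse_duration("2h30m")` yields 9000 seconds, and an empty or malformed string raises `ValueError` rather than silently defaulting.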

---

#### `dh engine2 idle`

Show or configure idle timeout settings.

**Usage:**
```bash
dh engine2 idle <name-or-id> [options]
```

**Arguments:**
- `name-or-id` - Engine name or EC2 instance ID

**Options:**
- `--set <duration>` - Set new timeout (e.g., `2h`, `45m`)
- `--slack <none|default|all>` - Configure Slack notifications
- `--env <env>` - Target environment

**Examples:**
```bash
# Show current settings
dh engine2 idle my-engine

# Set 2-hour timeout
dh engine2 idle my-engine --set 2h

# Configure Slack notifications
dh engine2 idle my-engine --slack all
```

**Output:**
```
Idle Settings for 'my-engine':
Timeout: 30 minutes
Current State: ACTIVE
```

**Default timeout:** 30 minutes (1800 seconds)

---

### Maintenance

#### `dh engine2 resize`

Resize an engine's boot disk.

**Usage:**
```bash
dh engine2 resize <name-or-id> --size <GB> [options]
```

**Arguments:**
- `name-or-id` - Engine name or EC2 instance ID

**Options:**
- `-s, --size <GB>` - **Required.** New size in GB
- `--online` - Resize while running (requires manual filesystem expansion)
- `-f, --force` - Skip confirmation
- `--env <env>` - Target environment

**Examples:**
```bash
# Offline resize (requires stop/start)
dh engine2 resize my-engine --size 200 --force

# Online resize (advanced)
dh engine2 resize my-engine --size 200 --online --force
```

**Output (offline resize):**
```
✓ Boot disk resize initiated for 'my-engine'
Engine will be stopped and restarted
Filesystem will be automatically expanded
```

**Output (online resize):**
```
✓ Boot disk resize initiated for 'my-engine'

⚠ Manual filesystem expansion required:
ssh my-engine
sudo growpart /dev/nvme0n1 1
sudo xfs_growfs /
df -h # Verify new size
```

**Note:** Online resize keeps the engine running but requires manual steps. Offline resize (default) stops/starts the engine but handles filesystem expansion automatically.

---

#### `dh engine2 debug`

Debug engine bootstrap status and show detailed stage information.

**Usage:**
```bash
dh engine2 debug <name-or-id> [--env <env>]
```

**Arguments:**
- `name-or-id` - Engine name or EC2 instance ID

**Examples:**
```bash
dh engine2 debug my-engine
```

**Output:**
```
Engine: i-1234567890abcdef0
Ready: False
Current Stage: installing_packages

Bootstrap Stages:
✓ 1. instance_running (30.0s)
✓ 2. downloading_scripts (8.0s)
⏳ 3. installing_packages
```

**Use case:** Troubleshooting stuck or failed bootstrap. Shows exactly which stage failed and timing information.

---

## Studio Commands

### Lifecycle Management

#### `dh studio2 create`

Create a new studio for the current user (or specified user with `--user` flag).

**Usage:**
```bash
dh studio2 create [options]
```

**Options:**
- `--size <GB>` - Studio size in GB (default: 100)
- `--user <username>` - User to create studio for (defaults to current user, use for testing/admin)
- `--env <env>` - Target environment

**Examples:**
```bash
# Create 100GB studio (default)
dh studio2 create

# Create 200GB studio
dh studio2 create --size 200

# Create studio for test user (testing/admin)
dh studio2 create --user testuser1 --size 50
```

**Output:**
```
Creating 100GB studio for alice...
✓ Studio created: vol-0123456789abcdef0

Attach to an engine with:
dh studio2 attach <engine-name>
```

**Limits:**
- One studio per user per environment
- Studio is encrypted with AWS-managed keys
- Billed at ~$0.08/GB-month for EBS storage
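
As a quick worked example of the storage rate quoted above, the helper below estimates the monthly cost of a studio. It is a back-of-the-envelope sketch: `monthly_storage_cost` is a hypothetical name, and the $0.08 figure is the approximate rate stated in this document, not a live AWS price.

```python
# Approximate EBS rate quoted in this document (USD per GB-month).
EBS_USD_PER_GB_MONTH = 0.08


def monthly_storage_cost(size_gb: int, rate: float = EBS_USD_PER_GB_MONTH) -> float:
    """Estimate the monthly storage cost in USD for a volume of `size_gb`."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return round(size_gb * rate, 2)
```

At that rate the default 100GB studio comes to roughly $8 per month and a 200GB studio to roughly $16, whether or not the studio is attached to an engine.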

---

#### `dh studio2 delete`

Delete your studio (or another user's studio with `--user` flag).

**Usage:**
```bash
dh studio2 delete [options]
```

**Options:**
- `-y, --yes` - Skip confirmation
- `--user <username>` - User whose studio to delete (defaults to current user, use for testing/admin)
- `--env <env>` - Target environment

**Examples:**
```bash
# With confirmation
dh studio2 delete

# Skip confirmation
dh studio2 delete -y

# Delete another user's studio (testing/admin)
dh studio2 delete --user testuser1 -y
```

**Warning prompt:**
```
⚠ WARNING: This will permanently delete all data in vol-0123456789abcdef0
Are you sure? [y/N]:
```

**Output:**
```
✓ Studio vol-0123456789abcdef0 deleted
```

**Requirements:**
- Studio must be detached (`dh studio2 detach` first)
- All data in the studio will be permanently lost

---

### Status and Information

#### `dh studio2 status`

Show information about your studio.

**Usage:**
```bash
dh studio2 status [--env <env>]
```

**Examples:**
```bash
dh studio2 status
```

**Output (available):**
```
Studio ID: vol-0123456789abcdef0
User: alice
Size: 100GB
Status: available
Created: 5 days ago
```

**Output (attached):**
```
Studio ID: vol-0123456789abcdef0
User: alice
Size: 100GB
Status: attached
Created: 5 days ago
Attached to: i-0fedcba987654321
```

**Statuses:**
- `available` - Ready to attach
- `attached` - Attached to an engine
- `attaching` - Attachment in progress
- `detaching` - Detachment in progress
- `error` - Stuck state (use `dh studio2 reset`)

---

#### `dh studio2 list`

List all studios in the environment.

**Usage:**
```bash
dh studio2 list [--env <env>]
```

**Examples:**
```bash
dh studio2 list
```

**Output:**
```

Studios for AWS Account dev
╭────────┬──────────────┬──────────────┬───────────┬───────────────────────────╮
│ User   │ Status       │ Attached To  │ Size      │ Studio ID                 │
├────────┼──────────────┼──────────────┼───────────┼───────────────────────────┤
│ alice  │ attached     │ alice-work   │ 100GB     │ vol-0123456789abcdef0     │
│ bob    │ available    │ -            │ 200GB     │ vol-0fedcba987654321      │
│ carol  │ attaching    │ carol-gpu    │ 150GB     │ vol-0abc123def456789      │
╰────────┴──────────────┴──────────────┴───────────┴───────────────────────────╯
Total: 3

```

**Formatting:**
- Full table borders with Unicode box-drawing characters
- User names are displayed in blue
- Status is color-coded: purple for "attached", green for "available", yellow for "attaching/detaching", red for "error"
- "Attached To" shows engine name, or "-" if not attached
- Studio IDs are displayed in grey
- User column width adjusts dynamically to fit the longest username
- Attached To column width adjusts dynamically to fit the longest engine name
- Columns are ordered: User, Status, Attached To, Size, Studio ID

---

### Attachment

#### `dh studio2 attach`

Attach your studio to an engine with real-time progress tracking through all 6 attachment stages.

**Usage:**
```bash
dh studio2 attach <engine-name-or-id> [--env <env>]
```

**Arguments:**
- `engine-name-or-id` - Engine name or EC2 instance ID

**Examples:**
```bash
dh studio2 attach my-engine
```

**Output with progress tracking:**
```
📎 Attaching studio to my-engine...

⏳ Attachment in progress...

Progress ████████████████████ 100%
Validate Engine
Find Device Slot
Attach Volume
Resolve Device
Mount Filesystem
Update State

✓ Studio attached successfully!

Your files are now available at:
/studios/alice/

Connect with:
ssh my-engine
```

**6-step attachment process:**
1. **Validate Engine** - Ensure engine is ready (~250ms)
2. **Find Device Slot** - Locate available `/dev/sd[f-p]` (~150ms)
3. **Attach Volume** - AWS EBS attachment (~8-10s)
4. **Resolve Device** - Map to NVMe device path via `/dev/disk/by-id/` (~2s)
5. **Mount Filesystem** - Execute mount script via SSM RunCommand (~5s)
6. **Update State** - Mark studio as `attached` in DynamoDB (~200ms)

**Total time:** ~15-20 seconds
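
Step 2 of the process above, finding a free device name in the `/dev/sd[f-p]` range, can be sketched as a scan over candidate names. This is an illustration under assumptions, not the service's code: `candidate_device_names` and `find_free_device_slot` are hypothetical helpers, and the set of names already in use is passed in rather than queried from AWS.

```python
import string


def candidate_device_names() -> list:
    """List the device names covered by the documented /dev/sd[f-p] range."""
    letters = string.ascii_lowercase
    first, last = letters.index("f"), letters.index("p")
    return [f"/dev/sd{letter}" for letter in letters[first:last + 1]]


def find_free_device_slot(in_use: set) -> str:
    """Return the first candidate device name that is not already attached."""
    for device in candidate_device_names():
        if device not in in_use:
            return device
    raise RuntimeError("No free device slot in /dev/sd[f-p]")
```

Choosing the lowest free name keeps assignments predictable. The later "Resolve Device" step still matters, because NVMe-based instances expose the volume under a different path than the name requested at attach time.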

**Requirements:**
- Studio must be in `available` status
- Engine must be in `ready` state
- Engine can have max 10 studios attached (device slots)

**Error handling:**
If attachment fails, the command shows a detailed error that names the failed step:
```
✗ Attachment failed: Mount filesystem timeout

Failed at step: mount_filesystem
Error: SSM command timeout after 30s
```

---
|
|
850
|
+
|
|
851
|
+
#### `dh studio2 detach`
|
|
852
|
+
|
|
853
|
+
Detach your studio from its engine.
|
|
854
|
+
|
|
855
|
+
**Usage:**
|
|
856
|
+
```bash
|
|
857
|
+
dh studio2 detach [--env <env>]
|
|
858
|
+
```
|
|
859
|
+
|
|
860
|
+
**Examples:**
|
|
861
|
+
```bash
|
|
862
|
+
dh studio2 detach
|
|
863
|
+
```
|
|
864
|
+
|
|
865
|
+
**Output:**
|
|
866
|
+
```
|
|
867
|
+
Detaching studio vol-0123456789abcdef0...
|
|
868
|
+
✓ Studio detached
|
|
869
|
+
```
|
|
870
|
+
|
|
871
|
+
**Process:**
|
|
872
|
+
1. Clean unmount with `sync`
|
|
873
|
+
2. AWS EBS detachment
|
|
874
|
+
3. Update studio status to `available`
|
|
875
|
+
|
|
876
|
+
**Use cases:**
|
|
877
|
+
- Moving studio to a different engine
|
|
878
|
+
- Shutting down engine but preserving studio data
|
|
879
|
+
- Preparing for studio deletion or resize
|
|
880
|
+
|
|
881
|
+
---
|
|
882
|
+
|
|
883
|
+
### Maintenance
|
|
884
|
+
|
|
885
|
+
#### `dh studio2 resize`
|
|
886
|
+
|
|
887
|
+
Resize your studio volume (requires detachment).
|
|
888
|
+
|
|
889
|
+
**Usage:**
|
|
890
|
+
```bash
|
|
891
|
+
dh studio2 resize --size <GB> [options]
|
|
892
|
+
```
|
|
893
|
+
|
|
894
|
+
**Options:**
|
|
895
|
+
- `-s, --size <GB>` - **Required.** New size in GB
|
|
896
|
+
- `-y, --yes` - Skip confirmation
|
|
897
|
+
- `--user <username>` - User whose studio to resize (defaults to the current user; intended for testing/admin)
|
|
898
|
+
- `--env <env>` - Target environment
|
|
899
|
+
|
|
900
|
+
**Examples:**
|
|
901
|
+
```bash
|
|
902
|
+
# With confirmation
|
|
903
|
+
dh studio2 resize --size 200
|
|
904
|
+
|
|
905
|
+
# Skip confirmation
|
|
906
|
+
dh studio2 resize --size 200 -y
|
|
907
|
+
|
|
908
|
+
# Resize test user's studio (testing/admin)
|
|
909
|
+
dh studio2 resize --size 200 -y --user testuser1
|
|
910
|
+
```
|
|
911
|
+
|
|
912
|
+
**Output:**
|
|
913
|
+
```
|
|
914
|
+
Resize studio from 100GB to 200GB? [y/N]: y
|
|
915
|
+
✓ Studio resize initiated: 100GB → 200GB
|
|
916
|
+
```
|
|
917
|
+
|
|
918
|
+
**Requirements:**
|
|
919
|
+
- Studio must be detached
|
|
920
|
+
- New size must be larger than current size (no shrinking)
|
|
921
|
+
- Filesystem automatically expands on next attach
|
|
922
|
+
|
|
923
|
+
**Note:** You're billed for the new size immediately (~$0.08/GB-month).
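
The requirements and the billing note can be combined into a pre-flight check. The sketch below is illustrative only: `plan_resize` is a hypothetical helper, it assumes a detached studio reports the `available` status, and it uses the approximate $0.08/GB-month figure quoted above.

```python
PRICE_PER_GB_MONTH = 0.08  # approximate figure from the note above


def plan_resize(current_gb, new_gb, status):
    """Validate a resize request and estimate its monthly storage cost."""
    if status != "available":
        raise ValueError("studio must be detached before resizing")
    if new_gb <= current_gb:
        raise ValueError("new size must be larger than current size")
    added_gb = new_gb - current_gb
    return {
        "added_gb": added_gb,
        "monthly_cost": round(new_gb * PRICE_PER_GB_MONTH, 2),
        "monthly_increase": round(added_gb * PRICE_PER_GB_MONTH, 2),
    }


print(plan_resize(100, 200, "available"))
```

For the 100GB to 200GB example, this gives about $16 per month in total, an increase of about $8.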
|
|
924
|
+
|
|
925
|
+
---
|
|
926
|
+
|
|
927
|
+
#### `dh studio2 reset`
|
|
928
|
+
|
|
929
|
+
Reset a stuck studio to `available` status (admin operation).
|
|
930
|
+
|
|
931
|
+
**Usage:**
|
|
932
|
+
```bash
|
|
933
|
+
dh studio2 reset [options]
|
|
934
|
+
```
|
|
935
|
+
|
|
936
|
+
**Options:**
|
|
937
|
+
- `-y, --yes` - Skip confirmation
|
|
938
|
+
- `--user <username>` - User whose studio to reset (defaults to the current user; intended for testing/admin)
|
|
939
|
+
- `--env <env>` - Target environment
|
|
940
|
+
|
|
941
|
+
**Examples:**
|
|
942
|
+
```bash
|
|
943
|
+
# Reset your own studio (with confirmation)
|
|
944
|
+
dh studio2 reset
|
|
945
|
+
|
|
946
|
+
# Skip confirmation
|
|
947
|
+
dh studio2 reset -y
|
|
948
|
+
|
|
949
|
+
# Reset test user's studio (testing/admin)
|
|
950
|
+
dh studio2 reset -y --user testuser1
|
|
951
|
+
```
|
|
952
|
+
|
|
953
|
+
**Output:**
|
|
954
|
+
```
|
|
955
|
+
Studio: vol-0123456789abcdef0
|
|
956
|
+
Current Status: attaching
|
|
957
|
+
|
|
958
|
+
Reset studio status to 'available'? [y/N]: y
|
|
959
|
+
✓ Studio reset to 'available' status
|
|
960
|
+
Note: Manual cleanup may be required on engines
|
|
961
|
+
```
|
|
962
|
+
|
|
963
|
+
**Use cases:**
|
|
964
|
+
- Studio stuck in `attaching` or `detaching`
|
|
965
|
+
- Attachment operation failed and didn't revert
|
|
966
|
+
- DynamoDB state out of sync with actual state
|
|
967
|
+
|
|
968
|
+
**Warning:** This only resets the DynamoDB state. If the volume is still attached, you must detach it manually via the AWS console or unmount it on the engine.
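
Before or after a reset, it can help to confirm what EC2 itself reports for the volume. The sketch below only interprets a response shaped like the output of the EC2 `DescribeVolumes` API; the helper name `attached_instances` and the sample IDs are hypothetical, and the CLI may perform this check differently or not at all.

```python
def attached_instances(describe_volumes_response):
    """Return instance IDs that EC2 still reports as holding the volume."""
    live_states = ("attaching", "attached", "detaching", "busy")
    instances = []
    for volume in describe_volumes_response.get("Volumes", []):
        for attachment in volume.get("Attachments", []):
            if attachment.get("State") in live_states:
                instances.append(attachment["InstanceId"])
    return instances


# With AWS credentials, the response could come from something like:
#   boto3.client("ec2").describe_volumes(VolumeIds=["vol-0123456789abcdef0"])
# The sample below is made up for illustration.
sample = {
    "Volumes": [
        {
            "VolumeId": "vol-0123456789abcdef0",
            "State": "in-use",
            "Attachments": [
                {
                    "InstanceId": "i-0123456789abcdef0",
                    "Device": "/dev/sdf",
                    "State": "attached",
                }
            ],
        }
    ]
}

print(attached_instances(sample))
```

An empty result suggests the DynamoDB state and the real volume state agree after a reset; a non-empty result means manual cleanup is still needed.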
|
|
969
|
+
|
|
970
|
+
---
|
|
971
|
+
|
|
972
|
+
## Common Workflows
|
|
973
|
+
|
|
974
|
+
### Daily Development
|
|
975
|
+
|
|
976
|
+
```bash
|
|
977
|
+
# Launch engine
|
|
978
|
+
dh engine2 launch dev-work --type cpu
|
|
979
|
+
|
|
980
|
+
# Add to SSH config
|
|
981
|
+
dh engine2 config-ssh
|
|
982
|
+
|
|
983
|
+
# Create studio (first time only)
|
|
984
|
+
dh studio2 create --size 100
|
|
985
|
+
|
|
986
|
+
# Attach studio
|
|
987
|
+
dh studio2 attach dev-work
|
|
988
|
+
|
|
989
|
+
# Connect with native SSH
|
|
990
|
+
ssh dev-work
|
|
991
|
+
|
|
992
|
+
# When done, detach and terminate
|
|
993
|
+
dh studio2 detach
|
|
994
|
+
dh engine2 terminate dev-work -y
|
|
995
|
+
```
|
|
996
|
+
|
|
997
|
+
### GPU Training with Coffee Lock
|
|
998
|
+
|
|
999
|
+
```bash
|
|
1000
|
+
# Launch GPU engine
|
|
1001
|
+
dh engine2 launch training --type a10g
|
|
1002
|
+
|
|
1003
|
+
# Add to SSH config
|
|
1004
|
+
dh engine2 config-ssh
|
|
1005
|
+
|
|
1006
|
+
# Set coffee lock for long job
|
|
1007
|
+
dh engine2 coffee training 8h
|
|
1008
|
+
|
|
1009
|
+
# Attach and start work
|
|
1010
|
+
dh studio2 attach training
|
|
1011
|
+
ssh training
|
|
1012
|
+
|
|
1013
|
+
# Job runs without idle shutdown
|
|
1014
|
+
# When done:
|
|
1015
|
+
dh engine2 coffee training --cancel
|
|
1016
|
+
dh studio2 detach
|
|
1017
|
+
dh engine2 terminate training -y
|
|
1018
|
+
```
|
|
1019
|
+
|
|
1020
|
+
### Multi-Engine Development
|
|
1021
|
+
|
|
1022
|
+
```bash
|
|
1023
|
+
# Launch multiple engines
|
|
1024
|
+
dh engine2 launch frontend --type cpu
|
|
1025
|
+
dh engine2 launch backend --type cpu
|
|
1026
|
+
dh engine2 launch ml --type t4
|
|
1027
|
+
|
|
1028
|
+
# Update SSH config for easy access
|
|
1029
|
+
dh engine2 config-ssh
|
|
1030
|
+
|
|
1031
|
+
# Now use direct SSH
|
|
1032
|
+
ssh frontend
|
|
1033
|
+
ssh backend
|
|
1034
|
+
ssh ml
|
|
1035
|
+
```
|
|
1036
|
+
|
|
1037
|
+
### Monitoring Idle Detection
|
|
1038
|
+
|
|
1039
|
+
```bash
|
|
1040
|
+
# Check basic idle status
|
|
1041
|
+
dh engine2 status my-engine
|
|
1042
|
+
|
|
1043
|
+
# Check detailed sensor information
|
|
1044
|
+
dh engine2 status my-engine --detailed
|
|
1045
|
+
|
|
1046
|
+
# Shows all 4 sensors with confidence levels:
|
|
1047
|
+
# - SSH (HIGH)
|
|
1048
|
+
# - IDE (MEDIUM)
|
|
1049
|
+
# - Docker (MEDIUM)
|
|
1050
|
+
# - Coffee (HIGH)
|
|
1051
|
+
```
|
|
1052
|
+
|
|
1053
|
+
---
|
|
1054
|
+
|
|
1055
|
+
## Error Handling
|
|
1056
|
+
|
|
1057
|
+
### Common Errors
|
|
1058
|
+
|
|
1059
|
+
**"You already have a studio"**
|
|
1060
|
+
```
|
|
1061
|
+
✗ You already have a studio: vol-0123456789abcdef0
|
|
1062
|
+
Use 'dh studio2 delete' to remove it first
|
|
1063
|
+
```
|
|
1064
|
+
Solution: Delete the existing studio, or keep using it.
|
|
1065
|
+
|
|
1066
|
+
**"Studio must be detached before deletion"**
|
|
1067
|
+
```
|
|
1068
|
+
✗ Studio must be detached before deletion
|
|
1069
|
+
Run: dh studio2 detach
|
|
1070
|
+
```
|
|
1071
|
+
Solution: Detach studio first.
|
|
1072
|
+
|
|
1073
|
+
**"Studio is not available"**
|
|
1074
|
+
```
|
|
1075
|
+
✗ Studio is not available (status: attaching)
|
|
1076
|
+
```
|
|
1077
|
+
Solution: Wait for the current operation to complete, or run `dh studio2 reset` if the studio is stuck.
|
|
1078
|
+
|
|
1079
|
+
**"Could not fetch API URL"**
|
|
1080
|
+
```
|
|
1081
|
+
✗ Could not fetch API URL from /dev/studio-manager/api-url
|
|
1082
|
+
```
|
|
1083
|
+
Solution: Ensure you're authenticated to AWS and the environment is deployed.
|
|
1084
|
+
|
|
1085
|
+
**"Attachment failed"**
|
|
1086
|
+
The CLI shows a detailed error that names the failed step:
|
|
1087
|
+
```
|
|
1088
|
+
✗ Attachment failed: Mount filesystem timeout
|
|
1089
|
+
|
|
1090
|
+
Failed at step: mount_filesystem
|
|
1091
|
+
Error: SSM command timeout after 30s
|
|
1092
|
+
```
|
|
1093
|
+
Solution: Check that the engine is ready and the SSM agent is running, then retry the attachment.
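
The "retry the attachment" advice can be scripted with a small wrapper. This is a sketch under assumptions: `retry` is a hypothetical helper, the attempt count and delay are arbitrary, and the source does not say whether a failed attach always reverts cleanly. If the studio is left in `attaching`, `dh studio2 reset` may be needed before retrying.

```python
import time


def retry(operation, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call `operation` until it succeeds or the attempts run out.

    `operation` is any zero-argument callable that raises on failure.
    `sleep` is injectable so the wait can be skipped in tests.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as error:
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds)
    raise last_error


# Possible usage (not executed here):
#   import subprocess
#   retry(lambda: subprocess.run(
#       ["dh", "studio2", "attach", "my-engine"], check=True))
```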
|
|
1094
|
+
|
|
1095
|
+
---
|
|
1096
|
+
|
|
1097
|
+
## Progress Tracking Features
|
|
1098
|
+
|
|
1099
|
+
The v2 implementation includes real-time progress tracking for long-running operations:
|
|
1100
|
+
|
|
1101
|
+
### Launch Progress
|
|
1102
|
+
- Shows progress through 8 bootstrap stages
|
|
1103
|
+
- Real-time percentage completion
|
|
1104
|
+
- Stage timing information
|
|
1105
|
+
- Estimated time remaining
|
|
1106
|
+
|
|
1107
|
+
### Attachment Progress
|
|
1108
|
+
- Shows progress through 6 attachment steps
|
|
1109
|
+
- Visual progress bar
|
|
1110
|
+
- Step-by-step updates
|
|
1111
|
+
- Detailed error reporting if failure occurs
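
For illustration, a progress line like the one in the `attach` output can be rendered from a step count. This is a minimal sketch, not the CLI's rendering code: `render_bar`, the 20-character width, and the `░` filler for unfinished steps are assumptions.

```python
def render_bar(completed_steps, total_steps, width=20):
    """Render a text progress bar for `completed_steps` out of `total_steps`."""
    fraction = completed_steps / total_steps
    filled = round(fraction * width)
    bar = "█" * filled + "░" * (width - filled)
    return f"Progress {bar} {round(fraction * 100)}%"


print(render_bar(3, 6))  # halfway through the 6 attachment steps
print(render_bar(6, 6))  # all steps complete
```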
|
|
1112
|
+
|
|
1113
|
+
### Status Visibility
|
|
1114
|
+
- Real-time idle detector sensor states
|
|
1115
|
+
- Confidence levels for each sensor
|
|
1116
|
+
- Detailed activity information with `--detailed` flag
|
|
1117
|
+
- Clear indication of what's keeping engine awake
|
|
1118
|
+
|
|
1119
|
+
---
|
|
1120
|
+
|
|
1121
|
+
## Idle Detection Architecture
|
|
1122
|
+
|
|
1123
|
+
The v2 implementation uses a modular sensor-based idle detector with confidence levels:
|
|
1124
|
+
|
|
1125
|
+
### 4 Independent Sensors
|
|
1126
|
+
|
|
1127
|
+
1. **SSH Sensor** (HIGH confidence)
   - Uses `who -u` to detect active sessions
   - Filters out system users
   - HIGH confidence: presence/absence is definitive

2. **IDE Sensor** (MEDIUM confidence)
   - Detects VS Code/Cursor remote connections
   - Uses `ss -tanpo` to inspect TCP connections
   - 3 retries to avoid false positives
   - MEDIUM confidence: connections can be transient

3. **Docker Sensor** (MEDIUM confidence)
   - Detects non-dev workload containers
   - Filters out dev containers, system images, transient patterns
   - MEDIUM confidence: heuristic-based filtering

4. **Coffee Lock Sensor** (HIGH confidence)
   - Explicit user keep-alive via `/var/run/engine-coffee`
   - Timestamp-based expiration
   - HIGH confidence: user intent is clear
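
The coffee lock's timestamp-based expiration can be sketched as follows. The format of `/var/run/engine-coffee` is not documented here, so this example assumes the lock reduces to a Unix expiry timestamp. `parse_duration` and `coffee_lock_active` are hypothetical helpers; only the `8h` form appears in this document, and the `m` and `s` suffixes are assumptions.

```python
import time


def parse_duration(text):
    """Convert a duration such as '8h' to seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return int(text[:-1]) * units[text[-1]]


def coffee_lock_active(expires_at, now=None):
    """Return True while the current time is before the expiry timestamp."""
    now = time.time() if now is None else now
    return now < expires_at


# A lock set at t=1000 for 8 hours expires at t=29800.
set_at = 1000
expires_at = set_at + parse_duration("8h")
print(coffee_lock_active(expires_at, now=set_at + 60))
print(coffee_lock_active(expires_at, now=expires_at))
```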
|
|
1147
|
+
|
|
1148
|
+
### Decision Logic
|
|
1149
|
+
|
|
1150
|
+
Conservative fail-safe approach:
|
|
1151
|
+
```
|
|
1152
|
+
if any_sensor_has_high_confidence_activity:
    return ACTIVE
elif any_sensor_has_error:  # errors are reported as LOW confidence
    return ACTIVE  # Fail safe - don't shut down on errors
elif any_sensor_has_medium_confidence_activity:
    return ACTIVE
else:
    return IDLE
|
|
1160
|
+
```
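
The decision logic above can be written as runnable Python. This is a sketch under stated assumptions rather than the detector's actual code: `SensorReading`, `Confidence`, and `decide` are hypothetical names, and a sensor error is modeled as a reading with LOW confidence, following the pseudocode's annotation.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"  # used here to mark a sensor that failed to report


@dataclass
class SensorReading:
    name: str
    active: bool
    confidence: Confidence


def decide(readings):
    """Return (state, reason) using the conservative ordering above."""
    for r in readings:
        if r.active and r.confidence is Confidence.HIGH:
            return "ACTIVE", f"{r.name}: high-confidence activity"
    for r in readings:
        if r.confidence is Confidence.LOW:
            return "ACTIVE", f"{r.name}: sensor error (fail safe)"
    for r in readings:
        if r.active and r.confidence is Confidence.MEDIUM:
            return "ACTIVE", f"{r.name}: medium-confidence activity"
    return "IDLE", "no sensor reported activity"


readings = [
    SensorReading("SSH", False, Confidence.HIGH),
    SensorReading("IDE", True, Confidence.MEDIUM),
    SensorReading("Docker", False, Confidence.MEDIUM),
    SensorReading("Coffee", False, Confidence.HIGH),
]
print(decide(readings))
```

Every non-idle branch returns ACTIVE, so the ordering only affects which reason is reported. The engine is considered idle only when no sensor reports activity and none has failed.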
|
|
1161
|
+
|
|
1162
|
+
**Philosophy:** Better to waste a bit of compute than lose user work.
|
|
1163
|
+
|
|
1164
|
+
### Visibility
|
|
1165
|
+
|
|
1166
|
+
Use `dh engine2 status my-engine --detailed` to see:
|
|
1167
|
+
- Current state (ACTIVE/IDLE)
|
|
1168
|
+
- Reason for current state
|
|
1169
|
+
- All 4 sensor states with confidence levels
|
|
1170
|
+
- Detailed activity information for each sensor
|
|
1171
|
+
|
|
1172
|
+
---
|
|
1173
|
+
|
|
1174
|
+
## Storage Tiers
|
|
1175
|
+
|
|
1176
|
+
| Storage | Path | Speed | Use Case |
|
|
1177
|
+
|---------|------|-------|----------|
|
|
1178
|
+
| Studios | `/studios/{user}/` | Fast (<1ms) | Personal code, configs, experiments |
|
|
1179
|
+
| Primordial | `/primordial/` | Medium (1-3ms) | Shared datasets, batch I/O |
|
|
1180
|
+
| S3 | `s3://` | Slow (~100ms) | Archives, raw data, final models |
|
|
1181
|
+
|
|
1182
|
+
**Primordial Drive** (shared EFS):
|
|
1183
|
+
- Automatically mounted at `/primordial/` during bootstrap
|
|
1184
|
+
- Intelligent-Tiering: $0.30/GB-month → $0.016/GB-month after 30 days
|
|
1185
|
+
- Available on all engines and batch jobs
|
|
1186
|
+
- Use for datasets shared across users
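
To get a feel for the two Primordial price points, the sketch below estimates a monthly bill from the share of data that has gone untouched for 30 days. It is a rough illustration only: `monthly_cost` is a hypothetical helper, it uses the per-GB figures quoted above, and it ignores any access or transfer charges.

```python
STANDARD_PER_GB_MONTH = 0.30     # recently accessed data, from the list above
INFREQUENT_PER_GB_MONTH = 0.016  # data untouched for 30 days, from the list above


def monthly_cost(total_gb, fraction_untouched):
    """Estimate monthly storage cost given the fraction of cold data."""
    cold_gb = total_gb * fraction_untouched
    hot_gb = total_gb - cold_gb
    cost = hot_gb * STANDARD_PER_GB_MONTH + cold_gb * INFREQUENT_PER_GB_MONTH
    return round(cost, 2)


# 1000 GB with 80% untouched for 30 days
print(monthly_cost(1000, 0.8))
```

In this example the bill is about $72.80 per month, compared with $300 if all 1000 GB stayed in the standard tier.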
|
|
1187
|
+
|
|
1188
|
+
---
|
|
1189
|
+
|
|
1190
|
+
## Technical Implementation
|
|
1191
|
+
|
|
1192
|
+
### API Backend
|
|
1193
|
+
- **18 REST endpoints** via API Gateway + Lambda
|
|
1194
|
+
- **3 DynamoDB tables** for state management
|
|
1195
|
+
- **Optimistic locking** prevents race conditions
|
|
1196
|
+
- **Progress tracking** for all async operations stored in DynamoDB
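
Optimistic locking can be illustrated with an in-memory stand-in for a state table. This is a sketch of the general technique, not the backend's code: `StateTable`, `VersionConflict`, and the `version` attribute are hypothetical. In DynamoDB, a check like this is usually expressed as a conditional write.

```python
class VersionConflict(Exception):
    """Raised when an item changed since the caller last read it."""


class StateTable:
    """In-memory stand-in for a table with a per-item version counter."""

    def __init__(self):
        self._items = {}

    def put_new(self, key, status):
        self._items[key] = {"status": status, "version": 1}

    def get(self, key):
        return dict(self._items[key])  # copy, so callers hold a snapshot

    def update_status(self, key, new_status, expected_version):
        item = self._items[key]
        if item["version"] != expected_version:
            raise VersionConflict(f"{key} is at version {item['version']}")
        item["status"] = new_status
        item["version"] += 1


table = StateTable()
table.put_new("vol-0123456789abcdef0", "available")

# Two callers read the same snapshot, then both try to start an attach.
first = table.get("vol-0123456789abcdef0")
second = table.get("vol-0123456789abcdef0")

table.update_status("vol-0123456789abcdef0", "attaching", first["version"])
try:
    table.update_status("vol-0123456789abcdef0", "attaching", second["version"])
except VersionConflict as conflict:
    print("second caller rejected:", conflict)
```

Only the first writer wins. The second caller must re-read the item and decide whether its operation still makes sense, which is how this pattern avoids two concurrent attach requests both proceeding.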
|
|
1197
|
+
|
|
1198
|
+
### Security
|
|
1199
|
+
- **IAM Identity Center (SSO)** for authentication
|
|
1200
|
+
- **SSM Session Manager** for SSH (no bastion hosts)
|
|
1201
|
+
- **Encrypted EBS volumes** for studios
|
|
1202
|
+
- **Least-privilege IAM roles** for engines
|
|
1203
|
+
|
|
1204
|
+
### Monitoring
|
|
1205
|
+
- **CloudWatch metrics** for operations
|
|
1206
|
+
- **Slack notifications** (configurable per engine)
|
|
1207
|
+
- **Progress APIs** for real-time status
|
|
1208
|
+
- **Detailed logging** for debugging
|
|
1209
|
+
|
|
1210
|
+
---
|
|
1211
|
+
|
|
1212
|
+
## Migration Timeline
|
|
1213
|
+
|
|
1214
|
+
**Current Status:**
|
|
1215
|
+
- v2 implementation complete and tested
|
|
1216
|
+
- All commands available via `engine2`/`studio2`
|
|
1217
|
+
- v1 commands remain available via `engine`/`studio`
|
|
1218
|
+
|
|
1219
|
+
**Production Deployment:**
|
|
1220
|
+
1. Deploy v2 to prod environment
|
|
1221
|
+
2. Team testing period (1-2 weeks)
|
|
1222
|
+
3. Promote `engine2` → `engine`, `studio2` → `studio`
|
|
1223
|
+
4. Deprecate v1 commands
|
|
1224
|
+
5. Remove v1 code after stabilization
|
|
1225
|
+
|
|
1226
|
+
**Why v2?**
|
|
1227
|
+
- Real-time progress eliminates "is it stuck?" questions
|
|
1228
|
+
- Detailed idle detector visibility prevents false shutdowns
|
|
1229
|
+
- Better error messages reduce debugging time
|
|
1230
|
+
- Click architecture enables better tooling integration
|