dayhoff-tools 1.1.10__py3-none-any.whl → 1.13.12__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (41)
  1. dayhoff_tools/__init__.py +10 -0
  2. dayhoff_tools/cli/cloud_commands.py +179 -43
  3. dayhoff_tools/cli/engine1/__init__.py +323 -0
  4. dayhoff_tools/cli/engine1/engine_core.py +703 -0
  5. dayhoff_tools/cli/engine1/engine_lifecycle.py +136 -0
  6. dayhoff_tools/cli/engine1/engine_maintenance.py +431 -0
  7. dayhoff_tools/cli/engine1/engine_management.py +505 -0
  8. dayhoff_tools/cli/engine1/shared.py +501 -0
  9. dayhoff_tools/cli/engine1/studio_commands.py +825 -0
  10. dayhoff_tools/cli/engines_studios/__init__.py +6 -0
  11. dayhoff_tools/cli/engines_studios/api_client.py +351 -0
  12. dayhoff_tools/cli/engines_studios/auth.py +144 -0
  13. dayhoff_tools/cli/engines_studios/engine-studio-cli.md +1230 -0
  14. dayhoff_tools/cli/engines_studios/engine_commands.py +1151 -0
  15. dayhoff_tools/cli/engines_studios/progress.py +260 -0
  16. dayhoff_tools/cli/engines_studios/simulators/cli-simulators.md +151 -0
  17. dayhoff_tools/cli/engines_studios/simulators/demo.sh +75 -0
  18. dayhoff_tools/cli/engines_studios/simulators/engine_list_simulator.py +319 -0
  19. dayhoff_tools/cli/engines_studios/simulators/engine_status_simulator.py +369 -0
  20. dayhoff_tools/cli/engines_studios/simulators/idle_status_simulator.py +476 -0
  21. dayhoff_tools/cli/engines_studios/simulators/simulator_utils.py +180 -0
  22. dayhoff_tools/cli/engines_studios/simulators/studio_list_simulator.py +374 -0
  23. dayhoff_tools/cli/engines_studios/simulators/studio_status_simulator.py +164 -0
  24. dayhoff_tools/cli/engines_studios/studio_commands.py +755 -0
  25. dayhoff_tools/cli/main.py +106 -7
  26. dayhoff_tools/cli/utility_commands.py +896 -179
  27. dayhoff_tools/deployment/base.py +70 -6
  28. dayhoff_tools/deployment/deploy_aws.py +165 -25
  29. dayhoff_tools/deployment/deploy_gcp.py +78 -5
  30. dayhoff_tools/deployment/deploy_utils.py +20 -7
  31. dayhoff_tools/deployment/job_runner.py +9 -4
  32. dayhoff_tools/deployment/processors.py +230 -418
  33. dayhoff_tools/deployment/swarm.py +47 -12
  34. dayhoff_tools/embedders.py +28 -26
  35. dayhoff_tools/fasta.py +181 -64
  36. dayhoff_tools/warehouse.py +268 -1
  37. {dayhoff_tools-1.1.10.dist-info → dayhoff_tools-1.13.12.dist-info}/METADATA +20 -5
  38. dayhoff_tools-1.13.12.dist-info/RECORD +54 -0
  39. {dayhoff_tools-1.1.10.dist-info → dayhoff_tools-1.13.12.dist-info}/WHEEL +1 -1
  40. dayhoff_tools-1.1.10.dist-info/RECORD +0 -32
  41. {dayhoff_tools-1.1.10.dist-info → dayhoff_tools-1.13.12.dist-info}/entry_points.txt +0 -0
@@ -0,0 +1,1230 @@
1
+ # Engine & Studio CLI Commands (v2)
2
+
3
+ Comprehensive CLI for managing ephemeral compute engines and persistent studio volumes with **real-time progress tracking and enhanced observability**.
4
+
5
+ ## Overview
6
+
7
+ This is the **new implementation** of the engines/studios CLI, currently accessed via `dh engine2` and `dh studio2` during the migration period.
8
+
9
+ ### Key Improvements Over v1
10
+ - ✅ **Real-time progress tracking** for launch and attach operations
11
+ - ✅ **Detailed idle detector visibility** with sensor-level information
12
+ - ✅ **Click-based architecture** for better composability
13
+ - ✅ **Comprehensive error messages** with actionable guidance
14
+ - ✅ **Environment flag support** across all commands
15
+
16
+ ### Command Migration
17
+
18
+ **Current (during transition):**
19
+ - `dh engine` / `dh studio` → Legacy Typer-based commands (v1)
20
+ - `dh engine2` / `dh studio2` → New Click-based commands with progress (v2)
21
+
22
+ **After production deployment:**
23
+ - `dh engine2` will become `dh engine`
24
+ - `dh studio2` will become `dh studio`
25
+ - v1 commands will be deprecated
26
+
27
+ ### System Components
28
+ - **Engines**: Ephemeral EC2 instances for compute (CPU, GPU types)
29
+ - **Studios**: Persistent EBS volumes that attach/detach from engines
30
+ - **Auto-shutdown**: Modular idle detection prevents runaway costs
31
+ - **Progress APIs**: Real-time status updates during async operations
32
+
33
+ ## Global Options
34
+
35
+ All commands support:
36
+ - `--env <dev|sand|prod>` - Target environment (default: dev)
37
+ - `--help` - Show command help
38
+
39
+ ## Engine Commands
40
+
41
+ ### Lifecycle Management
42
+
43
+ #### `dh engine2 launch`
44
+
45
+ Launch a new engine and wait for it to be ready with real-time progress tracking.
46
+
47
+ **Usage:**
48
+ ```bash
49
+ dh engine2 launch <name> --type <type> [options]
50
+ ```
51
+
52
+ **Arguments:**
53
+ - `name` - Unique name for the engine (used for SSH, identification)
54
+
55
+ **Options:**
56
+ - `--type <type>` - **Required.** Engine type:
57
+ - `cpu` - r6i.2xlarge (8 vCPU, 64GB RAM)
58
+ - `cpumax` - r7i.8xlarge (32 vCPU, 256GB RAM)
59
+ - `t4` - g4dn.2xlarge (T4 GPU, 16GB VRAM)
60
+ - `a10g` - g5.2xlarge (A10G GPU, 24GB VRAM)
61
+ - `a100` - p4d.24xlarge (8x A100, 40GB VRAM each)
62
+ - `4_t4`, `8_t4` - Multi-GPU T4 instances
63
+ - `4_a10g`, `8_a10g` - Multi-GPU A10G instances
64
+ - `--size <GB>` - Boot disk size in GB (optional)
65
+ - `--user <username>` - User to launch the engine for (defaults to the current user; intended for testing/admin use)
66
+ - `--no-wait` - Return immediately without waiting for readiness
67
+ - `-y, --yes` - Skip confirmation for non-dev environments
68
+ - `--env <env>` - Target environment (default: dev)
69
+
70
+ **Examples:**
71
+ ```bash
72
+ # Launch CPU engine for development
73
+ dh engine2 launch dev-work --type cpu
74
+
75
+ # Launch GPU engine with custom disk size
76
+ dh engine2 launch training-job --type a10g --size 200
77
+
78
+ # Launch without waiting (check status later)
79
+ dh engine2 launch batch-worker --type cpumax --no-wait
80
+
81
+ # Launch engine for test user (testing/admin)
82
+ dh engine2 launch e2e-engine --type cpu --user testuser1
83
+ ```
84
+
85
+ **Output with progress tracking:**
86
+ ```
87
+ 🚀 Launching cpu engine 'my-engine'...
88
+ ✓ EC2 instance launched: i-1234567890abcdef0
89
+
90
+ ⏳ Waiting for engine to be ready (typically 2-3 minutes)...
91
+
92
+ Progress ████████████░░░░░░░░░░ 60%
93
+ [5s] Instance Running
94
+ [8s] Downloading Scripts
95
+ [15s] Installing Packages
96
+ [22s] Mounting Primordial Drive
97
+ [45s] Configuring Idle Detector
98
+ [52s] Finalizing
99
+
100
+ ✓ Engine ready!
101
+
102
+ Connect with:
103
+ dh engine2 config-ssh # Add to SSH config
104
+ ssh my-engine # Then use native SSH
105
+ ```
106
+
107
+ **Bootstrap stages (9 total):**
108
+ 1. Instance running
109
+ 2. Downloading scripts
110
+ 3. Installing packages
111
+ 4. Mounting Primordial Drive
112
+ 5. Installing GPU drivers (if applicable)
113
+ 6. Creating environment
114
+ 7. Configuring idle detector
115
+ 8. Configuring SSH (passwordless access for IDE connections)
116
+ 9. Ready
117
+
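The percentage shown during launch can be pictured as the share of bootstrap stages completed so far. The following is a minimal Python sketch under that assumption; the helper name `render_progress`, the 20-character bar width, and the completed/total ratio are illustrative choices, not taken from the package's `progress.py`.

```python
def render_progress(completed: int, total: int = 9, width: int = 20) -> str:
    """Render a text progress bar for `completed` of `total` stages."""
    if total <= 0:
        raise ValueError("total must be positive")
    completed = max(0, min(completed, total))  # clamp to a valid range
    percent = round(100 * completed / total)
    filled = round(width * completed / total)
    return f"Progress {'█' * filled}{'░' * (width - filled)} {percent}%"


# Hypothetical usage: three of the nine stages are done.
print(render_progress(3))
print(render_progress(9))
```

Clamping `completed` keeps the bar well-formed even if a status response reports an out-of-range stage index.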
118
+ **Bootstrap time:**
119
+ - CPU: 1-2 minutes
120
+ - GPU (first boot): 3-5 minutes (driver installation + reboot)
121
+ - GPU (from GAMI): 1-2 minutes
122
+
123
+ ---
124
+
125
+ #### `dh engine2 start`
126
+
127
+ Start a stopped engine.
128
+
129
+ **Usage:**
130
+ ```bash
131
+ dh engine2 start <name-or-id> [options]
132
+ ```
133
+
134
+ **Arguments:**
135
+ - `name-or-id` - Engine name or EC2 instance ID
136
+
137
+ **Options:**
138
+ - `--no-wait` - Return immediately without waiting for readiness
139
+ - `--skip-ssh-config` - Don't automatically update SSH config
140
+ - `-y, --yes` - Skip confirmation for non-dev environments
141
+ - `--env <env>` - Target environment
142
+
143
+ **Examples:**
144
+ ```bash
145
+ dh engine2 start my-engine
146
+ dh engine2 start i-1234567890abcdef0
147
+ ```
148
+
149
+ **Output:**
150
+ ```
151
+ Starting engine 'my-engine'...
152
+ ✓ Engine 'my-engine' is starting
153
+ ```
154
+
155
+ **Note:** Starting an engine does not re-run bootstrap. The engine resumes from its previous state.
156
+
157
+ ---
158
+
159
+ #### `dh engine2 stop`
160
+
161
+ Stop a running engine (keeps EBS boot disk).
162
+
163
+ **Usage:**
164
+ ```bash
165
+ dh engine2 stop <name-or-id> [options]
166
+ ```
167
+
168
+ **Arguments:**
169
+ - `name-or-id` - Engine name or EC2 instance ID
170
+
171
+ **Options:**
172
+ - `-y, --yes` - Skip confirmation for non-dev environments
173
+ - `--env <env>` - Target environment
174
+
175
+ **Examples:**
176
+ ```bash
177
+ dh engine2 stop my-engine
178
+ ```
179
+
180
+ **Output:**
181
+ ```
182
+ Stopping engine 'my-engine'...
183
+ ✓ Engine 'my-engine' is stopping
184
+ ```
185
+
186
+ **Note:** Stopped engines still incur EBS storage costs (~$0.08/GB-month for boot disk). Studios must be detached before stopping.
187
+
188
+ ---
189
+
190
+ #### `dh engine2 terminate`
191
+
192
+ Permanently terminate an engine (deletes EBS boot disk).
193
+
194
+ **Usage:**
195
+ ```bash
196
+ dh engine2 terminate <name-or-id> [options]
197
+ ```
198
+
199
+ **Arguments:**
200
+ - `name-or-id` - Engine name or EC2 instance ID
201
+
202
+ **Options:**
203
+ - `-y, --yes` - Skip confirmation prompt
204
+ - `--env <env>` - Target environment
205
+
206
+ **Examples:**
207
+ ```bash
208
+ # With confirmation
209
+ dh engine2 terminate my-engine
210
+
211
+ # Skip confirmation
212
+ dh engine2 terminate my-engine -y
213
+ ```
214
+
215
+ **Output:**
216
+ ```
217
+ Terminate engine 'my-engine' (i-1234567890abcdef0)? [y/N]: y
218
+ ✓ Engine 'my-engine' is terminating
219
+ ```
220
+
221
+ **Warning:** This permanently deletes the engine's boot disk. Any data not in studios or Primordial Drive will be lost.
222
+
223
+ ---
224
+
225
+ ### Status and Information
226
+
227
+ #### `dh engine2 status`
228
+
229
+ Show comprehensive engine status including idle detector state with real-time sensor data.
230
+
231
+ **Usage:**
232
+ ```bash
233
+ dh engine2 status <name-or-id> [options]
234
+ ```
235
+
236
+ **Arguments:**
237
+ - `name-or-id` - Engine name or EC2 instance ID
238
+
239
+ **Options:**
240
+ - `--detailed` - Show detailed sensor information with confidence levels
241
+ - `--env <env>` - Target environment
242
+
243
+ **Examples:**
244
+ ```bash
245
+ # Basic status
246
+ dh engine2 status my-engine
247
+
248
+ # Detailed status with sensor breakdown
249
+ dh engine2 status my-engine --detailed
250
+ ```
251
+
252
+ **Output (basic):**
253
+ ```
254
+ Engine: my-engine
255
+ Instance ID: i-1234567890abcdef0
256
+ Type: cpu
257
+ State: running
258
+ Public IP: 54.123.45.67
259
+ Launched: 2 hours ago
260
+
261
+ Idle Status: 🟢 ACTIVE
262
+ Reason: ssh: 1 active SSH session(s)
263
+ ```
264
+
265
+ **Output (detailed):**
266
+ ```
267
+ Engine: my-engine
268
+ Instance ID: i-1234567890abcdef0
269
+ Type: cpu
270
+ State: running
271
+
272
+ Idle Status: 🟢 ACTIVE
273
+ Reason: ssh: 1 active SSH session(s)
274
+
275
+ ============================================================
276
+ Activity Sensors:
277
+ ============================================================
278
+
279
+ ✓ SSH (HIGH confidence)
280
+ 1 active SSH session(s)
281
+ sessions:
282
+ - alice pts/0 2025-11-10 14:30 old 12345
283
+
284
+ ✗ IDE (MEDIUM confidence)
285
+ No IDE connections found
286
+
287
+ ✗ DOCKER (MEDIUM confidence)
288
+ No workload containers
289
+ ignored:
290
+ - devcontainer-1 (dev-container)
291
+
292
+ ✗ COFFEE (HIGH confidence)
293
+ No coffee lock
294
+ ```
295
+
296
+ **Idle detector sensors:**
297
+ - **SSH** (HIGH confidence) - Detects active SSH sessions via `who -u`
298
+ - **IDE** (MEDIUM confidence) - Detects VS Code/Cursor remote connections
299
+ - **Docker** (MEDIUM confidence) - Detects non-dev workload containers
300
+ - **Coffee** (HIGH confidence) - Explicit user keep-alive lock
301
+
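One way to read the sensor list: each sensor reports whether it saw activity, and the engine counts as active if any sensor does. The Python sketch below is an assumption about how such readings might be combined, not the idle detector's actual logic; `SensorReading` and `summarize` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SensorReading:
    name: str        # e.g. "ssh", "ide", "docker", "coffee"
    active: bool     # True if the sensor saw activity
    confidence: str  # "HIGH" or "MEDIUM"
    detail: str      # human-readable reason


def summarize(readings: list[SensorReading]) -> tuple[str, str]:
    """Return (status, reason); assumes any active sensor keeps the engine alive."""
    active = [r for r in readings if r.active]
    if not active:
        return "IDLE", "no sensor reported activity"
    # Prefer a HIGH-confidence sensor when reporting the reason.
    active.sort(key=lambda r: r.confidence != "HIGH")
    top = active[0]
    return "ACTIVE", f"{top.name}: {top.detail}"


readings = [
    SensorReading("ssh", True, "HIGH", "1 active SSH session(s)"),
    SensorReading("ide", False, "MEDIUM", "No IDE connections found"),
    SensorReading("docker", False, "MEDIUM", "No workload containers"),
    SensorReading("coffee", False, "HIGH", "No coffee lock"),
]
print(summarize(readings))
```

The reason string mirrors the `Reason:` line in the status output above, where the sensor name prefixes its detail message.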
302
+ ---
303
+
304
+ #### `dh engine2 list`
305
+
306
+ List all engines in the environment.
307
+
308
+ **Usage:**
309
+ ```bash
310
+ dh engine2 list [--env <env>]
311
+ ```
312
+
313
+ **Examples:**
314
+ ```bash
315
+ # List engines in dev
316
+ dh engine2 list
317
+
318
+ # List engines in sand
319
+ dh engine2 list --env sand
320
+ ```
321
+
322
+ **Output:**
323
+ ```
324
+
325
+ Engines for AWS Account dev
326
+ ╭──────────────┬─────────────┬─────────────┬─────────────┬─────────────────────╮
327
+ │ Name │ State │ User │ Type │ Instance ID │
328
+ ├──────────────┼─────────────┼─────────────┼─────────────┼─────────────────────┤
329
+ │ alice-work │ running │ alice │ cpu │ i-0123456789abcdef0 │
330
+ │ bob-training │ running │ bob │ a10g │ i-0fedcba987654321 │
331
+ │ batch-worker │ stopped │ charlie │ cpumax │ i-0abc123def456789 │
332
+ ╰──────────────┴─────────────┴─────────────┴─────────────┴─────────────────────╯
333
+ Total: 3
334
+
335
+ ```
336
+
337
+ **Formatting:**
338
+ - Full table borders with Unicode box-drawing characters
339
+ - Engine names are displayed in blue
340
+ - State is color-coded: green for "running", yellow for "starting/stopping", grey for "stopped"
341
+ - Instance IDs are displayed in grey
342
+ - Name column width adjusts dynamically to fit the longest engine name
343
+
344
+ ---
345
+
346
+ ### Access
347
+
348
+ #### `dh engine2 config-ssh`
349
+
350
+ Update `~/.ssh/config` with entries for all running engines.
351
+
352
+ **Usage:**
353
+ ```bash
354
+ dh engine2 config-ssh [options]
355
+ ```
356
+
357
+ **Options:**
358
+ - `--clean` - Remove all managed entries (doesn't add new ones)
359
+ - `--all` - Include engines from all users (default: only your engines)
360
+ - `--admin` - Use `ec2-user` instead of owner username
361
+ - `--env <env>` - Target environment
362
+
363
+ **Examples:**
364
+ ```bash
365
+ # Add your running engines
366
+ dh engine2 config-ssh
367
+
368
+ # Add all engines (all users)
369
+ dh engine2 config-ssh --all
370
+
371
+ # Remove managed entries
372
+ dh engine2 config-ssh --clean
373
+
374
+ # Add engines as admin user
375
+ dh engine2 config-ssh --admin
376
+ ```
377
+
378
+ **Output:**
379
+ ```
380
+ ✓ Updated SSH config with 3 engine(s)
381
+ ```
382
+
383
+ **Managed section in ~/.ssh/config:**
384
+ ```
385
+ # BEGIN DAYHOFF ENGINES
386
+
387
+ Host my-engine
388
+ HostName i-1234567890abcdef0
389
+ User alice
390
+ ProxyCommand aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p' --profile dev-devaccess
391
+
392
+ # END DAYHOFF ENGINES
393
+ ```
394
+
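The managed section can be thought of as a block that is rewritten wholesale between the two marker comments while the rest of the file is left alone. The Python sketch below illustrates that idea; `host_entry` and `update_config` are hypothetical helpers, and only the marker text, the stanza fields, and the `<env>-devaccess` profile pattern come from this document.

```python
BEGIN = "# BEGIN DAYHOFF ENGINES"
END = "# END DAYHOFF ENGINES"


def host_entry(name: str, instance_id: str, user: str, env: str) -> str:
    """Build one Host stanza; the profile follows the `<env>-devaccess` pattern."""
    proxy = (
        "aws ssm start-session --target %h "
        "--document-name AWS-StartSSHSession "
        f"--parameters 'portNumber=%p' --profile {env}-devaccess"
    )
    return (
        f"Host {name}\n"
        f"    HostName {instance_id}\n"
        f"    User {user}\n"
        f"    ProxyCommand {proxy}\n"
    )


def update_config(config_text: str, entries: list[str]) -> str:
    """Replace the managed section (or append it) and keep all other lines."""
    kept, inside = [], False
    for line in config_text.splitlines():
        if line.strip() == BEGIN:
            inside = True
        elif line.strip() == END:
            inside = False
        elif not inside:
            kept.append(line)
    managed = [BEGIN, ""]
    for entry in entries:
        managed.extend(entry.splitlines())
        managed.append("")
    managed.append(END)
    return "\n".join(kept + managed) + "\n"


existing = "Host other\n    HostName example.com\n"
entry = host_entry("my-engine", "i-1234567890abcdef0", "alice", "dev")
print(update_config(existing, [entry]))
```

Because the function drops everything between the markers before re-adding entries, running it repeatedly does not duplicate stanzas.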
395
+ **Note**: The `--profile` flag is automatically added based on `--env`:
396
+ - `--env dev` → `--profile dev-devaccess`
397
+ - `--env sand` → `--profile sand-devaccess`
398
+ - `--env prod` → `--profile prod-devaccess`
399
+
400
+ This ensures GUI applications like VS Code and Cursor can connect without inheriting shell environment variables.
401
+
402
+ **Usage after config - Standard SSH:**
403
+ ```bash
404
+ # Interactive SSH
405
+ ssh my-engine
406
+
407
+ # Execute remote commands
408
+ ssh my-engine "ls /studios"
409
+
410
+ # File transfer
411
+ scp local-file my-engine:/studios/alice/
412
+ rsync -avz project/ my-engine:/studios/alice/project/
413
+
414
+ # Port forwarding
415
+ ssh -L 8080:localhost:8080 my-engine
416
+
417
+ # VS Code Remote SSH
418
+ code --remote ssh-remote+my-engine /studios/alice/project
419
+
420
+ # VS Code Remote - Tunnels
421
+ code tunnel --name my-engine
422
+ ```
423
+
424
+ **All standard SSH features work:**
425
+ - Command execution: `ssh <engine> "<command>"`
426
+ - File transfer: `scp`, `rsync`, `sftp`
427
+ - Port forwarding: `-L`, `-R`, `-D` flags
428
+ - IDE remote development: VS Code, Cursor, PyCharm
429
+ - SSH agent forwarding: `-A` flag
430
+ - Config file options: ControlMaster, compression, etc.
431
+
432
+ **Recommended workflow:**
433
+ 1. Run `dh engine2 config-ssh` once after launching engines
434
+ 2. Use native `ssh <engine-name>` for all access
435
+ 3. Rerun `config-ssh` if you launch new engines
436
+
437
+ ---
438
+
439
+ ### Idle Detection Control
440
+
441
+ #### `dh engine2 coffee`
442
+
443
+ Set or cancel a "coffee lock" to prevent idle shutdown.
444
+
445
+ **Usage:**
446
+ ```bash
447
+ dh engine2 coffee <name-or-id> <duration> [options]
448
+ dh engine2 coffee <name-or-id> --cancel [options]
449
+ ```
450
+
451
+ **Arguments:**
452
+ - `name-or-id` - Engine name or EC2 instance ID
453
+ - `duration` - How long to keep alive (e.g., `4h`, `2h30m`, `45m`)
454
+
455
+ **Options:**
456
+ - `--cancel` - Cancel existing coffee lock
457
+ - `--env <env>` - Target environment
458
+
459
+ **Examples:**
460
+ ```bash
461
+ # Keep alive for 4 hours
462
+ dh engine2 coffee my-engine 4h
463
+
464
+ # Keep alive for 2.5 hours
465
+ dh engine2 coffee my-engine 2h30m
466
+
467
+ # Cancel coffee lock
468
+ dh engine2 coffee my-engine --cancel
469
+ ```
470
+
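Duration strings such as `4h`, `2h30m`, and `45m` combine an optional hour part with an optional minute part. As a rough illustration of how such a value could be converted to seconds, here is a Python sketch; `parse_duration` is a hypothetical helper, and the CLI's own parser may accept other forms.

```python
import re

# Optional "<n>h" followed by optional "<n>m"; at least one part is required.
_DURATION = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?$")


def parse_duration(text: str) -> int:
    """Convert strings such as '4h', '2h30m', or '45m' to seconds."""
    match = _DURATION.match(text.strip().lower())
    if not match or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 3600 + minutes * 60


for value in ("4h", "2h30m", "45m"):
    print(value, parse_duration(value))
```

Rejecting an empty match matters here, because the pattern on its own would otherwise accept an empty string as a zero-length duration.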
471
+ **Output:**
472
+ ```
473
+ ✓ Coffee lock set for 'my-engine': 4h
474
+ ```
475
+
476
+ **Use cases:**
477
+ - Long-running training jobs without active SSH
478
+ - Batch processing where idle detector might trigger
479
+ - Overnight jobs that don't show activity
480
+
481
+ **Note:** The coffee lock is a HIGH-confidence sensor; while it is set, it overrides all other sensors.
482
+
483
+ ---
484
+
485
+ #### `dh engine2 idle`
486
+
487
+ Show or configure idle timeout settings.
488
+
489
+ **Usage:**
490
+ ```bash
491
+ dh engine2 idle <name-or-id> [options]
492
+ ```
493
+
494
+ **Arguments:**
495
+ - `name-or-id` - Engine name or EC2 instance ID
496
+
497
+ **Options:**
498
+ - `--set <duration>` - Set new timeout (e.g., `2h`, `45m`)
499
+ - `--slack <none|default|all>` - Configure Slack notifications
500
+ - `--env <env>` - Target environment
501
+
502
+ **Examples:**
503
+ ```bash
504
+ # Show current settings
505
+ dh engine2 idle my-engine
506
+
507
+ # Set 2-hour timeout
508
+ dh engine2 idle my-engine --set 2h
509
+
510
+ # Configure Slack notifications
511
+ dh engine2 idle my-engine --slack all
512
+ ```
513
+
514
+ **Output:**
515
+ ```
516
+ Idle Settings for 'my-engine':
517
+ Timeout: 30 minutes
518
+ Current State: ACTIVE
519
+ ```
520
+
521
+ **Default timeout:** 30 minutes (1800 seconds)
522
+
523
+ ---
524
+
525
+ ### Maintenance
526
+
527
+ #### `dh engine2 resize`
528
+
529
+ Resize an engine's boot disk.
530
+
531
+ **Usage:**
532
+ ```bash
533
+ dh engine2 resize <name-or-id> --size <GB> [options]
534
+ ```
535
+
536
+ **Arguments:**
537
+ - `name-or-id` - Engine name or EC2 instance ID
538
+
539
+ **Options:**
540
+ - `-s, --size <GB>` - **Required.** New size in GB
541
+ - `--online` - Resize while running (requires manual filesystem expansion)
542
+ - `-f, --force` - Skip confirmation
543
+ - `--env <env>` - Target environment
544
+
545
+ **Examples:**
546
+ ```bash
547
+ # Offline resize (requires stop/start)
548
+ dh engine2 resize my-engine --size 200 --force
549
+
550
+ # Online resize (advanced)
551
+ dh engine2 resize my-engine --size 200 --online --force
552
+ ```
553
+
554
+ **Output (offline resize):**
555
+ ```
556
+ ✓ Boot disk resize initiated for 'my-engine'
557
+ Engine will be stopped and restarted
558
+ Filesystem will be automatically expanded
559
+ ```
560
+
561
+ **Output (online resize):**
562
+ ```
563
+ ✓ Boot disk resize initiated for 'my-engine'
564
+
565
+ ⚠ Manual filesystem expansion required:
566
+ ssh my-engine
567
+ sudo growpart /dev/nvme0n1 1
568
+ sudo xfs_growfs /
569
+ df -h # Verify new size
570
+ ```
571
+
572
+ **Note:** Online resize keeps the engine running but requires manual steps. Offline resize (default) stops/starts the engine but handles filesystem expansion automatically.
573
+
574
+ ---
575
+
576
+ #### `dh engine2 debug`
577
+
578
+ Debug engine bootstrap status and show detailed stage information.
579
+
580
+ **Usage:**
581
+ ```bash
582
+ dh engine2 debug <name-or-id> [--env <env>]
583
+ ```
584
+
585
+ **Arguments:**
586
+ - `name-or-id` - Engine name or EC2 instance ID
587
+
588
+ **Examples:**
589
+ ```bash
590
+ dh engine2 debug my-engine
591
+ ```
592
+
593
+ **Output:**
594
+ ```
595
+ Engine: i-1234567890abcdef0
596
+ Ready: False
597
+ Current Stage: installing_packages
598
+
599
+ Bootstrap Stages:
600
+ ✓ 1. instance_running (30.0s)
601
+ ✓ 2. downloading_scripts (8.0s)
602
+ ⏳ 3. installing_packages
603
+ ```
604
+
605
+ **Use case:** Troubleshooting stuck or failed bootstrap. Shows exactly which stage failed and timing information.
606
+
607
+ ---
608
+
609
+ ## Studio Commands
610
+
611
+ ### Lifecycle Management
612
+
613
+ #### `dh studio2 create`
614
+
615
+ Create a new studio for the current user (or specified user with `--user` flag).
616
+
617
+ **Usage:**
618
+ ```bash
619
+ dh studio2 create [options]
620
+ ```
621
+
622
+ **Options:**
623
+ - `--size <GB>` - Studio size in GB (default: 100)
624
+ - `--user <username>` - User to create the studio for (defaults to the current user; intended for testing/admin use)
625
+ - `--env <env>` - Target environment
626
+
627
+ **Examples:**
628
+ ```bash
629
+ # Create 100GB studio (default)
630
+ dh studio2 create
631
+
632
+ # Create 200GB studio
633
+ dh studio2 create --size 200
634
+
635
+ # Create studio for test user (testing/admin)
636
+ dh studio2 create --user testuser1 --size 50
637
+ ```
638
+
639
+ **Output:**
640
+ ```
641
+ Creating 100GB studio for alice...
642
+ ✓ Studio created: vol-0123456789abcdef0
643
+
644
+ Attach to an engine with:
645
+ dh studio2 attach <engine-name>
646
+ ```
647
+
648
+ **Limits:**
649
+ - One studio per user per environment
650
+ - Studio is encrypted with AWS-managed keys
651
+ - Billed at ~$0.08/GB-month for EBS storage
652
+
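The storage rate translates directly into a monthly estimate. The Python sketch below works through the arithmetic using the approximate $0.08/GB-month figure quoted in this document; `monthly_storage_cost` is a hypothetical helper, and actual AWS pricing varies by region and volume type.

```python
EBS_RATE_PER_GB_MONTH = 0.08  # approximate rate quoted in this document


def monthly_storage_cost(size_gb: int, rate: float = EBS_RATE_PER_GB_MONTH) -> float:
    """Approximate monthly EBS charge in USD for a volume of `size_gb`."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return round(size_gb * rate, 2)


print(monthly_storage_cost(100))  # default studio size
print(monthly_storage_cost(200))
```

At that rate the default 100GB studio comes to roughly $8 per month, whether or not it is attached to an engine.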
653
+ ---
654
+
655
+ #### `dh studio2 delete`
656
+
657
+ Delete your studio (or another user's studio with `--user` flag).
658
+
659
+ **Usage:**
660
+ ```bash
661
+ dh studio2 delete [options]
662
+ ```
663
+
664
+ **Options:**
665
+ - `-y, --yes` - Skip confirmation
666
+ - `--user <username>` - User whose studio to delete (defaults to the current user; intended for testing/admin use)
667
+ - `--env <env>` - Target environment
668
+
669
+ **Examples:**
670
+ ```bash
671
+ # With confirmation
672
+ dh studio2 delete
673
+
674
+ # Skip confirmation
675
+ dh studio2 delete -y
676
+
677
+ # Delete another user's studio (testing/admin)
678
+ dh studio2 delete --user testuser1 -y
679
+ ```
680
+
681
+ **Warning prompt:**
682
+ ```
683
+ ⚠ WARNING: This will permanently delete all data in vol-0123456789abcdef0
684
+ Are you sure? [y/N]:
685
+ ```
686
+
687
+ **Output:**
688
+ ```
689
+ ✓ Studio vol-0123456789abcdef0 deleted
690
+ ```
691
+
692
+ **Requirements:**
693
+ - Studio must be detached (`dh studio2 detach` first)
694
+ - All data in the studio will be permanently lost
695
+
696
+ ---
697
+
698
+ ### Status and Information
699
+
700
+ #### `dh studio2 status`
701
+
702
+ Show information about your studio.
703
+
704
+ **Usage:**
705
+ ```bash
706
+ dh studio2 status [--env <env>]
707
+ ```
708
+
709
+ **Examples:**
710
+ ```bash
711
+ dh studio2 status
712
+ ```
713
+
714
+ **Output (available):**
715
+ ```
716
+ Studio ID: vol-0123456789abcdef0
717
+ User: alice
718
+ Size: 100GB
719
+ Status: available
720
+ Created: 5 days ago
721
+ ```
722
+
723
+ **Output (attached):**
724
+ ```
725
+ Studio ID: vol-0123456789abcdef0
726
+ User: alice
727
+ Size: 100GB
728
+ Status: attached
729
+ Created: 5 days ago
730
+ Attached to: i-0fedcba987654321
731
+ ```
732
+
733
+ **Statuses:**
734
+ - `available` - Ready to attach
735
+ - `attached` - Attached to an engine
736
+ - `attaching` - Attachment in progress
737
+ - `detaching` - Detachment in progress
738
+ - `error` - Stuck state (use `dh studio2 reset`)
739
+
740
+ ---
741
+
742
+ #### `dh studio2 list`
743
+
744
+ List all studios in the environment.
745
+
746
+ **Usage:**
747
+ ```bash
748
+ dh studio2 list [--env <env>]
749
+ ```
750
+
751
+ **Examples:**
752
+ ```bash
753
+ dh studio2 list
754
+ ```
755
+
756
+ **Output:**
757
+ ```
758
+
759
+ Studios for AWS Account dev
760
+ ╭────────┬──────────────┬──────────────┬───────────┬───────────────────────────╮
761
+ │ User │ Status │ Attached To │ Size │ Studio ID │
762
+ ├────────┼──────────────┼──────────────┼───────────┼───────────────────────────┤
763
+ │ alice │ attached │ alice-work │ 100GB │ vol-0123456789abcdef0 │
764
+ │ bob │ available │ - │ 200GB │ vol-0fedcba987654321 │
765
+ │ carol │ attaching │ carol-gpu │ 150GB │ vol-0abc123def456789 │
766
+ ╰────────┴──────────────┴──────────────┴───────────┴───────────────────────────╯
767
+ Total: 3
768
+
769
+ ```
770
+
771
+ **Formatting:**
772
+ - Full table borders with Unicode box-drawing characters
773
+ - User names are displayed in blue
774
+ - Status is color-coded: purple for "attached", green for "available", yellow for "attaching/detaching", red for "error"
775
+ - "Attached To" shows engine name, or "-" if not attached
776
+ - Studio IDs are displayed in grey
777
+ - User column width adjusts dynamically to fit the longest username
778
+ - Attached To column width adjusts dynamically to fit the longest engine name
779
+ - Columns are ordered: User, Status, Attached To, Size, Studio ID
780
+
781
+ ---
782
+
783
+ ### Attachment
784
+
785
+ #### `dh studio2 attach`
786
+
787
+ Attach your studio to an engine with real-time progress tracking through all 6 attachment stages.
788
+
789
+ **Usage:**
790
+ ```bash
791
+ dh studio2 attach <engine-name-or-id> [--env <env>]
792
+ ```
793
+
794
+ **Arguments:**
795
+ - `engine-name-or-id` - Engine name or EC2 instance ID
796
+
797
+ **Examples:**
798
+ ```bash
799
+ dh studio2 attach my-engine
800
+ ```
801
+
802
+ **Output with progress tracking:**
803
+ ```
804
+ 📎 Attaching studio to my-engine...
805
+
806
+ ⏳ Attachment in progress...
807
+
808
+ Progress ████████████████████ 100%
809
+ Validate Engine
810
+ Find Device Slot
811
+ Attach Volume
812
+ Resolve Device
813
+ Mount Filesystem
814
+ Update State
815
+
816
+ ✓ Studio attached successfully!
817
+
818
+ Your files are now available at:
819
+ /studios/alice/
820
+
821
+ Connect with:
822
+ ssh my-engine
823
+ ```
824
+
825
+ **6-step attachment process:**
826
+ 1. **Validate Engine** - Ensure engine is ready (~250ms)
827
+ 2. **Find Device Slot** - Locate available `/dev/sd[f-p]` (~150ms)
828
+ 3. **Attach Volume** - AWS EBS attachment (~8-10s)
829
+ 4. **Resolve Device** - Map to NVMe device path via `/dev/disk/by-id/` (~2s)
830
+ 5. **Mount Filesystem** - Execute mount script via SSM RunCommand (~5s)
831
+ 6. **Update State** - Mark studio as `attached` in DynamoDB (~200ms)
832
+
833
+ **Total time:** ~15-20 seconds
834
+
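The quoted total can be checked by adding up the per-step estimates. The Python sketch below does that arithmetic; apart from `mount_filesystem`, which appears in the error output, the step keys are hypothetical snake_case versions of the step names, and the durations are the approximate figures listed above.

```python
# Approximate per-step durations in seconds, as (low, high) bounds.
STEP_SECONDS = {
    "validate_engine": (0.25, 0.25),
    "find_device_slot": (0.15, 0.15),
    "attach_volume": (8.0, 10.0),
    "resolve_device": (2.0, 2.0),
    "mount_filesystem": (5.0, 5.0),
    "update_state": (0.2, 0.2),
}


def total_range(steps: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Sum the low and high bounds across all steps."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return round(low, 2), round(high, 2)


print(total_range(STEP_SECONDS))
```

The sum lands between about 15.6 and 17.6 seconds, consistent with the ~15-20 second total, and shows that the EBS attachment step dominates.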
835
+ **Requirements:**
836
+ - Studio must be in `available` status
837
+ - Engine must be in `ready` state
838
+ - An engine can have at most 10 studios attached (limited by available device slots)
839
+
840
+ **Error handling:**
841
+ If attachment fails, the CLI shows a detailed error that names the failed step:
842
+ ```
843
+ ✗ Attachment failed: Mount filesystem timeout
844
+
845
+ Failed at step: mount_filesystem
846
+ Error: SSM command timeout after 30s
847
+ ```
848
+
849
+ ---
850
+
851
+ #### `dh studio2 detach`
852
+
853
+ Detach your studio from its engine.
854
+
855
+ **Usage:**
856
+ ```bash
857
+ dh studio2 detach [--env <env>]
858
+ ```
859
+
860
+ **Examples:**
861
+ ```bash
862
+ dh studio2 detach
863
+ ```
864
+
865
+ **Output:**
866
+ ```
867
+ Detaching studio vol-0123456789abcdef0...
868
+ ✓ Studio detached
869
+ ```
870
+
871
+ **Process:**
872
+ 1. Clean unmount with `sync`
873
+ 2. AWS EBS detachment
874
+ 3. Update studio status to `available`
875
+
876
+ **Use cases:**
877
+ - Moving studio to a different engine
878
+ - Shutting down engine but preserving studio data
879
+ - Preparing for studio deletion or resize
880
+
881
+ ---
882
+
883
+ ### Maintenance
884
+
885
+ #### `dh studio2 resize`
886
+
887
+ Resize your studio volume (requires detachment).
888
+
889
+ **Usage:**
890
+ ```bash
891
+ dh studio2 resize --size <GB> [options]
892
+ ```
893
+
894
+ **Options:**
895
+ - `-s, --size <GB>` - **Required.** New size in GB
896
+ - `-y, --yes` - Skip confirmation
897
+ - `--user <username>` - User whose studio to resize (defaults to the current user; intended for testing/admin use)
898
+ - `--env <env>` - Target environment
899
+
900
+ **Examples:**
901
+ ```bash
902
+ # With confirmation
903
+ dh studio2 resize --size 200
904
+
905
+ # Skip confirmation
906
+ dh studio2 resize --size 200 -y
907
+
908
+ # Resize test user's studio (testing/admin)
909
+ dh studio2 resize --size 200 -y --user testuser1
910
+ ```
911
+
912
+ **Output:**
913
+ ```
914
+ Resize studio from 100GB to 200GB? [y/N]: y
915
+ ✓ Studio resize initiated: 100GB → 200GB
916
+ ```
917
+
918
+ **Requirements:**
919
+ - Studio must be detached
920
+ - New size must be larger than current size (no shrinking)
921
+ - Filesystem automatically expands on next attach
922
+
923
+ **Note:** You're billed for the new size immediately (~$0.08/GB-month).
924
+
925
+ ---
926
+
927
+ #### `dh studio2 reset`
928
+
929
+ Reset a stuck studio to `available` status (admin operation).
930
+
931
+ **Usage:**
932
+ ```bash
933
+ dh studio2 reset [options]
934
+ ```
935
+
936
+ **Options:**
937
+ - `-y, --yes` - Skip confirmation
938
+ - `--user <username>` - User whose studio to reset (defaults to the current user; intended for testing/admin use)
939
+ - `--env <env>` - Target environment
940
+
941
+ **Examples:**
942
+ ```bash
943
+ # Reset your own studio (with confirmation)
944
+ dh studio2 reset
945
+
946
+ # Skip confirmation
947
+ dh studio2 reset -y
948
+
949
+ # Reset test user's studio (testing/admin)
950
+ dh studio2 reset -y --user testuser1
951
+ ```
952
+
953
+ **Output:**
954
+ ```
955
+ Studio: vol-0123456789abcdef0
956
+ Current Status: attaching
957
+
958
+ Reset studio status to 'available'? [y/N]: y
959
+ ✓ Studio reset to 'available' status
960
+ Note: Manual cleanup may be required on engines
961
+ ```
962
+
963
+ **Use cases:**
964
+ - Studio stuck in `attaching` or `detaching`
965
+ - Attachment operation failed and didn't revert
966
+ - DynamoDB state out of sync with actual state
967
+
968
+ **Warning:** This only resets the DynamoDB state. If the volume is actually attached, you'll need to manually detach via AWS console or unmount on the engine.
969
+
970
+ ---
971
+
972
+ ## Common Workflows
973
+
974
+ ### Daily Development
975
+
976
+ ```bash
977
+ # Launch engine
978
+ dh engine2 launch dev-work --type cpu
979
+
980
+ # Add to SSH config
981
+ dh engine2 config-ssh
982
+
983
+ # Create studio (first time only)
984
+ dh studio2 create --size 100
985
+
986
+ # Attach studio
987
+ dh studio2 attach dev-work
988
+
989
+ # Connect with native SSH
990
+ ssh dev-work
991
+
992
+ # When done, detach and terminate
993
+ dh studio2 detach
994
+ dh engine2 terminate dev-work -y
995
+ ```
996
+
997
+ ### GPU Training with Coffee Lock
998
+
999
+ ```bash
1000
+ # Launch GPU engine
1001
+ dh engine2 launch training --type a10g
1002
+
1003
+ # Add to SSH config
1004
+ dh engine2 config-ssh
1005
+
1006
+ # Set coffee lock for long job
1007
+ dh engine2 coffee training 8h
1008
+
1009
+ # Attach and start work
1010
+ dh studio2 attach training
1011
+ ssh training
1012
+
1013
+ # Job runs without idle shutdown
1014
+ # When done:
1015
+ dh engine2 coffee training --cancel
1016
+ dh studio2 detach
1017
+ dh engine2 terminate training -y
1018
+ ```
1019
+
1020
+ ### Multi-Engine Development
1021
+
1022
+ ```bash
1023
+ # Launch multiple engines
1024
+ dh engine2 launch frontend --type cpu
1025
+ dh engine2 launch backend --type cpu
1026
+ dh engine2 launch ml --type t4
1027
+
1028
+ # Update SSH config for easy access
1029
+ dh engine2 config-ssh
1030
+
1031
+ # Now use direct SSH
1032
+ ssh frontend
1033
+ ssh backend
1034
+ ssh ml
1035
+ ```
1036
+
1037
+ ### Monitoring Idle Detection
1038
+
1039
+ ```bash
1040
+ # Check basic idle status
1041
+ dh engine2 status my-engine
1042
+
1043
+ # Check detailed sensor information
1044
+ dh engine2 status my-engine --detailed
1045
+
1046
+ # Shows all 4 sensors with confidence levels:
1047
+ # - SSH (HIGH)
1048
+ # - IDE (MEDIUM)
1049
+ # - Docker (MEDIUM)
1050
+ # - Coffee (HIGH)
1051
+ ```
1052
+
1053
+ ---
1054
+
1055
+ ## Error Handling
1056
+
1057
+ ### Common Errors
1058
+
1059
+ **"You already have a studio"**
1060
+ ```bash
1061
+ ✗ You already have a studio: vol-0123456789abcdef0
1062
+ Use 'dh studio2 delete' to remove it first
1063
+ ```
1064
+ Solution: Delete the existing studio, or keep using it.
1065
+
1066
+ **"Studio must be detached before deletion"**
1067
+ ```bash
1068
+ ✗ Studio must be detached before deletion
1069
+ Run: dh studio2 detach
1070
+ ```
1071
+ Solution: Detach studio first.
1072
+
1073
+ **"Studio is not available"**
1074
+ ```bash
1075
+ ✗ Studio is not available (status: attaching)
1076
+ ```
1077
+ Solution: Wait for the current operation to complete, or use `dh studio2 reset` if the studio is stuck.
1078
+
1079
+ **"Could not fetch API URL"**
1080
+ ```bash
1081
+ ✗ Could not fetch API URL from /dev/studio-manager/api-url
1082
+ ```
1083
+ Solution: Ensure you're authenticated to AWS and the environment is deployed.
1084
+
1085
+ **"Attachment failed"**
1086
+ Shows a detailed error with the failed step:
1087
+ ```bash
1088
+ ✗ Attachment failed: Mount filesystem timeout
1089
+
1090
+ Failed at step: mount_filesystem
1091
+ Error: SSM command timeout after 30s
1092
+ ```
1093
+ Solution: Check that the engine is ready and the SSM agent is running, then retry the attachment.
1094
+
1095
+ ---
1096
+
1097
+ ## Progress Tracking Features
1098
+
1099
+ The v2 implementation includes real-time progress tracking for long-running operations:
1100
+
1101
+ ### Launch Progress
1102
+ - Shows progress through 8 bootstrap stages
1103
+ - Real-time percentage completion
1104
+ - Stage timing information
1105
+ - Estimated time remaining
1106
+
1107
+ ### Attachment Progress
1108
+ - Shows progress through 6 attachment steps
1109
+ - Visual progress bar
1110
+ - Step-by-step updates
1111
+ - Detailed error reporting if failure occurs
1112
+
1113
+ ### Status Visibility
1114
+ - Real-time idle detector sensor states
1115
+ - Confidence levels for each sensor
1116
+ - Detailed activity information with `--detailed` flag
1117
+ - Clear indication of what's keeping the engine awake
1118
+
1119
+ ---
1120
+
1121
+ ## Idle Detection Architecture
1122
+
1123
+ The v2 implementation uses a modular sensor-based idle detector with confidence levels:
1124
+
1125
+ ### 4 Independent Sensors
1126
+
1127
+ 1. **SSH Sensor** (HIGH confidence)
1128
+    - Uses `who -u` to detect active sessions
1129
+    - Filters out system users
1130
+    - HIGH confidence: presence/absence is definitive
1131
+
1132
+ 2. **IDE Sensor** (MEDIUM confidence)
1133
+    - Detects VS Code/Cursor remote connections
1134
+    - Uses `ss -tanpo` to inspect TCP connections
1135
+    - 3 retries to avoid false positives
1136
+    - MEDIUM confidence: connections can be transient
1137
+
1138
+ 3. **Docker Sensor** (MEDIUM confidence)
1139
+    - Detects non-dev workload containers
1140
+    - Filters out dev containers, system images, transient patterns
1141
+    - MEDIUM confidence: heuristic-based filtering
1142
+
1143
+ 4. **Coffee Lock Sensor** (HIGH confidence)
1144
+    - Explicit user keep-alive via `/var/run/engine-coffee`
1145
+    - Timestamp-based expiration
1146
+    - HIGH confidence: user intent is clear
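The timestamp-based expiration of the coffee lock can be illustrated with a minimal sketch. It assumes the lock file holds a single Unix timestamp marking when the lock expires; the real contents of `/var/run/engine-coffee` are not documented here, and the function name `coffee_lock_active` is hypothetical.

```python
import os
import tempfile
import time


def coffee_lock_active(path, now=None):
    """Return True if a coffee lock file exists and has not expired.

    ASSUMPTION: the file holds one Unix timestamp (seconds) at which the
    lock expires. The actual file format is not specified in this document.
    """
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            expires_at = float(f.read().strip())
    except FileNotFoundError:
        return False  # no lock has been set
    except ValueError:
        return True  # unreadable lock: err on the side of staying awake
    return now < expires_at


# Usage with a temporary file standing in for /var/run/engine-coffee
with tempfile.TemporaryDirectory() as tmp:
    lock = os.path.join(tmp, "engine-coffee")
    with open(lock, "w") as f:
        f.write(str(1_000_000 + 8 * 3600))  # lock set at t=1,000,000 for 8h
    print(coffee_lock_active(lock, now=1_000_000 + 3600))      # True
    print(coffee_lock_active(lock, now=1_000_000 + 9 * 3600))  # False
```

Treating an unparsable file as an active lock is a choice made for this sketch, in keeping with the fail-safe approach the idle detector takes elsewhere.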
1147
+
1148
+ ### Decision Logic
1149
+
1150
+ Conservative fail-safe approach:
1151
+ ```
1152
+ if any_sensor_has_high_confidence_activity:
1153
+     return ACTIVE
1154
+ elif any_sensor_has_error:  # sensor errors are reported as LOW confidence
1155
+     return ACTIVE  # Fail safe - don't shut down on errors
1156
+ elif any_sensor_has_medium_confidence_activity:
1157
+     return ACTIVE
1158
+ else:
1159
+     return IDLE
1160
+ ```
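As a runnable illustration of the precedence above, here is a minimal sketch. The `SensorReading` and `Confidence` types and the `decide` function are hypothetical names chosen for this example, not the package's actual classes. It also assumes that a sensor error is reported as a reading with LOW confidence.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"  # assumption: a sensor that errored reports LOW confidence


@dataclass(frozen=True)
class SensorReading:
    name: str
    active: bool
    confidence: Confidence


def decide(readings):
    """Return "ACTIVE" or "IDLE" using the documented order of checks."""
    if any(r.active and r.confidence is Confidence.HIGH for r in readings):
        return "ACTIVE"
    if any(r.confidence is Confidence.LOW for r in readings):
        return "ACTIVE"  # fail safe: never shut down on a sensor error
    if any(r.active and r.confidence is Confidence.MEDIUM for r in readings):
        return "ACTIVE"
    return "IDLE"


quiet = [
    SensorReading("ssh", False, Confidence.HIGH),
    SensorReading("ide", False, Confidence.MEDIUM),
    SensorReading("docker", False, Confidence.MEDIUM),
    SensorReading("coffee", False, Confidence.HIGH),
]
print(decide(quiet))  # IDLE

# A Docker sensor error alone is enough to keep the engine awake.
errored = quiet[:2] + [SensorReading("docker", False, Confidence.LOW), quiet[3]]
print(decide(errored))  # ACTIVE
```

Because every branch except the last returns `ACTIVE`, the order mostly matters for reporting the reason for the current state rather than for the outcome.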
1161
+
1162
+ **Philosophy:** Better to waste a bit of compute than lose user work.
1163
+
1164
+ ### Visibility
1165
+
1166
+ Use `dh engine2 status <engine-name> --detailed` to see:
1167
+ - Current state (ACTIVE/IDLE)
1168
+ - Reason for current state
1169
+ - All 4 sensor states with confidence levels
1170
+ - Detailed activity information for each sensor
1171
+
1172
+ ---
1173
+
1174
+ ## Storage Tiers
1175
+
1176
+ | Storage | Path | Speed | Use Case |
1177
+ |---------|------|-------|----------|
1178
+ | Studios | `/studios/{user}/` | Fast (<1ms) | Personal code, configs, experiments |
1179
+ | Primordial | `/primordial/` | Medium (1-3ms) | Shared datasets, batch I/O |
1180
+ | S3 | `s3://` | Slow (~100ms) | Archives, raw data, final models |
1181
+
1182
+ **Primordial Drive** (shared EFS):
1183
+ - Automatically mounted at `/primordial/` during bootstrap
1184
+ - Intelligent-Tiering: $0.30/GB-month → $0.016/GB-month after 30 days
1185
+ - Available on all engines and batch jobs
1186
+ - Use for datasets shared across users
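To make the Intelligent-Tiering rates concrete, the following sketch estimates a monthly bill from the two per-GB prices quoted above. The split between "hot" and "cold" data is an input you would have to estimate yourself, and `monthly_cost` is a hypothetical helper, not part of the CLI.

```python
HOT_RATE = 0.30    # USD per GB-month, standard rate quoted above
COLD_RATE = 0.016  # USD per GB-month, rate quoted above after 30 days


def monthly_cost(hot_gb, cold_gb):
    """Estimate the monthly storage cost in USD for a hot/cold split."""
    return hot_gb * HOT_RATE + cold_gb * COLD_RATE


# 500 GB total, of which 400 GB has moved to the cheaper tier
print(f"${monthly_cost(100, 400):.2f}")  # $36.40
# The same 500 GB if all of it stays at the standard rate
print(f"${monthly_cost(500, 0):.2f}")    # $150.00
```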
1187
+
1188
+ ---
1189
+
1190
+ ## Technical Implementation
1191
+
1192
+ ### API Backend
1193
+ - **18 REST endpoints** via API Gateway + Lambda
1194
+ - **3 DynamoDB tables** for state management
1195
+ - **Optimistic locking** prevents race conditions
1196
+ - **Progress tracking** for all async operations stored in DynamoDB
1197
+
1198
+ ### Security
1199
+ - **IAM Identity Center (SSO)** for authentication
1200
+ - **SSM Session Manager** for SSH (no bastion hosts)
1201
+ - **Encrypted EBS volumes** for studios
1202
+ - **Least-privilege IAM roles** for engines
1203
+
1204
+ ### Monitoring
1205
+ - **CloudWatch metrics** for operations
1206
+ - **Slack notifications** (configurable per engine)
1207
+ - **Progress APIs** for real-time status
1208
+ - **Detailed logging** for debugging
1209
+
1210
+ ---
1211
+
1212
+ ## Migration Timeline
1213
+
1214
+ **Current Status:**
1215
+ - v2 implementation complete and tested
1216
+ - All commands available via `engine2`/`studio2`
1217
+ - v1 commands remain available via `engine`/`studio`
1218
+
1219
+ **Production Deployment:**
1220
+ 1. Deploy v2 to prod environment
1221
+ 2. Team testing period (1-2 weeks)
1222
+ 3. Promote `engine2` → `engine`, `studio2` → `studio`
1223
+ 4. Deprecate v1 commands
1224
+ 5. Remove v1 code after stabilization
1225
+
1226
+ **Why v2?**
1227
+ - Real-time progress eliminates "is it stuck?" questions
1228
+ - Detailed idle detector visibility prevents false shutdowns
1229
+ - Better error messages reduce debugging time
1230
+ - Click architecture enables better tooling integration