@synsci/cli-darwin-x64 1.1.58 → 1.1.60

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,519 @@
+ ---
+ name: tensorpool-gpu-cloud
+ description: On-demand GPU clusters and training jobs with a git-style interface. Use when you need multi-node GPU clusters (B200, H200, H100), persistent NFS storage, or batch training jobs with the TensorPool CLI.
+ version: 1.0.0
+ author: Synthetic Sciences
+ license: MIT
+ tags: [Infrastructure, GPU Cloud, Training, Clusters, Jobs, TensorPool, Multi-Node, Distributed Training, NFS Storage]
+ dependencies: [tensorpool]
+ ---
+
+ # TensorPool GPU Cloud
+
+ On-demand GPU clusters and git-style training jobs via the `tp` CLI. TensorPool provides multi-node GPU clusters with high-speed interconnects, persistent storage, and SLURM for distributed training.
+
+ ## When to Use TensorPool
+
+ **Use TensorPool when you:**
+ - Need on-demand GPU clusters with SSH access (single or multi-node)
+ - Run distributed training across multiple nodes (SLURM pre-installed)
+ - Want a git-style job interface: `tp job push` to submit, `tp job pull` to get results
+ - Need persistent NFS storage shared across cluster nodes
+ - Require B200, B300, H200, H100, L40S, or MI300X GPUs
+ - Want pay-per-second billing with no egress fees
+
+ **Key features:**
+ - **GPU variety**: B300, B200, H200, H100, L40S, MI300X, CPU instances
+ - **Multi-node clusters**: 8xB200 and 8xH200 with SLURM + InfiniBand
+ - **Jobs**: Git-style `tp job push/pull/listen` for batch experiments
+ - **Persistent storage**: Shared NFS volumes (300 GB/s aggregate) or S3-compatible object storage
+ - **Simple pricing**: Per-second billing, H100 at $1.99/hr, H200 at $2.99/hr, B200 at $4.99/hr
+
+ **Use alternatives instead:**
+ - **Tinker**: For managed SFT/fine-tuning (no infrastructure management)
+ - **Prime Intellect Lab**: For hosted RL training with environments
+ - **Modal**: For serverless, auto-scaling GPU workloads
+ - **Lambda Labs**: For dedicated instances with persistent filesystems
+ - **SkyPilot**: For multi-cloud orchestration and cost optimization
+
+ ### Decision Matrix
+
+ | Task | Platform |
+ |------|----------|
+ | SFT / LoRA fine-tuning | Tinker (default) |
+ | Hosted RL with environments | Prime Intellect Lab |
+ | On-demand GPU clusters with SSH | **TensorPool** |
+ | Batch training jobs (git-style) | **TensorPool** |
+ | Multi-node distributed training | **TensorPool** or Lambda (1-Click Clusters) |
+ | Serverless auto-scaling | Modal |
+ | Multi-cloud cost optimization | SkyPilot |
+
+ ---
+
+ ## Quick Start
+
+ ### Installation
+
+ ```bash
+ pip install tensorpool
+ ```
+
+ ### Authentication
+
+ ```bash
+ # Set API key (synced automatically via the SynSci dashboard)
+ export TENSORPOOL_KEY="your_api_key_here"
+
+ # Verify
+ [ -n "$TENSORPOOL_KEY" ] && echo "set" || echo "not set"
+ ```
+
+ If connected via the Synthetic Sciences dashboard, `TENSORPOOL_KEY` is injected automatically.
+
+ ### Create Your First Cluster
+
+ ```bash
+ # Single H100
+ tp cluster create -i ~/.ssh/id_ed25519.pub -t 1xH100
+
+ # Check status
+ tp cluster info <cluster_id>
+
+ # SSH in
+ tp ssh <instance_id>
+
+ # Destroy when done
+ tp cluster destroy <cluster_id>
+ ```
+
+ ### Submit a Training Job
+
+ ```bash
+ # Initialize job config
+ tp job init
+
+ # Edit tp.config.toml, then push
+ tp job push tp.config.toml
+
+ # Stream logs
+ tp job listen <job_id>
+
+ # Download results
+ tp job pull <job_id>
+ ```
+
+ ---
+
+ ## Instance Types
+
+ | Instance Type | Multi-Node Support |
+ |---------------|-------------------|
+ | `1xB300` / `2xB300` / `4xB300` / `8xB300` | No |
+ | `1xB200` / `2xB200` / `4xB200` / `8xB200` | **Yes** (8xB200) |
+ | `1xH200` / `2xH200` / `4xH200` / `8xH200` | **Yes** (8xH200) |
+ | `1xH100` / `2xH100` / `4xH100` / `8xH100` | No |
+ | `1xL40S` | No |
+ | `32xCPU` / `64xCPU` | No |
+
+ ### Pricing (per GPU/hour)
+
+ | GPU | Price |
+ |-----|-------|
+ | B300 SXM | $5.49/hr |
+ | B200 SXM | $4.99/hr |
+ | H200 SXM | $2.99/hr |
+ | H100 SXM | $1.99/hr |
+ | L40S | $1.49/hr |
+ | CPU | $0.015/hr |
+
+ All charges are prorated to the second (for example, a 90-minute `1xH100` run costs 1.5 × $1.99 ≈ $2.99).
+
+ ---
+
+ ## Clusters
+
+ ### Single-Node Clusters
+
+ ```bash
+ # Various GPU configs
+ tp cluster create -i ~/.ssh/id_ed25519.pub -t 1xH100
+ tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xH200
+ tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xB200
+ tp cluster create -i ~/.ssh/id_ed25519.pub -t 1xL40S
+
+ # With a custom name
+ tp cluster create -i ~/.ssh/id_ed25519.pub -t 1xH100 --name my-cluster
+ ```
+
+ ### Multi-Node Clusters
+
+ Multi-node clusters come with **SLURM preinstalled**. Only `8xH200` and `8xB200` support multi-node.
+
+ ```bash
+ # 2-node cluster (16 GPUs total)
+ tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xH200 -n 2
+
+ # 4-node cluster (32 GPUs total)
+ tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xB200 -n 4
+ ```
+
+ **Multi-node architecture:**
+ - **Jumphost**: `{cluster_id}-jumphost` — SLURM login/controller, public IP
+ - **Worker nodes**: `{cluster_id}-0`, `{cluster_id}-1`, etc. — private IPs only
+
+ ```bash
+ # SSH into the jumphost first
+ tp ssh <jumphost-instance-id>
+
+ # From the jumphost, access workers
+ ssh <cluster_id>-0
+ ssh <cluster_id>-1
+ ```
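+
+ If you'd rather use plain `ssh`/`scp` than `tp ssh`, a standard `ProxyJump` entry in `~/.ssh/config` reaches the private workers through the jumphost. A minimal sketch, not TensorPool-specific; the host aliases, user, and addresses are placeholders to fill in from `tp cluster info <cluster_id>`:
+
+ ```bash
+ # Hypothetical ~/.ssh/config entries; aliases, user, and IP are placeholders
+ cat >> ~/.ssh/config <<'EOF'
+ Host tp-jumphost
+     HostName <jumphost-public-ip>
+     User <ssh-user>
+     IdentityFile ~/.ssh/id_ed25519
+
+ Host tp-worker-0
+     HostName <cluster_id>-0
+     ProxyJump tp-jumphost
+     IdentityFile ~/.ssh/id_ed25519
+ EOF
+
+ # The jumphost resolves the worker's private name, so this works end-to-end:
+ scp tp-worker-0:results.json .
+ ```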
+
+ ### Cluster Management
+
+ ```bash
+ tp cluster list                  # List all clusters
+ tp cluster list --org            # List organization clusters
+ tp cluster info <cluster_id>     # Detailed cluster info
+ tp cluster edit <cluster_id> --name "new-name"
+ tp cluster edit <cluster_id> --deletion-protection true
+ tp cluster destroy <cluster_id>  # Terminate cluster
+ ```
+
+ ### Cluster Statuses
+
+ `PENDING` → `PROVISIONING` → `CONFIGURING` → `RUNNING` → `DESTROYING` → `DESTROYED`
+
+ If any instance fails, the cluster shows `FAILED`.
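+
+ To block until a new cluster is usable, you can poll `tp cluster info`. A rough sketch; it assumes the status string appears verbatim in the command's output, which may not match the real format:
+
+ ```bash
+ # Poll every 15s until the cluster reports RUNNING
+ # (assumption: status appears as plain text in `tp cluster info` output)
+ until tp cluster info <cluster_id> | grep -q RUNNING; do sleep 15; done
+ ```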
+
+ ---
+
+ ## Jobs
+
+ Git-style interface for running training experiments on GPUs. Pay only for the time your job runs.
+
+ ### Job Configuration (tp.config.toml)
+
+ ```toml
+ commands = [
+     "pip install -r requirements.txt",
+     "python train.py --epochs 100",
+ ]
+
+ instance_type = "1xH100"
+
+ outputs = [
+     "checkpoints/",
+     "model.pth",
+     "results.json",
+ ]
+
+ ignore = [
+     ".venv",
+     "venv/",
+     "__pycache__/",
+     ".git",
+     "*.pyc",
+ ]
+ ```
+
+ ### Job Commands
+
+ ```bash
+ tp job init                        # Create tp.config.toml
+ tp job push tp.config.toml         # Submit job
+ tp job list                        # List your jobs
+ tp job list --org                  # List org jobs
+ tp job info <job_id>               # Job details
+ tp job listen <job_id>             # Stream real-time logs
+ tp job pull <job_id>               # Download output files
+ tp job pull <job_id> --force       # Overwrite existing files
+ tp job cancel <job_id>             # Cancel running job
+ tp job cancel <job_id> --no-input  # Skip confirmation
+ ```
+
+ ### Job Statuses
+
+ `Pending` → `Running` → `Completed` / `Error` / `Failed` / `Canceled`
+
+ - **Error**: User-level (non-zero exit code) — check logs
+ - **Failed**: System-level (node/GPU failure) — TensorPool investigates
+
+ ### Multiple Experiments
+
+ ```bash
+ # Create multiple configs
+ tp job init  # → tp.config.toml (rename to tp.baseline.toml)
+ tp job init  # → tp.config1.toml (rename to tp.experiment.toml)
+
+ # Run different experiments
+ tp job push tp.baseline.toml
+ tp job push tp.experiment.toml
+ ```
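+
+ With one config file per experiment, a small shell loop submits the whole sweep; this sketch uses only the documented `tp job push`:
+
+ ```bash
+ # Submit every tp.*.toml in the current directory
+ # (note: the glob also matches tp.config.toml if it still exists)
+ for cfg in tp.*.toml; do
+     tp job push "$cfg"
+ done
+ ```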
+
+ ---
+
+ ## Storage
+
+ ### Shared Storage Volumes (NFS)
+
+ High-performance NFS for multi-node clusters, with up to 300 GB/s aggregate read throughput.
+
+ ```bash
+ # Create a 500 GB shared volume
+ tp storage create -t shared -s 500 --name training-data
+
+ # Attach to a cluster
+ tp cluster attach <cluster_id> <storage_id>
+
+ # Access on the cluster at /mnt/shared-<storage_id>
+
+ # Detach
+ tp cluster detach <cluster_id> <storage_id>
+
+ # Destroy
+ tp storage destroy <storage_id>
+ ```
+
+ **Shared storage**: Multi-node only (2+ nodes), $100/TB/month, POSIX compliant.
+
+ ### Object Storage (S3-compatible)
+
+ ```bash
+ # Create an object storage bucket
+ tp storage create -t object --name models
+
+ # Attach to any cluster type
+ tp cluster attach <cluster_id> <storage_id>
+
+ # Mounted at /mnt/object-<storage_id> (FUSE)
+ # Prefer boto3/rclone over the FUSE mount for performance (sketch below)
+ ```
+
+ **Object storage**: All cluster types, $20/TB/month, globally replicated, no ingress/egress fees. Not POSIX compliant.
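+
+ For bulk transfers, talking to the bucket's S3 API directly is much faster than the FUSE mount. A sketch with `rclone`; the remote name, endpoint, and keys are placeholders rather than documented TensorPool values, so check `tp storage info <storage_id>` for the real ones:
+
+ ```bash
+ # One-time remote setup (endpoint and keys are placeholders)
+ rclone config create tensorpool s3 \
+     provider Other \
+     access_key_id "$ACCESS_KEY" \
+     secret_access_key "$SECRET_KEY" \
+     endpoint "https://<object-storage-endpoint>"
+
+ # Parallel bulk copy instead of reading through the FUSE mount
+ rclone copy ./checkpoints tensorpool:<bucket>/checkpoints --transfers 16
+ ```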
+
+ ### Storage Commands
+
+ ```bash
+ tp storage create -t <type> [-s <size>] [--name <name>]
+ tp storage list
+ tp storage info <storage_id>
+ tp storage edit <storage_id> --name "new-name"
+ tp storage edit <storage_id> --deletion-protection true
+ tp storage destroy <storage_id>
+ ```
+
+ ---
+
+ ## SSH Keys
+
+ ```bash
+ # Generate a key if needed
+ ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
+
+ # Use it when creating clusters
+ tp cluster create -i ~/.ssh/id_ed25519.pub -t 1xH100
+
+ # Connect to a cluster
+ tp ssh <instance_id>
+ ```
+
+ ---
+
+ ## Common Workflows
+
+ ### Workflow 1: Single-Node Training Job
+
+ ```bash
+ # 1. Create a job config
+ tp job init
+
+ # 2. Configure tp.config.toml
+ #    commands = ["pip install -r requirements.txt", "python train.py"]
+ #    instance_type = "1xH100"
+ #    outputs = ["checkpoints/", "model.pth"]
+
+ # 3. Submit
+ tp job push tp.config.toml
+
+ # 4. Monitor
+ tp job listen <job_id>
+
+ # 5. Get results
+ tp job pull <job_id>
+ ```
+
+ ### Workflow 2: Multi-Node Distributed Training
+
+ ```bash
+ # 1. Create a 4-node cluster with storage
+ tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xH200 -n 4
+ tp storage create -t shared -s 1000 --name dataset
+ tp cluster attach <cluster_id> <storage_id>
+
+ # 2. SSH into the jumphost
+ tp ssh <jumphost-instance-id>
+
+ # 3. Upload data to shared storage
+ cd /mnt/shared-<storage_id>
+ #    rsync, wget, or `huggingface-cli download` your dataset here
+
+ # 4. Submit the SLURM job: one srun task per node; torchrun itself spawns
+ #    the 8 per-GPU workers. MASTER_ADDR is the first worker node.
+ MASTER_ADDR=<cluster_id>-0
+ srun --nodes=4 --ntasks-per-node=1 --gpus-per-node=8 \
+     torchrun --nnodes=4 --nproc_per_node=8 \
+     --rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:29500" \
+     train.py
+
+ # 5. Clean up
+ tp cluster detach <cluster_id> <storage_id>
+ tp cluster destroy <cluster_id>
+ ```
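+
+ The interactive `srun` above also works as a batch script, which survives SSH disconnects. A sketch under the same assumptions (4 nodes, 8 GPUs each); the preinstalled SLURM's partition defaults are assumed to apply:
+
+ ```bash
+ cat > train.sbatch <<'EOF'
+ #!/bin/bash
+ #SBATCH --nodes=4
+ #SBATCH --ntasks-per-node=1
+ #SBATCH --gpus-per-node=8
+
+ # First node in the allocation acts as the rendezvous host
+ MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
+ srun torchrun --nnodes=4 --nproc_per_node=8 \
+     --rdzv_backend=c10d --rdzv_endpoint="$MASTER_ADDR:29500" \
+     train.py
+ EOF
+ sbatch train.sbatch
+ ```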
+
+ ### Workflow 3: Interactive Development
+
+ ```bash
+ # 1. Create a single-node cluster
+ tp cluster create -i ~/.ssh/id_ed25519.pub -t 1xH100 --name dev-box
+
+ # 2. SSH in and iterate
+ tp ssh <instance_id>
+ git clone <repo>
+ pip install -r requirements.txt
+ python train.py
+
+ # 3. Destroy when done
+ tp cluster destroy <cluster_id>
+ ```
+
+ ---
+
+ ## Troubleshooting
+
+ ### Common Issues
+
+ **1. `TENSORPOOL_KEY` not set**
+ ```bash
+ [ -n "$TENSORPOOL_KEY" ] && echo "set" || echo "not set"
+ # If not set, connect via the SynSci dashboard or export it manually
+ ```
+
+ **2. Cluster stuck in PENDING/PROVISIONING**
+ ```bash
+ # Check cluster status
+ tp cluster info <cluster_id>
+ # Try a different instance type or wait for capacity
+ ```
+
+ **3. Can't SSH into a cluster**
+ - Wait for the status to reach `RUNNING` (can take a few minutes)
+ - Verify an SSH key was provided at cluster creation
+ - For multi-node: SSH into the jumphost first, then access workers
+
+ **4. Multi-node workers not accessible**
+ ```bash
+ # Workers have private IPs only — must go through the jumphost
+ tp ssh <jumphost-instance-id>
+ ssh <cluster_id>-0  # from the jumphost
+ ```
+
+ **5. Storage attachment fails**
+ - Shared storage: only multi-node clusters (2+ nodes)
+ - Object storage: works on all cluster types
+ - Check that the storage status is `READY` before attaching
+
+ **6. Job stuck in Pending**
+ ```bash
+ tp job info <job_id>
+ # Check instance type availability
+ tp job cancel <job_id>  # Cancel and retry if needed
+ ```
+
+ **7. Job Error (non-zero exit code)**
+ ```bash
+ # Stream logs to see what failed
+ tp job listen <job_id>
+ # Fix the script, then re-push
+ tp job push tp.config.toml
+ ```
+
+ **8. Object storage slow for small files**
+ - Object storage has per-request overhead (HTTP calls)
+ - Use `boto3` or `rclone` instead of the FUSE mount (see the rclone sketch in the Storage section)
+ - Don't set up Python venvs on object storage (thousands of small files)
+
+ ---
+
+ ## Agent Usage Instructions
+
+ When the `synsc` agent loads this skill for a user task:
+
+ 1. **Check credentials first**: Verify `TENSORPOOL_KEY` is set
+ 2. **Determine cluster vs. job**: Jobs for batch experiments, clusters for interactive work
+ 3. **Select an instance type**: Match the GPU to the workload (H100 for training, L40S for inference, B200 for the largest models)
+ 4. **ALWAYS get user approval before creating resources**: Present the instance type, estimated cost/hour, and expected duration. TensorPool bills by the second — the user manages their own billing. Wait for explicit approval.
+ 5. **For jobs**: Create `tp.config.toml`, show it to the user, get approval, then `tp job push`
+ 6. **For clusters**: Show the `tp cluster create` command with instance type and cost, and get approval first
+ 7. **Monitor**: Use `tp job listen` or `tp ssh` to track progress
+ 8. **Clean up**: Always destroy clusters and detach storage when done
+
+ ### Cost Awareness
+
+ TensorPool charges per GPU/hour, prorated to the second:
+ - B200: $4.99/GPU/hr → 8xB200 = ~$40/hr per node
+ - H200: $2.99/GPU/hr → 8xH200 = ~$24/hr per node
+ - H100: $1.99/GPU/hr → 8xH100 = ~$16/hr per node
+ - L40S: $1.49/GPU/hr
+ - Storage: Shared $100/TB/month, Object $20/TB/month
+
+ **ALWAYS present the estimated cost before creating any resource.**
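+
+ A worked example of the proration arithmetic, using the list prices above (illustrative only):
+
+ ```bash
+ # 2 nodes x 8 GPUs x $2.99/GPU/hr for 8xH200
+ echo "2 * 8 * 2.99" | bc              # 47.84  ($/hr)
+ echo "scale=4; (2*8*2.99)/3600" | bc  # .0132  ($/s, the billing granularity)
+ ```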
+
+ ### Example Agent Workflow
+
+ ```
+ User: "Set up a 2-node H200 cluster for distributed training"
+
+ Agent steps:
+ 1. Load skill: tensorpool-gpu-cloud
+ 2. Check TENSORPOOL_KEY is set
+ 3. Present cost estimate: 2x 8xH200 = $47.84/hr (~$0.80/min)
+ 4. Wait for explicit user approval
+ 5. tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xH200 -n 2
+ 6. Wait for RUNNING status
+ 7. tp ssh <jumphost-instance-id>
+ 8. Help user with training setup
+ 9. Remind user to destroy the cluster when done
+ ```
+
+ ---
+
+ ## Quick Reference
+
+ | Command | Description |
+ |---------|-------------|
+ | `tp cluster create -i <pubkey> -t <type> [-n <nodes>]` | Create GPU cluster |
+ | `tp cluster list` | List clusters |
+ | `tp cluster info <id>` | Cluster details |
+ | `tp cluster destroy <id>` | Terminate cluster |
+ | `tp cluster attach <cluster_id> <storage_id>` | Attach storage |
+ | `tp cluster detach <cluster_id> <storage_id>` | Detach storage |
+ | `tp job init` | Create job config |
+ | `tp job push <config>` | Submit training job |
+ | `tp job list` | List jobs |
+ | `tp job info <id>` | Job details |
+ | `tp job listen <id>` | Stream job logs |
+ | `tp job pull <id>` | Download outputs |
+ | `tp job cancel <id>` | Cancel job |
+ | `tp storage create -t <type> [-s <size>]` | Create storage |
+ | `tp storage list` | List storage |
+ | `tp storage destroy <id>` | Delete storage |
+ | `tp ssh <instance_id>` | SSH to instance |
+ | `tp me` | Account info |
+
+ ## Resources
+
+ - **Documentation**: https://docs.tensorpool.dev
+ - **Dashboard**: https://dashboard.tensorpool.dev
+ - **Pricing**: https://tensorpool.dev/pricing
+ - **Community**: https://tensorpool.dev/slack
+ - **Support**: team@tensorpool.dev
package/bin/synsc CHANGED
Binary file
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@synsci/cli-darwin-x64",
-   "version": "1.1.58",
+   "version": "1.1.60",
    "os": [
      "darwin"
    ],