@synsci/cli-darwin-x64 1.1.58 → 1.1.60
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/bin/skills/grpo-rl-training/README.md +1 -1
- package/bin/skills/hugging-face-evaluation/examples/.env.example +7 -0
- package/bin/skills/hugging-face-evaluation/scripts/test_extraction.py +0 -0
- package/bin/skills/hugging-face-model-trainer/scripts/convert_to_gguf.py +0 -0
- package/bin/skills/hugging-face-model-trainer/scripts/estimate_cost.py +0 -0
- package/bin/skills/hugging-face-paper-publisher/scripts/paper_manager.py +0 -0
- package/bin/skills/hugging-face-trackio/.claude-plugin/plugin.json +19 -0
- package/bin/skills/modal/SKILL.md +316 -275
- package/bin/skills/modal/references/advanced-patterns.md +598 -0
- package/bin/skills/modal/references/examples-catalog.md +423 -0
- package/bin/skills/prime-intellect-lab/README.md +69 -0
- package/bin/skills/prime-intellect-lab/SKILL.md +598 -0
- package/bin/skills/prime-intellect-lab/templates/basic_rl_training.toml +82 -0
- package/bin/skills/tensorpool/SKILL.md +519 -0
- package/bin/synsc +0 -0
- package/package.json +1 -1
- package/bin/skills/modal/references/advanced-usage.md +0 -503
package/bin/skills/tensorpool/SKILL.md
@@ -0,0 +1,519 @@
---
name: tensorpool-gpu-cloud
description: On-demand GPU clusters and training jobs with git-style interface. Use when you need multi-node GPU clusters (B200, H200, H100), persistent NFS storage, or batch training jobs with the TensorPool CLI.
version: 1.0.0
author: Synthetic Sciences
license: MIT
tags: [Infrastructure, GPU Cloud, Training, Clusters, Jobs, TensorPool, Multi-Node, Distributed Training, NFS Storage]
dependencies: [tensorpool]
---

# TensorPool GPU Cloud

On-demand GPU clusters and git-style training jobs via the `tp` CLI. TensorPool provides multi-node GPU clusters with high-speed interconnects, persistent storage, and SLURM for distributed training.

## When to Use TensorPool

**Use TensorPool when:**
- Need on-demand GPU clusters with SSH access (single or multi-node)
- Running distributed training across multiple nodes (SLURM pre-installed)
- Want git-style job interface: `tp job push` to submit, `tp job pull` to get results
- Need persistent NFS storage shared across cluster nodes
- Require B200, B300, H200, H100, L40S, or MI300X GPUs
- Want pay-per-second billing with no egress fees

**Key features:**
- **GPU variety**: B300, B200, H200, H100, L40S, MI300X, CPU instances
- **Multi-node clusters**: 8xB200 and 8xH200 with SLURM + InfiniBand
- **Jobs**: Git-style `tp job push/pull/listen` for batch experiments
- **Persistent storage**: Shared NFS volumes (300 GB/s aggregate) or S3-compatible object storage
- **Simple pricing**: Per-second billing, H100 at $1.99/hr, H200 at $2.99/hr, B200 at $4.99/hr

**Use alternatives instead:**
- **Tinker**: For managed SFT/fine-tuning (no infrastructure management)
- **Prime Intellect Lab**: For hosted RL training with environments
- **Modal**: For serverless, auto-scaling GPU workloads
- **Lambda Labs**: For dedicated instances with persistent filesystems
- **SkyPilot**: For multi-cloud orchestration and cost optimization

### Decision Matrix

| Task | Platform |
|------|----------|
| SFT / LoRA fine-tuning | Tinker (default) |
| Hosted RL with environments | Prime Intellect Lab |
| On-demand GPU clusters with SSH | **TensorPool** |
| Batch training jobs (git-style) | **TensorPool** |
| Multi-node distributed training | **TensorPool** or Lambda (1-Click Clusters) |
| Serverless auto-scaling | Modal |
| Multi-cloud cost optimization | SkyPilot |

---

## Quick Start

### Installation

```bash
pip install tensorpool
```

### Authentication

```bash
# Set API key (synced automatically via SynSci dashboard)
export TENSORPOOL_KEY="your_api_key_here"

# Verify
[ -n "$TENSORPOOL_KEY" ] && echo "set" || echo "not set"
```

If connected via the Synthetic Sciences dashboard, `TENSORPOOL_KEY` is injected automatically.

### Create Your First Cluster

```bash
# Single H100
tp cluster create -i ~/.ssh/id_ed25519.pub -t 1xH100

# Check status
tp cluster info <cluster_id>

# SSH in
tp ssh <instance_id>

# Destroy when done
tp cluster destroy <cluster_id>
```

### Submit a Training Job

```bash
# Initialize job config
tp job init

# Edit tp.config.toml, then push
tp job push tp.config.toml

# Stream logs
tp job listen <job_id>

# Download results
tp job pull <job_id>
```

---

## Instance Types

| Instance Type | Multi-Node Support |
|---------------|-------------------|
| `1xB300` / `2xB300` / `4xB300` / `8xB300` | No |
| `1xB200` / `2xB200` / `4xB200` / `8xB200` | **Yes** (8xB200) |
| `1xH200` / `2xH200` / `4xH200` / `8xH200` | **Yes** (8xH200) |
| `1xH100` / `2xH100` / `4xH100` / `8xH100` | No |
| `1xL40S` | No |
| `32xCPU` / `64xCPU` | No |

### Pricing (per GPU/hour)

| GPU | Price |
|-----|-------|
| B300 SXM | $5.49/hr |
| B200 SXM | $4.99/hr |
| H200 SXM | $2.99/hr |
| H100 SXM | $1.99/hr |
| L40S | $1.49/hr |
| CPU | $0.015/hr |

All charges prorated to the second.
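
Since billing is prorated to the second, estimating a run's cost is simple arithmetic against the table above. For example, a two-hour run on a single 8xH100 node:

```bash
# 8 GPUs x $1.99/GPU-hr x 2 hr
awk 'BEGIN { printf "$%.2f\n", 8 * 1.99 * 2 }'   # prints $31.84
```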

---

## Clusters

### Single-Node Clusters

```bash
# Various GPU configs
tp cluster create -i ~/.ssh/id_ed25519.pub -t 1xH100
tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xH200
tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xB200
tp cluster create -i ~/.ssh/id_ed25519.pub -t 1xL40S

# With custom name
tp cluster create -i ~/.ssh/id_ed25519.pub -t 1xH100 --name my-cluster
```

### Multi-Node Clusters

Multi-node clusters come with **SLURM preinstalled**. Only `8xH200` and `8xB200` support multi-node.

```bash
# 2-node cluster (16 GPUs total)
tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xH200 -n 2

# 4-node cluster (32 GPUs total)
tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xB200 -n 4
```

**Multi-node architecture:**
- **Jumphost**: `{cluster_id}-jumphost` — SLURM login/controller, public IP
- **Worker nodes**: `{cluster_id}-0`, `{cluster_id}-1`, etc. — private IPs only

```bash
# SSH into jumphost first
tp ssh <jumphost-instance-id>

# From jumphost, access workers
ssh <cluster_id>-0
ssh <cluster_id>-1
```
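
Before launching anything expensive on a fresh multi-node cluster, it helps to confirm SLURM actually sees every worker. A quick sketch using standard SLURM commands, shown for the 2-node example above:

```bash
# From the jumphost: list the nodes SLURM knows about and their state
sinfo

# Run a trivial task on every node to confirm scheduling and connectivity
srun --nodes=2 --ntasks-per-node=1 hostname
```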

### Cluster Management

```bash
tp cluster list                    # List all clusters
tp cluster list --org              # List organization clusters
tp cluster info <cluster_id>       # Detailed cluster info
tp cluster edit <cluster_id> --name "new-name"
tp cluster edit <cluster_id> --deletion-protection true
tp cluster destroy <cluster_id>    # Terminate cluster
```

### Cluster Statuses

`PENDING` → `PROVISIONING` → `CONFIGURING` → `RUNNING` → `DESTROYING` → `DESTROYED`

If any instance fails, the cluster shows as `FAILED`.
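
Provisioning can take several minutes, so a simple way to wait for `RUNNING` is to poll `tp cluster info`, for example:

```bash
# Re-run the status check every 30 seconds until the cluster reaches RUNNING
watch -n 30 tp cluster info <cluster_id>
```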

---

## Jobs

Git-style interface for running training experiments on GPUs. Pay only for the time your job runs.

### Job Configuration (tp.config.toml)

```toml
commands = [
    "pip install -r requirements.txt",
    "python train.py --epochs 100",
]

instance_type = "1xH100"

outputs = [
    "checkpoints/",
    "model.pth",
    "results.json",
]

ignore = [
    ".venv",
    "venv/",
    "__pycache__/",
    ".git",
    "*.pyc",
]
```

### Job Commands

```bash
tp job init                        # Create tp.config.toml
tp job push tp.config.toml         # Submit job
tp job list                        # List your jobs
tp job list --org                  # List org jobs
tp job info <job_id>               # Job details
tp job listen <job_id>             # Stream real-time logs
tp job pull <job_id>               # Download output files
tp job pull <job_id> --force       # Overwrite existing files
tp job cancel <job_id>             # Cancel running job
tp job cancel <job_id> --no-input  # Skip confirmation
```

### Job Statuses

`Pending` → `Running` → `Completed` / `Error` / `Failed` / `Canceled`

- **Error**: User-level (non-zero exit code) — check logs
- **Failed**: System-level (node/GPU failure) — TensorPool investigates

### Multiple Experiments

```bash
# Create multiple configs
tp job init   # → tp.config.toml (rename to tp.baseline.toml)
tp job init   # → tp.config1.toml (rename to tp.experiment.toml)

# Run different experiments
tp job push tp.baseline.toml
tp job push tp.experiment.toml
```
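
Variant configs can also be written by hand and submitted in a loop instead of renaming `tp job init` output each time. A sketch of a small hypothetical sweep (the `--lr` flag belongs to your own `train.py`, not to TensorPool):

```bash
# Write a variant config directly (same schema as tp.config.toml above)
cat > tp.lr-3e-4.toml <<'EOF'
commands = [
    "pip install -r requirements.txt",
    "python train.py --lr 3e-4",
]
instance_type = "1xH100"
outputs = ["checkpoints/", "results.json"]
EOF

# Push every config in the directory as a separate job
for cfg in tp.*.toml; do
    tp job push "$cfg"
done
```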

---

## Storage

### Shared Storage Volumes (NFS)

High-performance NFS for multi-node clusters. Up to 300 GB/s aggregate read throughput.

```bash
# Create 500GB shared volume
tp storage create -t shared -s 500 --name training-data

# Attach to cluster
tp cluster attach <cluster_id> <storage_id>

# Access on cluster at /mnt/shared-<storage_id>

# Detach
tp cluster detach <cluster_id> <storage_id>

# Destroy
tp storage destroy <storage_id>
```

**Shared storage**: Multi-node only (2+ nodes), $100/TB/month, POSIX compliant.

### Object Storage (S3-compatible)

```bash
# Create object storage bucket
tp storage create -t object --name models

# Attach to any cluster type
tp cluster attach <cluster_id> <storage_id>

# Mount at /mnt/object-<storage_id> (FUSE)
# Prefer boto3/rclone over the FUSE mount for performance
```

**Object storage**: All cluster types, $20/TB/month, globally replicated, no ingress/egress fees. Not POSIX compliant.
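
For bulk transfers, a sketch of the `rclone` route suggested above. It assumes you have already configured an rclone remote named `tp-object` with the bucket's S3 endpoint and access keys (wherever TensorPool exposes the bucket credentials, e.g. the dashboard); the remote name and paths are placeholders:

```bash
# Upload a checkpoint directory to the bucket with parallel transfers
rclone copy ./checkpoints tp-object:models/checkpoints --transfers 16 --progress

# Pull results back down later
rclone copy tp-object:models/checkpoints ./checkpoints --progress
```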

### Storage Commands

```bash
tp storage create -t <type> [-s <size>] [--name <name>]
tp storage list
tp storage info <storage_id>
tp storage edit <storage_id> --name "new-name"
tp storage edit <storage_id> --deletion-protection true
tp storage destroy <storage_id>
```

---

## SSH Keys

```bash
# Generate if needed
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

# Use when creating clusters
tp cluster create -i ~/.ssh/id_ed25519.pub -t 1xH100

# Connect to cluster
tp ssh <instance_id>
```
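
To reach a multi-node worker in one hop from a local machine, OpenSSH's jump-host option works with the same key. A sketch, assuming the login user is `ubuntu` (check your cluster's actual user) and the jumphost IP comes from `tp cluster info`:

```bash
# One-hop SSH from your laptop to worker 0 via the jumphost
ssh -i ~/.ssh/id_ed25519 -J ubuntu@<jumphost-public-ip> ubuntu@<cluster_id>-0
```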

---

## Common Workflows

### Workflow 1: Single-Node Training Job

```bash
# 1. Create job config
tp job init

# 2. Configure tp.config.toml
#    commands = ["pip install -r requirements.txt", "python train.py"]
#    instance_type = "1xH100"
#    outputs = ["checkpoints/", "model.pth"]

# 3. Submit
tp job push tp.config.toml

# 4. Monitor
tp job listen <job_id>

# 5. Get results
tp job pull <job_id>
```

### Workflow 2: Multi-Node Distributed Training

```bash
# 1. Create 4-node cluster with storage
tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xH200 -n 4
tp storage create -t shared -s 1000 --name dataset
tp cluster attach <cluster_id> <storage_id>

# 2. SSH into jumphost
tp ssh <jumphost-instance-id>

# 3. Upload data to shared storage
cd /mnt/shared-<storage_id>
# rsync, wget, or HF download your dataset here

# 4. Submit SLURM job: one launcher task per node; torchrun spawns the 8 per-GPU workers.
#    MASTER_ADDR must resolve to the first node in the allocation (see the sbatch sketch below).
srun --nodes=4 --ntasks-per-node=1 --gpus-per-node=8 \
    torchrun --nnodes=4 --nproc_per_node=8 \
    --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29500 \
    train.py

# 5. Clean up
tp cluster detach <cluster_id> <storage_id>
tp cluster destroy <cluster_id>
```
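
For repeatable multi-node runs, the same launch is easier to manage as a batch script, where SLURM sets `SLURM_JOB_NODELIST` for you. A minimal sketch, assuming torchrun is installed on the workers and `train.py` reads the standard torchrun environment variables:

```bash
#!/bin/bash
#SBATCH --job-name=ddp-train
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# The first node in the allocation acts as the rendezvous host
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# One torchrun launcher per node; each spawns 8 per-GPU workers
srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:$MASTER_PORT" \
    train.py
```

Submit it from the jumphost with `sbatch train.sbatch` and check progress with `squeue`.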

### Workflow 3: Interactive Development

```bash
# 1. Create single-node cluster
tp cluster create -i ~/.ssh/id_ed25519.pub -t 1xH100 --name dev-box

# 2. SSH in and iterate
tp ssh <instance_id>
git clone <repo>
pip install -r requirements.txt
python train.py

# 3. Destroy when done
tp cluster destroy <cluster_id>
```
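
For longer interactive runs, a terminal multiplexer keeps the training process alive if the SSH connection drops. A sketch, assuming `tmux` is present on the image (install it via apt otherwise):

```bash
# Start a named session on the instance and run training inside it
tmux new -s train
python train.py

# Detach with Ctrl-b d; after reconnecting via tp ssh, reattach with:
tmux attach -t train
```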

---

## Troubleshooting

### Common Issues

**1. `TENSORPOOL_KEY` not set**
```bash
[ -n "$TENSORPOOL_KEY" ] && echo "set" || echo "not set"
# If not set, connect via the SynSci dashboard or export it manually
```

**2. Cluster stuck in PENDING/PROVISIONING**
```bash
# Check cluster status
tp cluster info <cluster_id>
# Try a different instance type or wait for capacity
```

**3. Can't SSH into cluster**
- Wait for status to reach `RUNNING` (can take a few minutes)
- Verify an SSH key was provided at cluster creation
- For multi-node: SSH into the jumphost first, then access workers

**4. Multi-node workers not accessible**
```bash
# Workers have private IPs only — must go through the jumphost
tp ssh <jumphost-instance-id>
ssh <cluster_id>-0   # from jumphost
```

**5. Storage attachment fails**
- Shared storage: only multi-node clusters (2+ nodes)
- Object storage: works on all cluster types
- Check that storage status is `READY` before attaching

**6. Job stuck in Pending**
```bash
tp job info <job_id>
# Check instance type availability
tp job cancel <job_id>   # Cancel and retry if needed
```

**7. Job Error (non-zero exit code)**
```bash
# Stream logs to see what failed
tp job listen <job_id>
# Fix the script, then re-push
tp job push tp.config.toml
```

**8. Object storage slow for small files**
- Object storage has per-request overhead (HTTP calls)
- Use `boto3` or `rclone` instead of the FUSE mount
- Don't set up Python venvs on object storage (thousands of small files)

---

## Agent Usage Instructions

When the `synsc` agent loads this skill for a user task:

1. **Check credentials first**: Verify `TENSORPOOL_KEY` is set
2. **Determine cluster vs job**: Jobs for batch experiments, clusters for interactive work
3. **Select instance type**: Match the GPU to the workload (H100 for training, L40S for inference, B200 for the largest models)
4. **ALWAYS get user approval before creating resources**: Present instance type, estimated cost/hour, and expected duration. TensorPool bills per-second — the user manages their own billing. Wait for explicit approval.
5. **For jobs**: Create `tp.config.toml`, show it to the user, get approval, then `tp job push`
6. **For clusters**: Show the `tp cluster create` command with instance type and cost, get approval first
7. **Monitor**: Use `tp job listen` or `tp ssh` to track progress
8. **Clean up**: Always destroy clusters and detach storage when done

### Cost Awareness

TensorPool charges per GPU/hour, prorated to the second:
- B200: $4.99/GPU/hr → 8xB200 = ~$40/hr per node
- H200: $2.99/GPU/hr → 8xH200 = ~$24/hr per node
- H100: $1.99/GPU/hr → 8xH100 = ~$16/hr per node
- L40S: $1.49/GPU/hr
- Storage: Shared $100/TB/month, Object $20/TB/month

**ALWAYS present estimated cost before creating any resource.**

### Example Agent Workflow

```
User: "Set up a 2-node H200 cluster for distributed training"

Agent steps:
1. Load skill: tensorpool-gpu-cloud
2. Check TENSORPOOL_KEY is set
3. Present cost estimate: 2x 8xH200 = $47.84/hr ($0.80/min)
4. Wait for explicit user approval
5. tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xH200 -n 2
6. Wait for RUNNING status
7. tp ssh <jumphost-instance-id>
8. Help user with training setup
9. Remind user to destroy cluster when done
```

---

## Quick Reference

| Command | Description |
|---------|-------------|
| `tp cluster create -i <pubkey> -t <type> [-n <nodes>]` | Create GPU cluster |
| `tp cluster list` | List clusters |
| `tp cluster info <id>` | Cluster details |
| `tp cluster destroy <id>` | Terminate cluster |
| `tp cluster attach <cluster_id> <storage_id>` | Attach storage |
| `tp cluster detach <cluster_id> <storage_id>` | Detach storage |
| `tp job init` | Create job config |
| `tp job push <config>` | Submit training job |
| `tp job list` | List jobs |
| `tp job info <id>` | Job details |
| `tp job listen <id>` | Stream job logs |
| `tp job pull <id>` | Download outputs |
| `tp job cancel <id>` | Cancel job |
| `tp storage create -t <type> [-s <size>]` | Create storage |
| `tp storage list` | List storage |
| `tp storage destroy <id>` | Delete storage |
| `tp ssh <instance_id>` | SSH to instance |
| `tp me` | Account info |

## Resources

- **Documentation**: https://docs.tensorpool.dev
- **Dashboard**: https://dashboard.tensorpool.dev
- **Pricing**: https://tensorpool.dev/pricing
- **Community**: https://tensorpool.dev/slack
- **Support**: team@tensorpool.dev
package/bin/synsc
CHANGED
Binary file