@mariozechner/pi 0.1.5 → 0.2.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +98 -2
- package/package.json +1 -1
- package/pi.js +576 -75
- package/pod_setup.sh +55 -114
- package/vllm_manager.py +167 -4
package/README.md
CHANGED
@@ -39,7 +39,7 @@ A simple CLI tool that automatically sets up and manages vLLM deployments on GPU
 
 - **Node.js 14+** - To run the CLI tool on your machine
 - **HuggingFace Token** - Required for downloading models (get one at https://huggingface.co/settings/tokens)
-- **Prime Intellect Account**
+- **Prime Intellect/DataCrunch/Vast.ai Account**
 - **GPU Pod** - At least one running pod with:
   - Ubuntu 22+ image (selected when creating pod)
   - SSH access enabled
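The HuggingFace token from the prerequisites above is picked up from the `HF_TOKEN` environment variable (the Troubleshooting notes further down say it must be set before setup). A minimal sketch of that step, with a placeholder token value:

```bash
# Export the HuggingFace token (placeholder value) before provisioning the pod
export HF_TOKEN=hf_your_token_here
pi setup
```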
@@ -110,6 +110,33 @@ pi pod remove <pod-name> # Remove pod from config
 pi shell # SSH into active pod
 ```
 
+#### Working with Multiple Pods
+
+You can manage models on any pod without switching the active pod by using the `--pod` parameter:
+
+```bash
+# List models on a specific pod
+pi list --pod prod
+
+# Start a model on a specific pod
+pi start Qwen/Qwen2.5-7B-Instruct --name qwen --pod dev
+
+# Stop a model on a specific pod
+pi stop qwen --pod dev
+
+# View logs from a specific pod
+pi logs qwen --pod dev
+
+# Test a model on a specific pod
+pi prompt qwen "Hello!" --pod dev
+
+# SSH into a specific pod
+pi shell --pod prod
+pi ssh --pod prod "nvidia-smi"
+```
+
+This allows you to manage multiple environments (dev, staging, production) from a single machine without constantly switching between them.
+
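Because every command accepts `--pod`, routine checks across environments can be scripted. A minimal sketch, assuming hypothetical pod names `dev`, `staging`, and `prod` have already been added to the config:

```bash
# List the models running on each configured pod (pod names are placeholders)
for pod in dev staging prod; do
  echo "== $pod =="
  pi list --pod "$pod"
done
```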
 ### Model Management
 
 Each model runs as a separate vLLM instance with its own port and GPU allocation. The tool automatically manages GPU assignment on multi-GPU systems and ensures models don't conflict. Models are accessed by their short names (either auto-generated or specified with --name).
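Since each instance is a regular vLLM server, a running model can also be queried over vLLM's OpenAI-compatible HTTP API once the pod is reachable. A minimal sketch, assuming the instance's port (8000 here) has been forwarded to localhost, e.g. via an SSH tunnel; the port and tunnel are assumptions, not something this README specifies:

```bash
# Query a running model through vLLM's OpenAI-compatible endpoint
# (assumes the vLLM port on the pod, e.g. 8000, is tunnelled to localhost)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```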
@@ -122,12 +149,16 @@ pi start <model> [options] # Start a model with options
 --context <size> # Context window: 4k, 8k, 16k, 32k (default: model default)
 --memory <percent> # GPU memory: 30%, 50%, 90% (default: 90%)
 --all-gpus # Use tensor parallelism across all GPUs
+--pod <pod-name> # Run on specific pod (default: active pod)
 --vllm-args # Pass all remaining args directly to vLLM
 pi stop [name] # Stop a model (or all if no name)
 pi logs <name> # View logs with tail -f
 pi prompt <name> "message" # Quick test prompt
+pi downloads [--live] # Check model download progress (--live for continuous monitoring)
 ```
 
+All model management commands support the `--pod` parameter to target a specific pod without switching the active pod.
+
 ## Examples
 
 ### Search for models
@@ -199,6 +230,11 @@ pi start meta-llama/Llama-3.1-8B --memory 20% # Auto-assigns to GPU 2
 pi list
 ```
 
+### Qwen on a single H200
+```bash
+pi start Qwen/Qwen3-Coder-30B-A3B-Instruct qwen3-30b
+```
+
 ### Run large models across all GPUs
 ```bash
 # Use --all-gpus for tensor parallelism across all available GPUs
@@ -214,7 +250,7 @@ pi start Qwen/Qwen2.5-72B-Instruct --all-gpus --context 64k
 # Qwen3-Coder 480B on 8xH200 with expert parallelism
 pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-coder --vllm-args \
   --data-parallel-size 8 --enable-expert-parallel \
-  --tool-call-parser qwen3_coder --enable-auto-tool-choice --max-model-len 200000
+  --tool-call-parser qwen3_coder --enable-auto-tool-choice --gpu-memory-utilization 0.95 --max-model-len 200000
 
 # DeepSeek with custom quantization
 pi start deepseek-ai/DeepSeek-Coder-V2-Instruct --name deepseek --vllm-args \
@@ -225,6 +261,12 @@ pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm-args \
   --tensor-parallel-size 8 --pipeline-parallel-size 2
 ```
 
+**Note on Special Models**: Some models require specific vLLM arguments to run properly:
+- **Qwen3-Coder 480B**: Requires `--enable-expert-parallel` for MoE support
+- **Kimi K2**: May require custom arguments - check the model's documentation
+- **DeepSeek V3**: Often needs `--trust-remote-code` for custom architectures
+- When in doubt, consult the model's HuggingFace page or documentation for recommended vLLM settings
+
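As a concrete illustration of the note above, such flags are simply appended after `--vllm-args`. A sketch reusing the DeepSeek example from this README with the `--trust-remote-code` flag; treat the exact flag set as illustrative rather than a verified recipe for this model:

```bash
# Forward a model-specific vLLM flag (everything after --vllm-args goes to vLLM)
pi start deepseek-ai/DeepSeek-Coder-V2-Instruct --name deepseek --vllm-args \
  --trust-remote-code
```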
 ### Check GPU usage
 ```bash
 pi ssh "nvidia-smi"
@@ -307,6 +349,33 @@ The tool automatically detects the model and tries to use an appropriate parser:
 
 Remember: Tool calling is still an evolving feature in the LLM ecosystem. What works today might break tomorrow with a model update.
 
+## Monitoring Downloads
+
+Use `pi downloads` to check the progress of model downloads in the HuggingFace cache:
+
+```bash
+pi downloads # Check downloads on active pod
+pi downloads --live # Live monitoring (updates every 2 seconds)
+pi downloads --pod 8h200 # Check downloads on specific pod
+pi downloads --live --pod 8h200 # Live monitoring on specific pod
+```
+
+The command shows:
+- Model name and current size
+- Download progress (files downloaded / total files)
+- Download status (⏬ Downloading or ⏸ Idle)
+- Estimated total size (if available from HuggingFace)
+
+**Tip for large models**: When starting models like Qwen-480B that take time to download, run `pi start` in one terminal and `pi downloads --live` in another to monitor progress. This is especially helpful since the log output during downloads can be minimal.
+
+**Downloads stalled?** If downloads appear stuck (e.g., at 92%), you can safely stop and restart:
+```bash
+pi stop <model-name> # Stop the current process
+pi downloads # Verify progress (e.g., 45/49 files)
+pi start <same-command> # Restart with the same command
+```
+vLLM will automatically use the already-downloaded files and continue from where it left off. This often resolves network or CDN throttling issues.
+
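To cross-check what `pi downloads` reports, you can inspect the HuggingFace cache on the pod directly. A minimal sketch, assuming the default cache location `~/.cache/huggingface/hub` (it moves if `HF_HOME` is overridden):

```bash
# Show the on-disk size of each model in the pod's HuggingFace cache
pi ssh "du -sh ~/.cache/huggingface/hub/*"
```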
 ## Troubleshooting
 
 - **OOM Errors**: Reduce gpu_fraction or use a smaller model
@@ -315,3 +384,30 @@ Remember: Tool calling is still an evolving feature in the LLM ecosystem. What w
 - **HF Token Issues**: Ensure HF_TOKEN is set before running setup
 - **Access Denied**: Some models (like Llama, Mistral) require completing an access request on HuggingFace first. Visit the model page and click "Request access"
 - **Tool Calling Errors**: See the Tool Calling section above - consider disabling it or using a different model
+- **Model Won't Stop**: If `pi stop` fails, force kill all Python processes and verify GPU is free:
+  ```bash
+  pi ssh "killall -9 python3"
+  pi ssh "nvidia-smi" # Should show no processes using GPU
+  ```
+- **Model Deployment Fails**: Pi currently does not check GPU memory utilization before starting models. If deploying a model fails:
+  1. Check if GPUs are full with other models: `pi ssh "nvidia-smi"`
+  2. If memory is insufficient, make room by stopping running models: `pi stop <model_name>`
+  3. If the error persists with sufficient memory, copy the error output and feed it to an LLM for troubleshooting assistance
+
+## Timing notes
+- 8x B200 on DataCrunch, Spot instance
+  - pi setup
+    - 1:27 min
+  - pi start Qwen/Qwen3-Coder-30B-A3B-Instruct
+    - (cold start incl. HF download, kernel warmup) 7:32m
+    - (warm start, HF model already in cache) 1:02m
+
+- 8x H200 on DataCrunch, Spot instance
+  - pi setup
+    - 2:04m
+  - pi start Qwen/Qwen3-Coder-30B-A3B-Instruct
+    - (cold start incl. HF download, kernel warmup) 9:30m
+    - (warm start, HF model already in cache) 1:14m
+  - pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 ...
+    - (cold start incl. HF download, kernel warmup)
+    - (warm start, HF model already in cache)