@mariozechner/pi 0.1.5 → 0.2.4

This diff shows the changes between publicly released package versions as they appear in their respective public registries, and is provided for informational purposes only.
package/README.md CHANGED
@@ -39,7 +39,7 @@ A simple CLI tool that automatically sets up and manages vLLM deployments on GPU
 
  - **Node.js 14+** - To run the CLI tool on your machine
  - **HuggingFace Token** - Required for downloading models (get one at https://huggingface.co/settings/tokens)
- - **Prime Intellect Account** - Sign up at https://app.primeintellect.ai
+ - **Prime Intellect/DataCrunch/Vast.ai Account**
  - **GPU Pod** - At least one running pod with:
  - Ubuntu 22+ image (selected when creating pod)
  - SSH access enabled
@@ -110,6 +110,33 @@ pi pod remove <pod-name> # Remove pod from config
  pi shell # SSH into active pod
  ```
 
+ #### Working with Multiple Pods
+
+ You can manage models on any pod without switching the active pod by using the `--pod` parameter:
+
+ ```bash
+ # List models on a specific pod
+ pi list --pod prod
+
+ # Start a model on a specific pod
+ pi start Qwen/Qwen2.5-7B-Instruct --name qwen --pod dev
+
+ # Stop a model on a specific pod
+ pi stop qwen --pod dev
+
+ # View logs from a specific pod
+ pi logs qwen --pod dev
+
+ # Test a model on a specific pod
+ pi prompt qwen "Hello!" --pod dev
+
+ # SSH into a specific pod
+ pi shell --pod prod
+ pi ssh --pod prod "nvidia-smi"
+ ```
+
+ This allows you to manage multiple environments (dev, staging, production) from a single machine without constantly switching between them.
+
  ### Model Management
 
  Each model runs as a separate vLLM instance with its own port and GPU allocation. The tool automatically manages GPU assignment on multi-GPU systems and ensures models don't conflict. Models are accessed by their short names (either auto-generated or specified with --name).
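
For instance, a sketch using only commands documented in this README (the model choices and short names are placeholders), showing two models sharing a multi-GPU pod and being addressed by their short names:

```bash
pi start Qwen/Qwen2.5-7B-Instruct --name qwen    # gets its own port and GPU assignment
pi start meta-llama/Llama-3.1-8B --name llama    # auto-assigned to another free GPU
pi list                                          # both models listed by short name
pi prompt qwen "Hello!"                          # talk to a model via its short name
pi stop llama                                    # stop one model without touching the other
```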
@@ -122,12 +149,16 @@ pi start <model> [options] # Start a model with options
  --context <size> # Context window: 4k, 8k, 16k, 32k (default: model default)
  --memory <percent> # GPU memory: 30%, 50%, 90% (default: 90%)
  --all-gpus # Use tensor parallelism across all GPUs
+ --pod <pod-name> # Run on specific pod (default: active pod)
  --vllm-args # Pass all remaining args directly to vLLM
  pi stop [name] # Stop a model (or all if no name)
  pi logs <name> # View logs with tail -f
  pi prompt <name> "message" # Quick test prompt
+ pi downloads [--live] # Check model download progress (--live for continuous monitoring)
  ```
 
+ All model management commands support the `--pod` parameter to target a specific pod without switching the active pod.
+
  ## Examples
 
  ### Search for models
@@ -199,6 +230,11 @@ pi start meta-llama/Llama-3.1-8B --memory 20% # Auto-assigns to GPU 2
  pi list
  ```
 
+ ### Qwen3-Coder 30B on a single H200
+ ```bash
+ pi start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3-30b
+ ```
+
  ### Run large models across all GPUs
  ```bash
  # Use --all-gpus for tensor parallelism across all available GPUs
@@ -214,7 +250,7 @@ pi start Qwen/Qwen2.5-72B-Instruct --all-gpus --context 64k
  # Qwen3-Coder 480B on 8xH200 with expert parallelism
  pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-coder --vllm-args \
  --data-parallel-size 8 --enable-expert-parallel \
- --tool-call-parser qwen3_coder --enable-auto-tool-choice --max-model-len 200000
+ --tool-call-parser qwen3_coder --enable-auto-tool-choice --gpu-memory-utilization 0.95 --max-model-len 200000
 
  # DeepSeek with custom quantization
  pi start deepseek-ai/DeepSeek-Coder-V2-Instruct --name deepseek --vllm-args \
@@ -225,6 +261,12 @@ pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm-args \
  --tensor-parallel-size 8 --pipeline-parallel-size 2
  ```
 
+ **Note on Special Models**: Some models require specific vLLM arguments to run properly:
+ - **Qwen3-Coder 480B**: Requires `--enable-expert-parallel` for MoE support
+ - **Kimi K2**: May require custom arguments - check the model's documentation
+ - **DeepSeek V3**: Often needs `--trust-remote-code` for custom architectures
+ - When in doubt, consult the model's HuggingFace page or documentation for recommended vLLM settings
+
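
Such model-specific flags are forwarded through the same `--vllm-args` passthrough shown in the examples above. A minimal sketch for DeepSeek V3 (the model ID is real, but the parallelism value is an illustrative assumption, not a tested setting):

```bash
# Sketch: forward model-specific vLLM flags after --vllm-args
pi start deepseek-ai/DeepSeek-V3 --name deepseek-v3 --vllm-args \
  --trust-remote-code --tensor-parallel-size 8
```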
  ### Check GPU usage
  ```bash
  pi ssh "nvidia-smi"
@@ -307,6 +349,33 @@ The tool automatically detects the model and tries to use an appropriate parser:
 
  Remember: Tool calling is still an evolving feature in the LLM ecosystem. What works today might break tomorrow with a model update.
 
+ ## Monitoring Downloads
+
+ Use `pi downloads` to check the progress of model downloads in the HuggingFace cache:
+
+ ```bash
+ pi downloads # Check downloads on active pod
+ pi downloads --live # Live monitoring (updates every 2 seconds)
+ pi downloads --pod 8h200 # Check downloads on specific pod
+ pi downloads --live --pod 8h200 # Live monitoring on specific pod
+ ```
+
+ The command shows:
+ - Model name and current size
+ - Download progress (files downloaded / total files)
+ - Download status (⏬ Downloading or ⏸ Idle)
+ - Estimated total size (if available from HuggingFace)
+
+ **Tip for large models**: When starting models like Qwen-480B that take time to download, run `pi start` in one terminal and `pi downloads --live` in another to monitor progress. This is especially helpful since the log output during downloads can be minimal.
+
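
For example, reusing a command from the examples above:

```bash
# Terminal 1: start a large model (the HuggingFace download happens on the pod)
pi start Qwen/Qwen2.5-72B-Instruct --all-gpus --context 64k

# Terminal 2: watch the download progress on the same pod
pi downloads --live
```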
+ **Downloads stalled?** If downloads appear stuck (e.g., at 92%), you can safely stop and restart:
+ ```bash
+ pi stop <model-name> # Stop the current process
+ pi downloads # Verify progress (e.g., 45/49 files)
+ pi start <same-command> # Restart with the same command
+ ```
+ vLLM will automatically use the already-downloaded files and continue from where it left off. This often resolves network or CDN throttling issues.
+
  ## Troubleshooting
 
  - **OOM Errors**: Reduce gpu_fraction or use a smaller model
@@ -315,3 +384,30 @@ Remember: Tool calling is still an evolving feature in the LLM ecosystem. What w
  - **HF Token Issues**: Ensure HF_TOKEN is set before running setup
  - **Access Denied**: Some models (like Llama, Mistral) require completing an access request on HuggingFace first. Visit the model page and click "Request access"
  - **Tool Calling Errors**: See the Tool Calling section above - consider disabling it or using a different model
+ - **Model Won't Stop**: If `pi stop` fails, force kill all Python processes and verify GPU is free:
+ ```bash
+ pi ssh "killall -9 python3"
+ pi ssh "nvidia-smi" # Should show no processes using GPU
+ ```
+ - **Model Deployment Fails**: Pi currently does not check GPU memory utilization before starting models. If deploying a model fails:
+ 1. Check if GPUs are full with other models: `pi ssh "nvidia-smi"`
+ 2. If memory is insufficient, make room by stopping running models: `pi stop <model_name>`
+ 3. If the error persists with sufficient memory, copy the error output and feed it to an LLM for troubleshooting assistance
+
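
A minimal sketch of the "Model Deployment Fails" recovery steps above, using only commands documented in this README (the model names are placeholders):

```bash
# Inspect GPU memory and see which models are already running
pi ssh "nvidia-smi"
pi list

# Free memory by stopping a running model, then retry the deployment
pi stop qwen
pi start meta-llama/Llama-3.1-8B --memory 50%
```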
+ ## Timing Notes
+ - 8x B200 on DataCrunch, Spot instance
+   - pi setup
+     - 1:27m
+   - pi start Qwen/Qwen3-Coder-30B-A3B-Instruct
+     - (cold start incl. HF download, kernel warmup) 7:32m
+     - (warm start, HF model already in cache) 1:02m
+
+ - 8x H200 on DataCrunch, Spot instance
+   - pi setup
+     - 2:04m
+   - pi start Qwen/Qwen3-Coder-30B-A3B-Instruct
+     - (cold start incl. HF download, kernel warmup) 9:30m
+     - (warm start, HF model already in cache) 1:14m
+   - pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 ...
+     - (cold start incl. HF download, kernel warmup)
+     - (warm start, HF model already in cache)
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@mariozechner/pi",
- "version": "0.1.5",
+ "version": "0.2.4",
  "description": "CLI tool for managing vLLM deployments on GPU pods from Prime Intellect, Vast.ai, etc.",
  "main": "pi.js",
  "bin": {