@mariozechner/pi 0.1.5 → 0.2.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +98 -2
- package/package.json +1 -1
- package/pi.js +576 -75
- package/pod_setup.sh +55 -114
- package/vllm_manager.py +167 -4
package/README.md
CHANGED
@@ -39,7 +39,7 @@ A simple CLI tool that automatically sets up and manages vLLM deployments on GPU
 
 - **Node.js 14+** - To run the CLI tool on your machine
 - **HuggingFace Token** - Required for downloading models (get one at https://huggingface.co/settings/tokens)
-- **Prime Intellect Account**
+- **Prime Intellect/DataCrunch/Vast.ai Account**
 - **GPU Pod** - At least one running pod with:
   - Ubuntu 22+ image (selected when creating pod)
   - SSH access enabled
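The HuggingFace token from the prerequisites above is picked up from the `HF_TOKEN` environment variable (the Troubleshooting notes further down say it must be set before setup). A minimal sketch of that step, with a placeholder token value:

```bash
# Export the HuggingFace token (placeholder value) before provisioning the pod
export HF_TOKEN=hf_your_token_here
pi setup
```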
@@ -110,6 +110,33 @@ pi pod remove <pod-name> # Remove pod from config
 pi shell # SSH into active pod
 ```
 
+#### Working with Multiple Pods
+
+You can manage models on any pod without switching the active pod by using the `--pod` parameter:
+
+```bash
+# List models on a specific pod
+pi list --pod prod
+
+# Start a model on a specific pod
+pi start Qwen/Qwen2.5-7B-Instruct --name qwen --pod dev
+
+# Stop a model on a specific pod
+pi stop qwen --pod dev
+
+# View logs from a specific pod
+pi logs qwen --pod dev
+
+# Test a model on a specific pod
+pi prompt qwen "Hello!" --pod dev
+
+# SSH into a specific pod
+pi shell --pod prod
+pi ssh --pod prod "nvidia-smi"
+```
+
+This allows you to manage multiple environments (dev, staging, production) from a single machine without constantly switching between them.
+
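Because every command accepts `--pod`, routine checks across environments can be scripted. A minimal sketch, assuming hypothetical pod names `dev`, `staging`, and `prod` have already been added to the config:

```bash
# List the models running on each configured pod (pod names are placeholders)
for pod in dev staging prod; do
  echo "== $pod =="
  pi list --pod "$pod"
done
```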
 ### Model Management
 
 Each model runs as a separate vLLM instance with its own port and GPU allocation. The tool automatically manages GPU assignment on multi-GPU systems and ensures models don't conflict. Models are accessed by their short names (either auto-generated or specified with --name).
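Since each instance is a regular vLLM server, a running model can also be queried over vLLM's OpenAI-compatible HTTP API once the pod is reachable. A minimal sketch, assuming the instance's port (8000 here) has been forwarded to localhost, e.g. via an SSH tunnel; the port and tunnel are assumptions, not something this README specifies:

```bash
# Query a running model through vLLM's OpenAI-compatible endpoint
# (assumes the vLLM port on the pod, e.g. 8000, is tunnelled to localhost)
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```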
@@ -122,12 +149,16 @@ pi start <model> [options] # Start a model with options
 --context <size> # Context window: 4k, 8k, 16k, 32k (default: model default)
 --memory <percent> # GPU memory: 30%, 50%, 90% (default: 90%)
 --all-gpus # Use tensor parallelism across all GPUs
+--pod <pod-name> # Run on specific pod (default: active pod)
 --vllm-args # Pass all remaining args directly to vLLM
 pi stop [name] # Stop a model (or all if no name)
 pi logs <name> # View logs with tail -f
 pi prompt <name> "message" # Quick test prompt
+pi downloads [--live] # Check model download progress (--live for continuous monitoring)
 ```
 
+All model management commands support the `--pod` parameter to target a specific pod without switching the active pod.
+
 ## Examples
 
 ### Search for models
@@ -199,6 +230,11 @@ pi start meta-llama/Llama-3.1-8B --memory 20% # Auto-assigns to GPU 2
 pi list
 ```
 
+### Qwen on a single H200
+```bash
+pi start Qwen/Qwen3-Coder-30B-A3B-Instruct qwen3-30b
+```
+
 ### Run large models across all GPUs
 ```bash
 # Use --all-gpus for tensor parallelism across all available GPUs
@@ -214,7 +250,7 @@ pi start Qwen/Qwen2.5-72B-Instruct --all-gpus --context 64k
 # Qwen3-Coder 480B on 8xH200 with expert parallelism
 pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-coder --vllm-args \
   --data-parallel-size 8 --enable-expert-parallel \
-  --tool-call-parser qwen3_coder --enable-auto-tool-choice --max-model-len 200000
+  --tool-call-parser qwen3_coder --enable-auto-tool-choice --gpu-memory-utilization 0.95 --max-model-len 200000
 
 # DeepSeek with custom quantization
 pi start deepseek-ai/DeepSeek-Coder-V2-Instruct --name deepseek --vllm-args \
@@ -225,6 +261,12 @@ pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm-args \
   --tensor-parallel-size 8 --pipeline-parallel-size 2
 ```
 
+**Note on Special Models**: Some models require specific vLLM arguments to run properly:
+- **Qwen3-Coder 480B**: Requires `--enable-expert-parallel` for MoE support
+- **Kimi K2**: May require custom arguments - check the model's documentation
+- **DeepSeek V3**: Often needs `--trust-remote-code` for custom architectures
+- When in doubt, consult the model's HuggingFace page or documentation for recommended vLLM settings
+
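As a concrete illustration of the note above, such flags are simply appended after `--vllm-args`. A sketch reusing the DeepSeek example from this README with the `--trust-remote-code` flag; treat the exact flag set as illustrative rather than a verified recipe for this model:

```bash
# Forward a model-specific vLLM flag (everything after --vllm-args goes to vLLM)
pi start deepseek-ai/DeepSeek-Coder-V2-Instruct --name deepseek --vllm-args \
  --trust-remote-code
```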
 ### Check GPU usage
 ```bash
 pi ssh "nvidia-smi"
@@ -307,6 +349,33 @@ The tool automatically detects the model and tries to use an appropriate parser:
 
 Remember: Tool calling is still an evolving feature in the LLM ecosystem. What works today might break tomorrow with a model update.
 
+## Monitoring Downloads
+
+Use `pi downloads` to check the progress of model downloads in the HuggingFace cache:
+
+```bash
+pi downloads # Check downloads on active pod
+pi downloads --live # Live monitoring (updates every 2 seconds)
+pi downloads --pod 8h200 # Check downloads on specific pod
+pi downloads --live --pod 8h200 # Live monitoring on specific pod
+```
+
+The command shows:
+- Model name and current size
+- Download progress (files downloaded / total files)
+- Download status (⏬ Downloading or ⏸ Idle)
+- Estimated total size (if available from HuggingFace)
+
+**Tip for large models**: When starting models like Qwen-480B that take time to download, run `pi start` in one terminal and `pi downloads --live` in another to monitor progress. This is especially helpful since the log output during downloads can be minimal.
+
+**Downloads stalled?** If downloads appear stuck (e.g., at 92%), you can safely stop and restart:
+```bash
+pi stop <model-name> # Stop the current process
+pi downloads # Verify progress (e.g., 45/49 files)
+pi start <same-command> # Restart with the same command
+```
+vLLM will automatically use the already-downloaded files and continue from where it left off. This often resolves network or CDN throttling issues.
+
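To cross-check what `pi downloads` reports, you can inspect the HuggingFace cache on the pod directly. A minimal sketch, assuming the default cache location `~/.cache/huggingface/hub` (it moves if `HF_HOME` is overridden):

```bash
# Show the on-disk size of each model in the pod's HuggingFace cache
pi ssh "du -sh ~/.cache/huggingface/hub/*"
```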
 ## Troubleshooting
 
 - **OOM Errors**: Reduce gpu_fraction or use a smaller model
@@ -315,3 +384,30 @@ Remember: Tool calling is still an evolving feature in the LLM ecosystem. What w
 - **HF Token Issues**: Ensure HF_TOKEN is set before running setup
 - **Access Denied**: Some models (like Llama, Mistral) require completing an access request on HuggingFace first. Visit the model page and click "Request access"
 - **Tool Calling Errors**: See the Tool Calling section above - consider disabling it or using a different model
+- **Model Won't Stop**: If `pi stop` fails, force kill all Python processes and verify GPU is free:
+  ```bash
+  pi ssh "killall -9 python3"
+  pi ssh "nvidia-smi" # Should show no processes using GPU
+  ```
+- **Model Deployment Fails**: Pi currently does not check GPU memory utilization before starting models. If deploying a model fails:
+  1. Check if GPUs are full with other models: `pi ssh "nvidia-smi"`
+  2. If memory is insufficient, make room by stopping running models: `pi stop <model_name>`
+  3. If the error persists with sufficient memory, copy the error output and feed it to an LLM for troubleshooting assistance
+
+## Timing notes
+- 8x B200 on DataCrunch, Spot instance
+  - pi setup
+    - 1:27 min
+  - pi start Qwen/Qwen3-Coder-30B-A3B-Instruct
+    - (cold start incl. HF download, kernel warmup) 7:32m
+    - (warm start, HF model already in cache) 1:02m
+
+- 8x H200 on DataCrunch, Spot instance
+  - pi setup
+    - 2:04m
+  - pi start Qwen/Qwen3-Coder-30B-A3B-Instruct
+    - (cold start incl. HF download, kernel warmup) 9:30m
+    - (warm start, HF model already in cache) 1:14m
+  - pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 ...
+    - (cold start incl. HF download, kernel warmup)
+    - (warm start, HF model already in cache)