@mariozechner/pi 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 Mario Zechner
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,317 @@
+ # GPU Pod Manager
+
+ Quickly deploy LLMs on GPU pods from [Prime Intellect](https://www.primeintellect.ai/), [Vast.ai](https://vast.ai/), [DataCrunch](https://datacrunch.io/), AWS, etc., for local coding agents and AI assistants.
+
+ ## Installation
+
+ ```bash
+ npm install -g @mariozechner/pi
+ ```
+
+ Or run directly with npx:
+ ```bash
+ npx @mariozechner/pi
+ ```
+
+ ## What This Is
+
+ A simple CLI tool that automatically sets up and manages vLLM deployments on GPU pods. Start from a clean Ubuntu pod and have multiple models running in minutes. A GPU pod here means an Ubuntu machine with root access, one or more GPUs, and CUDA drivers installed. The tool is aimed at individuals who are limited by local hardware and want to experiment with large open-weight LLMs in their coding assistant workflows.
+
+ **Key Features:**
+ - **Zero to LLM in minutes** - Automatically installs vLLM and all dependencies on clean pods
+ - **Multi-model management** - Run multiple models concurrently on a single pod
+ - **Smart GPU allocation** - Round-robin assignment of models to available GPUs on multi-GPU pods
+ - **Tensor parallelism** - Run large models across multiple GPUs with `--all-gpus`
+ - **OpenAI-compatible API** - Drop-in replacement for OpenAI API clients with automatic tool/function calling support
+ - **No complex setup** - Just SSH access, no Kubernetes or Docker required
+ - **Privacy first** - vLLM telemetry disabled by default
+
+ **Limitations:**
+ - OpenAI endpoints are exposed to the public internet (yolo)
+ - Requires manual pod creation via Prime Intellect, Vast.ai, AWS, etc.
+ - Assumes an Ubuntu 22+ image when creating pods
+
+ ## What this is not
+ - A provisioning manager for pods. You need to provision pods with the respective provider yourself.
+ - Highly optimized LLM deployment infrastructure chasing the absolute best performance. This is for individuals who want to quickly spin up large open-weight models for local workloads.
+
+ ## Requirements
+
+ - **Node.js 14+** - To run the CLI tool on your machine
+ - **HuggingFace Token** - Required for downloading models (get one at https://huggingface.co/settings/tokens)
+ - **GPU Provider Account** - e.g., Prime Intellect (https://app.primeintellect.ai), Vast.ai, or DataCrunch
+ - **GPU Pod** - At least one running pod with:
+   - Ubuntu 22+ image (selected when creating pod)
+   - SSH access enabled
+   - Clean state (no manual vLLM installation needed)
+ - **Note**: B200 GPUs require PyTorch nightly with CUDA 12.8+ (automatically installed if detected). However, vLLM may need to be built from source for full compatibility.
+
+ ## Quick Start
+
+ ```bash
+ # 1. Get a GPU pod from a provider
+ # Visit https://app.primeintellect.ai or https://vast.ai/ or https://datacrunch.io and create a pod (use Ubuntu 22+ image)
+ # Providers usually give you an SSH command with which to log into the machine. Copy that command.
+
+ # 2. On your local machine, run the following to set up the remote pod. The Hugging Face token
+ # is required for model download.
+ export HF_TOKEN=your_huggingface_token
+ pi setup my-pod-name "ssh root@135.181.71.41 -p 22"
+
+ # 3. Start a model (automatically manages GPU assignment)
+ pi start microsoft/Phi-3-mini-128k-instruct --name phi3 --memory 20%
+
+ # 4. Test the model with a prompt
+ pi prompt phi3 "What is 2+2?"
+ # Response: The answer is 4.
+
+ # 5. Start another model (automatically uses next available GPU on multi-GPU pods)
+ pi start Qwen/Qwen2.5-7B-Instruct --name qwen --memory 30%
+
+ # 6. Check running models
+ pi list
+
+ # 7. Use with your coding agent
+ export OPENAI_BASE_URL='http://135.181.71.41:8001/v1' # For first model
+ export OPENAI_API_KEY='dummy'
+ ```
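+
+ Any OpenAI-compatible client can talk to the pod directly. As a quick sanity check (illustrative only - substitute your pod's IP, the port of your model (8001 for the first model started), and the HuggingFace id you started the model with):
+
+ ```bash
+ # List the model ids the server exposes (vLLM usually serves a model under its HuggingFace id)
+ curl http://135.181.71.41:8001/v1/models
+
+ # Send a chat completion request to the first model
+ curl http://135.181.71.41:8001/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "microsoft/Phi-3-mini-128k-instruct",
+     "messages": [{"role": "user", "content": "What is 2+2?"}],
+     "max_tokens": 64
+   }'
+ ```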
+
+ ## How It Works
+
+ 1. **Automatic Setup**: When you run `pi setup`, it:
+    - Connects to your clean Ubuntu pod
+    - Installs Python, CUDA drivers, and vLLM
+    - Configures HuggingFace tokens
+    - Sets up the model manager
+
+ 2. **Model Management**: Each `pi start` command:
+    - Automatically finds an available GPU (on multi-GPU systems)
+    - Allocates the specified memory fraction
+    - Starts a separate vLLM instance on a unique port, exposed via an OpenAI-compatible API
+    - Manages logs and process lifecycle
+
+ 3. **Multi-GPU Support**: On pods with multiple GPUs:
+    - Individual models are automatically assigned to the next available GPU
+    - Large models can use tensor parallelism with `--all-gpus`
+    - View GPU assignments with `pi list`
+
+
+ ## Commands
+
+ ### Pod Management
+
+ The tool supports managing multiple pods from a single machine. Each pod is identified by a name you choose (e.g., "prod", "dev", "h200"). While all your pods continue running independently, the tool operates on one "active" pod at a time - all model commands (start, stop, list, etc.) are directed to this active pod. You can easily switch which pod is active to manage models on different machines.
+
+ ```bash
+ pi setup <pod-name> "<ssh_command>"   # Configure and activate a pod
+ pi pods                               # List all pods (active pod marked)
+ pi pod <pod-name>                     # Switch active pod
+ pi pod remove <pod-name>              # Remove pod from config
+ pi shell                              # SSH into active pod
+ ```
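+
+ For example, a two-pod workflow could look like this (pod names and SSH commands are illustrative):
+
+ ```bash
+ # Register two pods; `pi setup` also makes the pod it configures the active one
+ pi setup h200 "ssh root@135.181.71.41 -p 22"
+ pi setup dev "ssh root@88.99.12.34 -p 2222"
+
+ # See all pods and which one is currently active
+ pi pods
+
+ # Switch back to the first pod; model commands now target it
+ pi pod h200
+ pi list
+ ```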
+
+ ### Model Management
+
+ Each model runs as a separate vLLM instance with its own port and GPU allocation. The tool automatically manages GPU assignment on multi-GPU systems and ensures models don't conflict. Models are accessed by their short names (either auto-generated or specified with `--name`).
+
+ ```bash
+ pi list                      # List running models on active pod
+ pi search <query>            # Search HuggingFace models
+ pi start <model> [options]   # Start a model with options
+   --name <name>              # Short alias (default: auto-generated)
+   --context <size>           # Context window: 4k, 8k, 16k, 32k (default: model default)
+   --memory <percent>         # GPU memory: 30%, 50%, 90% (default: 90%)
+   --all-gpus                 # Use tensor parallelism across all GPUs
+   --vllm-args                # Pass all remaining args directly to vLLM
+ pi stop [name]               # Stop a model (or all if no name)
+ pi logs <name>               # View logs with tail -f
+ pi prompt <name> "message"   # Quick test prompt
+ ```
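+
+ Putting the commands together, a typical model lifecycle looks like this (model, name, and sizes are illustrative):
+
+ ```bash
+ # Start a model with an explicit name, context window, and memory share
+ pi start Qwen/Qwen2.5-7B-Instruct --name qwen --context 16k --memory 40%
+
+ # Follow the vLLM log while it downloads and loads (Ctrl+C to stop following)
+ pi logs qwen
+
+ # Smoke-test it, then shut it down
+ pi prompt qwen "Write a haiku about GPUs."
+ pi stop qwen
+ ```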
+
+ ## Examples
+
+ ### Search for models
+ ```bash
+ pi search codellama
+ pi search deepseek
+ pi search qwen
+ ```
+
+ **Note**: vLLM does not support formats like GGUF. Read the [docs](https://docs.vllm.ai/en/latest/) for supported model formats.
+
+ ### A100 80GB scenarios
+ ```bash
+ # Small model, high concurrency (~30-50 concurrent requests)
+ pi start microsoft/Phi-3-mini-128k-instruct --name phi3 --memory 30%
+
+ # Medium model, balanced (~10-20 concurrent requests)
+ pi start meta-llama/Llama-3.1-8B-Instruct --name llama8b --memory 50%
+
+ # Large model, limited concurrency (~5-10 concurrent requests)
+ # Note: 70B FP16 weights are ~140GB, so on a single A100 80GB this needs a
+ # quantized variant (or --all-gpus on a multi-GPU pod)
+ pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --memory 90%
+
+ # Run multiple small models
+ pi start Qwen/Qwen2.5-Coder-1.5B --name coder1 --memory 15%
+ pi start microsoft/Phi-3-mini-128k-instruct --name phi3 --memory 15%
+ ```
+
+ ## Understanding Context and Memory
+
+ ### Context Window vs Output Tokens
+ Models are loaded with their default context length; use the `--context` parameter to request a smaller or larger one. The value sets the **total** token budget for input + output combined:
+ - Starting a model with `--context 8k` means 8,192 tokens total
+ - If your prompt uses 6,000 tokens, you have 2,192 tokens left for the response
+ - Each OpenAI API request to the model can specify `max_tokens` to control output length within this budget
+
+ Example:
+ ```bash
+ # Start model with 32k total context
+ pi start meta-llama/Llama-3.1-8B --name llama --context 32k --memory 50%
+
+ # When calling the API, you control output length per request:
+ # - Send 20k token prompt
+ # - Request max_tokens=4000
+ # - Total = 24k (fits within 32k context)
+ ```
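+
+ The same budget applies to every API request. For example (illustrative; substitute your pod's IP and port; the model is the one started above):
+
+ ```bash
+ # Model started with --context 32k: prompt tokens + max_tokens must fit in 32,768
+ curl http://135.181.71.41:8001/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "meta-llama/Llama-3.1-8B",
+     "messages": [{"role": "user", "content": "<roughly 20k tokens of context here>"}],
+     "max_tokens": 4000
+   }'
+ # If prompt + max_tokens exceeds the context window, vLLM typically rejects the
+ # request with a 400 error rather than silently truncating.
+ ```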
+
+ ### GPU Memory and Concurrency
+ vLLM pre-allocates GPU memory according to `gpu_fraction` (set with `--memory`). This matters for coding agents that spawn sub-agents, since every concurrent request needs its own slice of KV-cache memory.
+
+ Example: On an A100 80GB with a 7B model (FP16, ~14GB weights):
+ - `gpu_fraction=0.3` (24GB): ~10GB for KV cache → ~30-50 concurrent requests
+ - `gpu_fraction=0.5` (40GB): ~26GB for KV cache → ~50-80 concurrent requests
+ - `gpu_fraction=0.9` (72GB): ~58GB for KV cache → ~100+ concurrent requests
+
+ Models load in their native precision from HuggingFace (usually FP16/BF16). Check the model card's "Files and versions" tab - look for file sizes: 7B models are ~14GB, 13B are ~26GB, 70B are ~140GB. Quantized models (AWQ or GPTQ in the name) use less memory but may have quality trade-offs.
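+
+ As a rough sanity check you can estimate this yourself. A sketch, assuming a 7B model with 32 layers, a 4096-wide hidden state, and full multi-head attention (models with grouped-query attention need several times less KV cache per token):
+
+ ```bash
+ # FP16 KV cache per token ≈ 2 (K and V) x 32 layers x 4096 x 2 bytes ≈ 0.5 MB
+ # gpu_fraction=0.3 on an A100 80GB → ~24 GB reserved:
+ #   ~14 GB model weights + ~10 GB left for KV cache
+ # 10 GB / 0.5 MB ≈ ~20k tokens of KV cache shared across all active requests,
+ # e.g. ~40 concurrent requests at ~500 tokens each, far fewer at long contexts
+ ```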
+
+ ## Multi-GPU Support
+
+ For pods with multiple GPUs, the tool automatically manages GPU assignment:
+
+ ### Automatic GPU assignment for multiple models
+ ```bash
+ # Each model automatically uses the next available GPU
+ pi start microsoft/Phi-3-mini-128k-instruct --memory 20%   # Auto-assigns to GPU 0
+ pi start Qwen/Qwen2.5-7B-Instruct --memory 20%             # Auto-assigns to GPU 1
+ pi start meta-llama/Llama-3.1-8B --memory 20%              # Auto-assigns to GPU 2
+
+ # Check which GPU each model is using
+ pi list
+ ```
+
+ ### Run large models across all GPUs
+ ```bash
+ # Use --all-gpus for tensor parallelism across all available GPUs
+ pi start meta-llama/Llama-3.1-70B-Instruct --all-gpus
+ pi start Qwen/Qwen2.5-72B-Instruct --all-gpus --context 64k
+ ```
+
+ ### Advanced: Custom vLLM arguments
+ ```bash
+ # Pass custom arguments directly to vLLM with --vllm-args
+ # Everything after --vllm-args is passed to vLLM unchanged
+
+ # Qwen3-Coder 480B on 8xH200 with expert parallelism
+ pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-coder --vllm-args \
+   --data-parallel-size 8 --enable-expert-parallel \
+   --tool-call-parser qwen3_coder --enable-auto-tool-choice --max-model-len 200000
+
+ # DeepSeek with custom quantization
+ pi start deepseek-ai/DeepSeek-Coder-V2-Instruct --name deepseek --vllm-args \
+   --tensor-parallel-size 4 --quantization fp8 --trust-remote-code
+
+ # Mixtral with pipeline parallelism
+ pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm-args \
+   --tensor-parallel-size 8 --pipeline-parallel-size 2
+ ```
+
+ ### Check GPU usage
+ ```bash
+ pi ssh "nvidia-smi"
+ ```
+
+ ## Architecture Notes
+
+ - **Multi-Pod Support**: The tool stores multiple pod configurations in `~/.pi_config` with one active pod at a time.
+ - **Port Allocation**: Each model runs on a separate port (8001, 8002, etc.), allowing multiple models on one GPU (see the example after this list).
+ - **Memory Management**: vLLM uses PagedAttention for efficient memory use with less than 4% waste.
+ - **Model Caching**: Models are downloaded once and cached on the pod.
+ - **Tool Parser Auto-Detection**: The tool automatically selects the appropriate tool parser based on the model:
+   - Qwen models: `hermes` (Qwen3-Coder: `qwen3_coder` if available)
+   - Mistral models: `mistral` with optimized chat template
+   - Llama models: `llama3_json` or `llama4_pythonic` based on version
+   - InternLM models: `internlm`
+   - Phi models: Tool calling disabled by default (no compatible tokens)
+   - Override with `--vllm-args --tool-call-parser <parser> --enable-auto-tool-choice`
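+
+ For example (a sketch; ports follow the start order above, but verify the actual assignment with `pi list`):
+
+ ```bash
+ # First model started → port 8001, second → port 8002
+ export OPENAI_API_KEY='dummy'                           # any non-empty value; the endpoint is unauthenticated (see Limitations)
+ export OPENAI_BASE_URL='http://135.181.71.41:8001/v1'   # talk to the first model
+ # ...or point another tool at the second model on the same pod:
+ export OPENAI_BASE_URL='http://135.181.71.41:8002/v1'
+ ```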
+
+
+ ## Tool Calling (Function Calling)
+
+ Tool calling allows LLMs to request the use of external functions/APIs, but it's a complex feature with many caveats:
+
+ ### The Reality of Tool Calling
+
+ 1. **Model Compatibility**: Not all models support tool calling, even if they claim to. Many models lack the special tokens or training needed for reliable tool parsing.
+
+ 2. **Parser Mismatches**: Different models use different tool calling formats:
+    - Hermes format (XML-like)
+    - Mistral format (specific JSON structure)
+    - Llama format (JSON-based or pythonic)
+    - Custom formats for each model family
+
+ 3. **Common Issues**:
+    - "Could not locate tool call start/end tokens" - Model doesn't have required special tokens
+    - Malformed JSON/XML output - Model wasn't trained for the parser format
+    - Tool calls when you don't want them - Model is overeager to use tools
+    - No tool calls when you need them - Model doesn't understand when to use tools
+
+ ### How We Handle It
+
+ The tool automatically detects the model and tries to use an appropriate parser:
+ - **Qwen models**: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
+ - **Mistral models**: `mistral` parser with custom template
+ - **Llama models**: `llama3_json` or `llama4_pythonic` based on version
+ - **Phi models**: Tool calling disabled (no compatible tokens)
+
+ ### Your Options
+
+ 1. **Let auto-detection handle it** (default):
+    ```bash
+    pi start meta-llama/Llama-3.1-8B-Instruct --name llama
+    ```
+
+ 2. **Force a specific parser** (if you know better):
+    ```bash
+    pi start model/name --name mymodel --vllm-args \
+      --tool-call-parser mistral --enable-auto-tool-choice
+    ```
+
+ 3. **Disable tool calling entirely** (most reliable):
+    ```bash
+    pi start model/name --name mymodel --vllm-args \
+      --disable-tool-call-parser
+    ```
+
+ 4. **Handle tools in your application** (recommended for production):
+    - Send regular prompts asking the model to output JSON
+    - Parse the response in your code
+    - More control, more reliable (see the sketch below)
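+
+ A minimal sketch of that approach with `curl` and `jq` (endpoint and model are the Quick Start values; the prompt and parsing are illustrative):
+
+ ```bash
+ # Ask for JSON in the prompt instead of relying on a tool-call parser
+ curl -s http://135.181.71.41:8001/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "microsoft/Phi-3-mini-128k-instruct",
+     "messages": [{"role": "user", "content": "Reply with JSON only, e.g. {\"city\": \"...\", \"unit\": \"...\"}. Task: weather lookup for Paris in celsius."}],
+     "max_tokens": 128
+   }' | jq -r '.choices[0].message.content'
+ # Validate the JSON yourself and call the real function from your own code.
+ ```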
+
+ ### Best Practices
+
+ - **Test first**: Try a simple tool call to see if it works with your model
+ - **Have a fallback**: Be prepared for tool calling to fail
+ - **Consider alternatives**: Sometimes a well-crafted prompt works better than tool calling
+ - **Read the docs**: Check the model card for tool calling examples
+ - **Monitor logs**: Check `~/.vllm_logs/` for parser errors
+
+ Remember: Tool calling is still an evolving feature in the LLM ecosystem. What works today might break tomorrow with a model update.
+
+ ## Troubleshooting
+
+ - **OOM Errors**: Reduce `gpu_fraction` (the `--memory` setting) or use a smaller model
+ - **Slow Inference**: Could be too many concurrent requests; try increasing `gpu_fraction`
+ - **Connection Refused**: Check that the pod is running and the port is correct
+ - **HF Token Issues**: Ensure `HF_TOKEN` is set before running `pi setup`
+ - **Access Denied**: Some models (like Llama, Mistral) require completing an access request on HuggingFace first. Visit the model page and click "Request access"
+ - **Tool Calling Errors**: See the Tool Calling section above - consider disabling it or using a different model
package/package.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "name": "@mariozechner/pi",
+   "version": "0.1.2",
+   "description": "CLI tool for managing vLLM deployments on GPU pods from Prime Intellect, Vast.ai, etc.",
+   "main": "pi",
+   "bin": {
+     "pi": "pi"
+   },
+   "scripts": {
+     "test": "echo \"Error: no test specified\" && exit 1"
+   },
+   "keywords": [
+     "llm",
+     "vllm",
+     "gpu",
+     "prime-intellect",
+     "ai",
+     "ml",
+     "cli"
+   ],
+   "author": "Mario Zechner",
+   "license": "MIT",
+   "repository": {
+     "type": "git",
+     "url": "git+https://github.com/badlogic/pi.git"
+   },
+   "bugs": {
+     "url": "https://github.com/badlogic/pi/issues"
+   },
+   "homepage": "https://github.com/badlogic/pi#readme",
+   "engines": {
+     "node": ">=14.0.0"
+   },
+   "preferGlobal": true,
+   "files": [
+     "pi",
+     "pod_setup.sh",
+     "vllm_manager.py",
+     "README.md",
+     "LICENSE"
+   ]
+ }