@mariozechner/pi 0.2.4 → 0.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -1,6 +1,6 @@
1
- # GPU Pod Manager
1
+ # pi
2
2
 
3
- Quickly deploy LLMs on GPU pods from [Prime Intellect](https://www.primeintellect.ai/), [Vast.ai](https://vast.ai/), [DataCrunch](datacrunch.io), AWS, etc., for local coding agents and AI assistants.
3
+ Deploy and manage LLMs on GPU pods with automatic vLLM configuration for agentic workloads.
4
4
 
5
5
  ## Installation
6
6
 
@@ -8,406 +8,504 @@ Quickly deploy LLMs on GPU pods from [Prime Intellect](https://www.primeintellec
8
8
  npm install -g @mariozechner/pi
9
9
  ```
10
10
 
11
- Or run directly with npx:
12
- ```bash
13
- npx @mariozechner/pi
14
- ```
15
-
16
- ## What This Is
11
+ ## What is pi?
17
12
 
18
- A simple CLI tool that automatically sets up and manages vLLM deployments on GPU pods. Start from a clean Ubuntu pod and have multiple models running in minutes. A GPU pod is defined as an Ubuntu machine with root access, one or more GPUs, and Cuda drivers installed. It is aimed at individuals who are limited by local hardware and want to experiment with large open weight LLMs for their coding assistent workflows.
13
+ `pi` simplifies running large language models on remote GPU pods. It automatically:
14
+ - Sets up vLLM on fresh Ubuntu pods
15
+ - Configures tool calling for agentic models (Qwen, GPT-OSS, GLM, etc.)
16
+ - Manages multiple models on the same pod with "smart" GPU allocation
17
+ - Provides OpenAI-compatible API endpoints for each model
18
+ - Includes an interactive agent with file system tools for testing
19
19
 
20
- **Key Features:**
21
- - **Zero to LLM in minutes** - Automatically installs vLLM and all dependencies on clean pods
22
- - **Multi-model management** - Run multiple models concurrently on a single pod
23
- - **Smart GPU allocation** - Round robin assigns models to available GPUs on multi-GPU pods
24
- - **Tensor parallelism** - Run large models across multiple GPUs with `--all-gpus`
25
- - **OpenAI-compatible API** - Drop-in replacement for OpenAI API clients with automatic tool/function calling support
26
- - **No complex setup** - Just SSH access, no Kubernetes or Docker required
27
- - **Privacy first** - vLLM telemetry disabled by default
20
+ ## Quick Start
28
21
 
29
- **Limitations:**
30
- - OpenAI endpoints exposed to the public internet (yolo)
31
- - Requires manual pod creation via Prime Intellect, Vast.ai, AWS, etc.
32
- - Assumes Ubuntu 22 image when creating pods
22
+ ```bash
23
+ # Set required environment variables
24
+ export HF_TOKEN=your_huggingface_token # Get from https://huggingface.co/settings/tokens
25
+ export PI_API_KEY=your_api_key # Any string you want for API authentication
33
26
 
34
- ## What this is not
35
- - A provisioning manager for pods. You need to provision the pods on the respective provider themselves.
36
- - Super optimized LLM deployment infrastructure for absolute best performance. This is for individuals who want to quickly spin up large open weights models for local LLM loads.
27
+ # Setup a DataCrunch pod with NFS storage (models path auto-extracted)
28
+ pi pods setup dc1 "ssh root@1.2.3.4" \
29
+ --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
37
30
 
38
- ## Requirements
31
+ # Start a model (automatic configuration for known models)
32
+ pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen
39
33
 
40
- - **Node.js 14+** - To run the CLI tool on your machine
41
- - **HuggingFace Token** - Required for downloading models (get one at https://huggingface.co/settings/tokens)
42
- - **Prime Intellect/DataCrunch/Vast.ai Account**
43
- - **GPU Pod** - At least one running pod with:
44
- - Ubuntu 22+ image (selected when creating pod)
45
- - SSH access enabled
46
- - Clean state (no manual vLLM installation needed)
47
- - **Note**: B200 GPUs require PyTorch nightly with CUDA 12.8+ (automatically installed if detected). However, vLLM may need to be built from source for full compatibility.
34
+ # Send a single message to the model
35
+ pi agent qwen "What is the Fibonacci sequence?"
48
36
 
49
- ## Quick Start
37
+ # Interactive chat mode with file system tools
38
+ pi agent qwen -i
50
39
 
51
- ```bash
52
- # 1. Get a GPU pod from Prime Intellect
53
- # Visit https://app.primeintellect.ai or https://vast.ai/ or https://datacrunch.io and create a pod (use Ubuntu 22+ image)
54
- # Providers usually give you an SSH command with which to log into the machine. Copy that command.
40
+ # Use with any OpenAI-compatible client
41
+ export OPENAI_BASE_URL='http://1.2.3.4:8001/v1'
42
+ export OPENAI_API_KEY=$PI_API_KEY
43
+ ```
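+
+ For a quick end-to-end check without any client library, the same endpoint can be hit directly with `curl`. A minimal sketch; the IP, port, and model name are the placeholders from the example above:
+
+ ```bash
+ curl http://1.2.3.4:8001/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -H "Authorization: Bearer $PI_API_KEY" \
+   -d '{
+     "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
+     "messages": [{"role": "user", "content": "Say hello in one sentence."}]
+   }'
+ ```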
55
44
 
56
- # 2. On your local machine, run the following to setup the remote pod. The Hugging Face token
57
- # is required for model download.
58
- export HF_TOKEN=your_huggingface_token
59
- pi setup my-pod-name "ssh root@135.181.71.41 -p 22"
45
+ ## Prerequisites
60
46
 
61
- # 3. Start a model (automatically manages GPU assignment)
62
- pi start microsoft/Phi-3-mini-128k-instruct --name phi3 --memory 20%
47
+ - Node.js 18+
48
+ - HuggingFace token (for model downloads)
49
+ - GPU pod with:
50
+ - Ubuntu 22.04 or 24.04
51
+ - SSH root access
52
+ - NVIDIA drivers installed
53
+ - Persistent storage for models
63
54
 
64
- # 4. Test the model with a prompt
65
- pi prompt phi3 "What is 2+2?"
66
- # Response: The answer is 4.
55
+ ## Supported Providers
67
56
 
68
- # 5. Start another model (automatically uses next available GPU on multi-GPU pods)
69
- pi start Qwen/Qwen2.5-7B-Instruct --name qwen --memory 30%
57
+ ### Primary Support
70
58
 
71
- # 6. Check running models
72
- pi list
59
+ **DataCrunch** - Best for shared model storage
60
+ - NFS volumes shareable across multiple pods in the same region
61
+ - Download models once, use them everywhere
62
+ - Ideal for teams or multiple experiments
73
63
 
74
- # 7. Use with your coding agent
75
- export OPENAI_BASE_URL='http://135.181.71.41:8001/v1' # For first model
76
- export OPENAI_API_KEY='dummy'
77
- ```
64
+ **RunPod** - Good persistent storage
65
+ - Network volumes persist independently
66
+ - Volumes cannot be shared between running pods simultaneously
67
+ - Good for single-pod workflows
78
68
 
79
- ## How It Works
69
+ ### Also Works With
70
+ - Vast.ai (volumes locked to specific machine)
71
+ - Prime Intellect (no persistent storage)
72
+ - AWS EC2 (with EFS setup)
73
+ - Any Ubuntu machine with NVIDIA GPUs, CUDA driver, and SSH
80
74
 
81
- 1. **Automatic Setup**: When you run `pi setup`, it:
82
- - Connects to your clean Ubuntu pod
83
- - Installs Python, CUDA drivers, and vLLM
84
- - Configures HuggingFace tokens
85
- - Sets up the model manager
75
+ ## Commands
86
76
 
87
- 2. **Model Management**: Each `pi start` command:
88
- - Automatically finds an available GPU (on multi-GPU systems)
89
- - Allocates the specified memory fraction
90
- - Starts a separate vLLM instance on a unique port accessible via the OpenAI API protocol
91
- - Manages logs and process lifecycle
77
+ ### Pod Management
92
78
 
93
- 3. **Multi-GPU Support**: On pods with multiple GPUs:
94
- - Single models automatically distribute across available GPUs
95
- - Large models can use tensor parallelism with `--all-gpus`
96
- - View GPU assignments with `pi list`
79
+ ```bash
80
+ pi pods setup <name> "<ssh>" [options] # Setup new pod
81
+ --mount "<mount_command>" # Run mount command during setup
82
+ --models-path <path> # Override extracted path (optional)
83
+ --vllm release|nightly|gpt-oss # vLLM version (default: release)
84
+
85
+ pi pods # List all configured pods
86
+ pi pods active <name> # Switch active pod
87
+ pi pods remove <name> # Remove pod from local config
88
+ pi shell [<name>] # SSH into pod
89
+ pi ssh [<name>] "<command>" # Run command on pod
90
+ ```
97
91
 
92
+ **Note**: When using `--mount`, the models path is automatically extracted from the mount command's target directory. You only need `--models-path` if not using `--mount` or to override the extracted path.
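+
+ For illustration, both forms are sketched below; the pod names, SSH targets, NFS export, and paths are placeholders:
+
+ ```bash
+ # Models path extracted automatically from the mount target (/mnt/hf-models)
+ pi pods setup dc1 "ssh root@1.2.3.4" \
+   --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
+
+ # No mount command: point pi at an existing directory explicitly
+ pi pods setup box1 "ssh root@1.2.3.4" --models-path /workspace
+ ```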
98
93
 
99
- ## Commands
94
+ #### vLLM Version Options
100
95
 
101
- ### Pod Management
96
+ - `release` (default): Stable vLLM release, recommended for most users
97
+ - `nightly`: Latest vLLM features, needed for newest models like GLM-4.5
98
+ - `gpt-oss`: Special build for OpenAI's GPT-OSS models only
102
99
 
103
- The tool supports managing multiple Prime Intellect pods from a single machine. Each pod is identified by a name you choose (e.g., "prod", "dev", "h200"). While all your pods continue running independently, the tool operates on one "active" pod at a time - all model commands (start, stop, list, etc.) are directed to this active pod. You can easily switch which pod is active to manage models on different machines.
100
+ ### Model Management
104
101
 
105
102
  ```bash
106
- pi setup <pod-name> "<ssh_command>" # Configure and activate a pod
107
- pi pods # List all pods (active pod marked)
108
- pi pod <pod-name> # Switch active pod
109
- pi pod remove <pod-name> # Remove pod from config
110
- pi shell # SSH into active pod
103
+ pi start <model> --name <name> [options] # Start a model
104
+ --memory <percent> # GPU memory: 30%, 50%, 90% (default: 90%)
105
+ --context <size> # Context window: 4k, 8k, 16k, 32k, 64k, 128k
106
+ --gpus <count> # Number of GPUs to use (predefined models only)
107
+ --pod <name> # Target specific pod (overrides active)
108
+ --vllm <args...> # Pass custom args directly to vLLM
109
+
110
+ pi stop [<name>] # Stop model (or all if no name given)
111
+ pi list # List running models with status
112
+ pi logs <name> # Stream model logs (tail -f)
111
113
  ```
112
114
 
113
- #### Working with Multiple Pods
114
-
115
- You can manage models on any pod without switching the active pod by using the `--pod` parameter:
115
+ ### Agent & Chat Interface
116
116
 
117
117
  ```bash
118
- # List models on a specific pod
119
- pi list --pod prod
118
+ pi agent <name> "<message>" # Single message to model
119
+ pi agent <name> "<msg1>" "<msg2>" # Multiple messages in sequence
120
+ pi agent <name> -i # Interactive chat mode
121
+ pi agent <name> -i -c # Continue previous session
122
+
123
+ # Standalone OpenAI-compatible agent (works with any API)
124
+ pi-agent --base-url http://localhost:8000/v1 --model llama-3.1 "Hello"
125
+ pi-agent --api-key sk-... "What is 2+2?" # Uses OpenAI by default
126
+ pi-agent --json "What is 2+2?" # Output event stream as JSONL
127
+ pi-agent -i # Interactive mode
128
+ ```
120
129
 
121
- # Start a model on a specific pod
122
- pi start Qwen/Qwen2.5-7B-Instruct --name qwen --pod dev
130
+ The agent includes file system tools (read, list, bash, glob, rg) for exercising agentic capabilities; they are particularly useful for code navigation and analysis tasks.
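+
+ A couple of illustrative invocations (the model alias `qwen` and the prompts are just examples):
+
+ ```bash
+ # One-shot code navigation question using the built-in file tools
+ pi agent qwen "Use rg to find every TODO in src/ and summarize them"
+
+ # Interactive session, continuing the previous conversation for this project
+ pi agent qwen -i -c
+ ```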
123
131
 
124
- # Stop a model on a specific pod
125
- pi stop qwen --pod dev
132
+ ## Predefined Model Configurations
126
133
 
127
- # View logs from a specific pod
128
- pi logs qwen --pod dev
134
+ `pi` includes predefined configurations for popular agentic models, so you do not have to specify `--vllm` arguments manually. It also checks whether the selected model can actually run on your pod, given the number of GPUs and available VRAM. Run `pi start` without additional arguments to see a list of predefined models that can run on the active pod.
135
+
136
+ ### Qwen Models
137
+ ```bash
138
+ # Qwen2.5-Coder-32B - Excellent coding model, fits on single H100/H200
139
+ pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen
129
140
 
130
- # Test a model on a specific pod
131
- pi prompt qwen "Hello!" --pod dev
141
+ # Qwen3-Coder-30B - Advanced reasoning with tool use
142
+ pi start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3
132
143
 
133
- # SSH into a specific pod
134
- pi shell --pod prod
135
- pi ssh --pod prod "nvidia-smi"
144
+ # Qwen3-Coder-480B - State-of-the-art on 8xH200 (data-parallel mode)
145
+ pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-480b
136
146
  ```
137
147
 
138
- This allows you to manage multiple environments (dev, staging, production) from a single machine without constantly switching between them.
148
+ ### GPT-OSS Models
149
+ ```bash
150
+ # Requires special vLLM build during setup
151
+ pi pods setup gpt-pod "ssh root@1.2.3.4" --models-path /workspace --vllm gpt-oss
139
152
 
140
- ### Model Management
153
+ # GPT-OSS-20B - Fits on 16GB+ VRAM
154
+ pi start openai/gpt-oss-20b --name gpt20
141
155
 
142
- Each model runs as a separate vLLM instance with its own port and GPU allocation. The tool automatically manages GPU assignment on multi-GPU systems and ensures models don't conflict. Models are accessed by their short names (either auto-generated or specified with --name).
156
+ # GPT-OSS-120B - Needs 60GB+ VRAM
157
+ pi start openai/gpt-oss-120b --name gpt120
158
+ ```
143
159
 
160
+ ### GLM Models
144
161
  ```bash
145
- pi list # List running models on active pod
146
- pi search <query> # Search HuggingFace models
147
- pi start <model> [options] # Start a model with options
148
- --name <name> # Short alias (default: auto-generated)
149
- --context <size> # Context window: 4k, 8k, 16k, 32k (default: model default)
150
- --memory <percent> # GPU memory: 30%, 50%, 90% (default: 90%)
151
- --all-gpus # Use tensor parallelism across all GPUs
152
- --pod <pod-name> # Run on specific pod (default: active pod)
153
- --vllm-args # Pass all remaining args directly to vLLM
154
- pi stop [name] # Stop a model (or all if no name)
155
- pi logs <name> # View logs with tail -f
156
- pi prompt <name> "message" # Quick test prompt
157
- pi downloads [--live] # Check model download progress (--live for continuous monitoring)
162
+ # GLM-4.5 - Requires 8-16 GPUs, includes thinking mode
163
+ pi start zai-org/GLM-4.5 --name glm
164
+
165
+ # GLM-4.5-Air - Smaller version, 1-2 GPUs
166
+ pi start zai-org/GLM-4.5-Air --name glm-air
158
167
  ```
159
168
 
160
- All model management commands support the `--pod` parameter to target a specific pod without switching the active pod.
169
+ ### Custom Models with --vllm
161
170
 
162
- ## Examples
171
+ For models not in the predefined list, use `--vllm` to pass arguments directly to vLLM:
163
172
 
164
- ### Search for models
165
173
  ```bash
166
- pi search codellama
167
- pi search deepseek
168
- pi search qwen
174
+ # DeepSeek with custom settings
175
+ pi start deepseek-ai/DeepSeek-V3 --name deepseek --vllm \
176
+ --tensor-parallel-size 4 --trust-remote-code
177
+
178
+ # Mistral with pipeline parallelism
179
+ pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm \
180
+ --tensor-parallel-size 8 --pipeline-parallel-size 2
181
+
182
+ # Any model with specific tool parser
183
+ pi start some/model --name mymodel --vllm \
184
+ --tool-call-parser hermes --enable-auto-tool-choice
169
185
  ```
170
186
 
171
- **Note**: vLLM does not support formats like GGUF. Read the [docs](https://docs.vllm.ai/en/latest/)
187
+ ## DataCrunch Setup
172
188
 
173
- ### A100 80GB scenarios
174
- ```bash
175
- # Small model, high concurrency (~30-50 concurrent requests)
176
- pi start microsoft/Phi-3-mini-128k-instruct --name phi3 --memory 30%
189
+ DataCrunch offers the best experience with shared NFS storage across pods:
177
190
 
178
- # Medium model, balanced (~10-20 concurrent requests)
179
- pi start meta-llama/Llama-3.1-8B-Instruct --name llama8b --memory 50%
191
+ ### 1. Create Shared Filesystem (SFS)
192
+ - Go to the DataCrunch dashboard → Storage → Create SFS
193
+ - Choose size and datacenter
194
+ - Note the mount command (e.g., `sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/hf-models-fin02-8ac1bab7 /mnt/hf-models-fin02`)
180
195
 
181
- # Large model, limited concurrency (~5-10 concurrent requests)
182
- pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --memory 90%
196
+ ### 2. Create GPU Instance
197
+ - Create instance in same datacenter as SFS
198
+ - Share the SFS with the instance
199
+ - Get SSH command from dashboard
200
+
201
+ ### 3. Setup with pi
202
+ ```bash
203
+ # Get mount command from DataCrunch dashboard
204
+ pi pods setup dc1 "ssh root@instance.datacrunch.io" \
205
+ --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
183
206
 
184
- # Run multiple small models
185
- pi start Qwen/Qwen2.5-Coder-1.5B --name coder1 --memory 15%
186
- pi start microsoft/Phi-3-mini-128k-instruct --name phi3 --memory 15%
207
+ # Models automatically stored in /mnt/hf-models (extracted from mount command)
187
208
  ```
188
209
 
189
- ## Understanding Context and Memory
210
+ ### 4. Benefits
211
+ - Models persist across instance restarts
212
+ - Share models between multiple instances in same datacenter
213
+ - Download once, use everywhere
214
+ - Pay only for storage, not compute time during downloads
190
215
 
191
- ### Context Window vs Output Tokens
192
- Models are loaded with their default context length. You can use the `context` parameter to specify a lower or higher context length. The `context` parameter sets the **total** token budget for input + output combined:
193
- - Starting a model with `context=8k` means 8,192 tokens total
194
- - If your prompt uses 6,000 tokens, you have 2,192 tokens left for the response
195
- - Each OpenAI API request to the model can specify `max_output_tokens` to control output length within this budget
216
+ ## RunPod Setup
196
217
 
197
- Example:
198
- ```bash
199
- # Start model with 32k total context
200
- pi start meta-llama/Llama-3.1-8B --name llama --context 32k --memory 50%
218
+ RunPod offers good persistent storage with network volumes:
201
219
 
202
- # When calling the API, you control output length per request:
203
- # - Send 20k token prompt
204
- # - Request max_tokens=4000
205
- # - Total = 24k (fits within 32k context)
206
- ```
220
+ ### 1. Create Network Volume (optional)
221
+ - Go to RunPod dashboard → Storage → Create Network Volume
222
+ - Choose size and region
207
223
 
208
- ### GPU Memory and Concurrency
209
- vLLM pre-allocates GPU memory controlled by `gpu_fraction`. This matters for coding agents that spawn sub-agents, as each connection needs memory.
224
+ ### 2. Create GPU Pod
225
+ - Select "Network Volume" during pod creation (if using)
226
+ - Attach your volume to `/runpod-volume`
227
+ - Get SSH command from pod details
210
228
 
211
- Example: On an A100 80GB with a 7B model (FP16, ~14GB weights):
212
- - `gpu_fraction=0.3` (24GB): ~10GB for KV cache → ~30-50 concurrent requests
213
- - `gpu_fraction=0.5` (40GB): ~26GB for KV cache → ~50-80 concurrent requests
214
- - `gpu_fraction=0.9` (72GB): ~58GB for KV cache → ~100+ concurrent requests
229
+ ### 3. Setup with pi
230
+ ```bash
231
+ # With network volume
232
+ pi pods setup runpod "ssh root@pod.runpod.io" --models-path /runpod-volume
233
+
234
+ # Or use workspace (persists with pod but not shareable)
235
+ pi pods setup runpod "ssh root@pod.runpod.io" --models-path /workspace
236
+ ```
215
237
 
216
- Models load in their native precision from HuggingFace (usually FP16/BF16). Check the model card's "Files and versions" tab - look for file sizes: 7B models are ~14GB, 13B are ~26GB, 70B are ~140GB. Quantized models (AWQ, GPTQ) in the name use less memory but may have quality trade-offs.
217
238
 
218
239
  ## Multi-GPU Support
219
240
 
220
- For pods with multiple GPUs, the tool automatically manages GPU assignment:
241
+ ### Automatic GPU Assignment
242
+ When running multiple models, `pi` automatically assigns them to different GPUs:
243
+ ```bash
244
+ pi start model1 --name m1 # Auto-assigns to GPU 0
245
+ pi start model2 --name m2 # Auto-assigns to GPU 1
246
+ pi start model3 --name m3 # Auto-assigns to GPU 2
247
+ ```
221
248
 
222
- ### Automatic GPU assignment for multiple models
249
+ ### Specify GPU Count for Predefined Models
250
+ For predefined models with multiple configurations, use `--gpus` to control GPU usage:
223
251
  ```bash
224
- # Each model automatically uses the next available GPU
225
- pi start microsoft/Phi-3-mini-128k-instruct --memory 20% # Auto-assigns to GPU 0
226
- pi start Qwen/Qwen2.5-7B-Instruct --memory 20% # Auto-assigns to GPU 1
227
- pi start meta-llama/Llama-3.1-8B --memory 20% # Auto-assigns to GPU 2
252
+ # Run Qwen on 1 GPU instead of all available
253
+ pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen --gpus 1
228
254
 
229
- # Check which GPU each model is using
230
- pi list
255
+ # Run GLM-4.5 on 8 GPUs (if it has an 8-GPU config)
256
+ pi start zai-org/GLM-4.5 --name glm --gpus 8
231
257
  ```
232
258
 
233
- ## Qwen on a single H200
259
+ If the model doesn't have a configuration for the requested GPU count, you'll see available options.
260
+
261
+ ### Tensor Parallelism for Large Models
262
+ For models that don't fit on a single GPU:
234
263
  ```bash
235
- pi start Qwen/Qwen3-Coder-30B-A3B-Instruct qwen3-30b
264
+ # Use all available GPUs
265
+ pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --vllm \
266
+ --tensor-parallel-size 4
267
+
268
+ # Specific GPU count
269
+ pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen480 --vllm \
270
+ --data-parallel-size 8 --enable-expert-parallel
236
271
  ```
237
272
 
238
- ### Run large models across all GPUs
239
- ```bash
240
- # Use --all-gpus for tensor parallelism across all available GPUs
241
- pi start meta-llama/Llama-3.1-70B-Instruct --all-gpus
242
- pi start Qwen/Qwen2.5-72B-Instruct --all-gpus --context 64k
273
+ ## API Integration
274
+
275
+ All models expose OpenAI-compatible endpoints:
276
+
277
+ ```python
278
+ from openai import OpenAI
279
+
280
+ client = OpenAI(
281
+ base_url="http://your-pod-ip:8001/v1",
282
+ api_key="your-pi-api-key"
283
+ )
284
+
285
+ # Chat completion with tool calling
286
+ response = client.chat.completions.create(
287
+ model="Qwen/Qwen2.5-Coder-32B-Instruct",
288
+ messages=[
289
+ {"role": "user", "content": "Write a Python function to calculate fibonacci"}
290
+ ],
291
+ tools=[{
292
+ "type": "function",
293
+ "function": {
294
+ "name": "execute_code",
295
+ "description": "Execute Python code",
296
+ "parameters": {
297
+ "type": "object",
298
+ "properties": {
299
+ "code": {"type": "string"}
300
+ },
301
+ "required": ["code"]
302
+ }
303
+ }
304
+ }],
305
+ tool_choice="auto"
306
+ )
243
307
  ```
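+
+ The endpoints can also be exercised from the shell. A minimal sketch, assuming `curl` and `jq` are installed locally and using the placeholder address from above:
+
+ ```bash
+ # List the model id served on a given port (OpenAI-compatible /v1/models endpoint)
+ curl -s http://your-pod-ip:8001/v1/models \
+   -H "Authorization: Bearer your-pi-api-key" | jq -r '.data[].id'
+
+ # Streaming chat completion from the same endpoint
+ curl -N http://your-pod-ip:8001/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -H "Authorization: Bearer your-pi-api-key" \
+   -d '{"model": "Qwen/Qwen2.5-Coder-32B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "stream": true}'
+ ```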
244
308
 
245
- ### Advanced: Custom vLLM arguments
309
+ ## Standalone Agent CLI
310
+
311
+ `pi` includes a standalone OpenAI-compatible agent that can work with any API:
312
+
246
313
  ```bash
247
- # Pass custom arguments directly to vLLM with --vllm-args
248
- # Everything after --vllm-args is passed to vLLM unchanged
314
+ # Install globally to get pi-agent command
315
+ npm install -g @mariozechner/pi
249
316
 
250
- # Qwen3-Coder 480B on 8xH200 with expert parallelism
251
- pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-coder --vllm-args \
252
- --data-parallel-size 8 --enable-expert-parallel \
253
- --tool-call-parser qwen3_coder --enable-auto-tool-choice --gpu-memory-utilization 0.95 --max-model-len 200000
317
+ # Use with OpenAI
318
+ pi-agent --api-key sk-... "What is machine learning?"
254
319
 
255
- # DeepSeek with custom quantization
256
- pi start deepseek-ai/DeepSeek-Coder-V2-Instruct --name deepseek --vllm-args \
257
- --tensor-parallel-size 4 --quantization fp8 --trust-remote-code
320
+ # Use with local vLLM
321
+ pi-agent --base-url http://localhost:8000/v1 \
322
+ --model meta-llama/Llama-3.1-8B-Instruct \
323
+ --api-key dummy \
324
+ "Explain quantum computing"
258
325
 
259
- # Mixtral with pipeline parallelism
260
- pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm-args \
261
- --tensor-parallel-size 8 --pipeline-parallel-size 2
326
+ # Interactive mode
327
+ pi-agent -i
328
+
329
+ # Continue previous session
330
+ pi-agent --continue "Follow up question"
331
+
332
+ # Custom system prompt
333
+ pi-agent --system-prompt "You are a Python expert" "Write a web scraper"
334
+
335
+ # Use responses API (for GPT-OSS models)
336
+ pi-agent --api responses --model openai/gpt-oss-20b "Hello"
262
337
  ```
263
338
 
264
- **Note on Special Models**: Some models require specific vLLM arguments to run properly:
265
- - **Qwen3-Coder 480B**: Requires `--enable-expert-parallel` for MoE support
266
- - **Kimi K2**: May require custom arguments - check the model's documentation
267
- - **DeepSeek V3**: Often needs `--trust-remote-code` for custom architectures
268
- - When in doubt, consult the model's HuggingFace page or documentation for recommended vLLM settings
339
+ The agent supports:
340
+ - Session persistence across conversations
341
+ - Interactive TUI mode with syntax highlighting
342
+ - File system tools (read, list, bash, glob, rg) for code navigation
343
+ - Both Chat Completions and Responses API formats
344
+ - Custom system prompts
345
+
346
+ ## Tool Calling Support
269
347
 
270
- ### Check GPU usage
348
+ `pi` automatically configures appropriate tool calling parsers for known models:
349
+
350
+ - **Qwen models**: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
351
+ - **GLM models**: `glm4_moe` parser with reasoning support
352
+ - **GPT-OSS models**: Use the `/v1/responses` endpoint, as tool calling (function calling in OpenAI parlance) is currently a [WIP with the `v1/chat/completions` endpoint](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use).
353
+ - **Custom models**: Specify with `--vllm --tool-call-parser <parser> --enable-auto-tool-choice`
354
+
355
+ To disable tool calling:
271
356
  ```bash
272
- pi ssh "nvidia-smi"
357
+ pi start model --name mymodel --vllm --disable-tool-call-parser
273
358
  ```
274
359
 
275
- ## Architecture Notes
360
+ ## Memory and Context Management
276
361
 
277
- - **Multi-Pod Support**: The tool stores multiple pod configurations in `~/.pi_config` with one active pod at a time.
278
- - **Port Allocation**: Each model runs on a separate port (8001, 8002, etc.) allowing multiple models on one GPU.
279
- - **Memory Management**: vLLM uses PagedAttention for efficient memory use with less than 4% waste.
280
- - **Model Caching**: Models are downloaded once and cached on the pod.
281
- - **Tool Parser Auto-Detection**: The tool automatically selects the appropriate tool parser based on the model:
282
- - Qwen models: `hermes` (Qwen3-Coder: `qwen3_coder` if available)
283
- - Mistral models: `mistral` with optimized chat template
284
- - Llama models: `llama3_json` or `llama4_pythonic` based on version
285
- - InternLM models: `internlm`
286
- - Phi models: Tool calling disabled by default (no compatible tokens)
287
- - Override with `--vllm-args --tool-call-parser <parser> --enable-auto-tool-choice`
362
+ ### GPU Memory Allocation
363
+ Controls how much GPU memory vLLM pre-allocates:
364
+ - `--memory 30%`: High concurrency, limited context
365
+ - `--memory 50%`: Balanced (default)
366
+ - `--memory 90%`: Maximum context, low concurrency
288
367
 
368
+ ### Context Window
369
+ Sets maximum input + output tokens:
370
+ - `--context 4k`: 4,096 tokens total
371
+ - `--context 32k`: 32,768 tokens total
372
+ - `--context 128k`: 131,072 tokens total
289
373
 
290
- ## Tool Calling (Function Calling)
374
+ Example for coding workload:
375
+ ```bash
376
+ # Large context for code analysis, moderate concurrency
377
+ pi start Qwen/Qwen2.5-Coder-32B-Instruct --name coder \
378
+ --context 64k --memory 70%
379
+ ```
291
380
 
292
- Tool calling allows LLMs to request the use of external functions/APIs, but it's a complex feature with many caveats:
381
+ **Note**: When using `--vllm`, the `--memory`, `--context`, and `--gpus` parameters are ignored. You'll see a warning if you try to use them together.
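+
+ If you need those knobs together with custom arguments, the equivalent vLLM flags can be passed directly. A sketch with illustrative values (roughly `--memory 70%` and `--context 64k`):
+
+ ```bash
+ pi start Qwen/Qwen2.5-Coder-32B-Instruct --name coder --vllm \
+   --gpu-memory-utilization 0.7 --max-model-len 65536
+ ```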
293
382
 
294
- ### The Reality of Tool Calling
383
+ ## Session Persistence
295
384
 
296
- 1. **Model Compatibility**: Not all models support tool calling, even if they claim to. Many models lack the special tokens or training needed for reliable tool parsing.
385
+ The interactive agent mode (`-i`) saves sessions for each project directory:
297
386
 
298
- 2. **Parser Mismatches**: Different models use different tool calling formats:
299
- - Hermes format (XML-like)
300
- - Mistral format (specific JSON structure)
301
- - Llama format (JSON-based or pythonic)
302
- - Custom formats for each model family
387
+ ```bash
388
+ # Start new session
389
+ pi agent qwen -i
303
390
 
304
- 3. **Common Issues**:
305
- - "Could not locate tool call start/end tokens" - Model doesn't have required special tokens
306
- - Malformed JSON/XML output - Model wasn't trained for the parser format
307
- - Tool calls when you don't want them - Model overeager to use tools
308
- - No tool calls when you need them - Model doesn't understand when to use tools
391
+ # Continue previous session (maintains chat history)
392
+ pi agent qwen -i -c
393
+ ```
309
394
 
310
- ### How We Handle It
395
+ Sessions are stored in `~/.pi/sessions/` organized by project path and include:
396
+ - Complete conversation history
397
+ - Tool call results
398
+ - Token usage statistics
311
399
 
312
- The tool automatically detects the model and tries to use an appropriate parser:
313
- - **Qwen models**: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
314
- - **Mistral models**: `mistral` parser with custom template
315
- - **Llama models**: `llama3_json` or `llama4_pythonic` based on version
316
- - **Phi models**: Tool calling disabled (no compatible tokens)
400
+ ## Architecture & Event System
317
401
 
318
- ### Your Options
402
+ The agent uses a unified event-based architecture where all interactions flow through `AgentEvent` types. This enables:
403
+ - Consistent UI rendering across console and TUI modes
404
+ - Session recording and replay
405
+ - Clean separation between API calls and UI updates
406
+ - JSON output mode for programmatic integration
319
407
 
320
- 1. **Let auto-detection handle it** (default):
321
- ```bash
322
- pi start meta-llama/Llama-3.1-8B-Instruct --name llama
323
- ```
408
+ Events are automatically converted to the appropriate API format (Chat Completions or Responses) based on the model type.
324
409
 
325
- 2. **Force a specific parser** (if you know better):
326
- ```bash
327
- pi start model/name --name mymodel --vllm-args \
328
- --tool-call-parser mistral --enable-auto-tool-choice
329
- ```
410
+ ### JSON Output Mode
330
411
 
331
- 3. **Disable tool calling entirely** (most reliable):
332
- ```bash
333
- pi start model/name --name mymodel --vllm-args \
334
- --disable-tool-call-parser
335
- ```
412
+ Use the `--json` flag to output the event stream as JSONL (JSON Lines) for programmatic consumption:
413
+ ```bash
414
+ pi-agent --api-key sk-... --json "What is 2+2?"
415
+ ```
336
416
 
337
- 4. **Handle tools in your application** (recommended for production):
338
- - Send regular prompts asking the model to output JSON
339
- - Parse the response in your code
340
- - More control, more reliable
417
+ Each line is a complete JSON object representing an event:
418
+ ```jsonl
419
+ {"type":"user_message","text":"What is 2+2?"}
420
+ {"type":"assistant_start"}
421
+ {"type":"assistant_message","text":"2 + 2 = 4"}
422
+ {"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}
423
+ ```
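+
+ A minimal sketch of consuming this stream from the shell, assuming the event shapes shown above and `jq` installed:
+
+ ```bash
+ # Print only the assistant's replies from the event stream
+ pi-agent --api-key sk-... --json "What is 2+2?" \
+   | jq -r 'select(.type == "assistant_message") | .text'
+ ```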
341
424
 
342
- ### Best Practices
425
+ ## Troubleshooting
426
+
427
+ ### OOM (Out of Memory) Errors
428
+ - Reduce `--memory` percentage
429
+ - Use smaller model or quantized version (FP8)
430
+ - Reduce `--context` size
431
+
432
+ ### Model Won't Start
433
+ ```bash
434
+ # Check GPU usage
435
+ pi ssh "nvidia-smi"
343
436
 
344
- - **Test first**: Try a simple tool call to see if it works with your model
345
- - **Have a fallback**: Be prepared for tool calling to fail
346
- - **Consider alternatives**: Sometimes a well-crafted prompt works better than tool calling
347
- - **Read the docs**: Check the model card for tool calling examples
348
- - **Monitor logs**: Check `~/.vllm_logs/` for parser errors
437
+ # Check if port is in use
438
+ pi list
439
+
440
+ # Force stop all models
441
+ pi stop
442
+ ```
349
443
 
350
- Remember: Tool calling is still an evolving feature in the LLM ecosystem. What works today might break tomorrow with a model update.
444
+ ### Tool Calling Issues
445
+ - Not all models support tool calling reliably
446
+ - Try different parser: `--vllm --tool-call-parser mistral`
447
+ - Or disable: `--vllm --disable-tool-call-parser`
351
448
 
352
- ## Monitoring Downloads
449
+ ### Access Denied for Models
450
+ Some models (Llama, Mistral) require HuggingFace access approval. Visit the model page and click "Request access".
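+
+ A quick sanity check that your token is set and valid, assuming `curl` and `jq` on your local machine (uses the public HF Hub `whoami-v2` endpoint):
+
+ ```bash
+ curl -s -H "Authorization: Bearer $HF_TOKEN" \
+   https://huggingface.co/api/whoami-v2 | jq -r .name
+ ```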
353
451
 
354
- Use `pi downloads` to check the progress of model downloads in the HuggingFace cache:
452
+ ### vLLM Build Issues
453
+ If using `--vllm nightly` fails, try:
454
+ - Use `--vllm release` for stable version
455
+ - Check CUDA compatibility with `pi ssh "nvidia-smi"`
355
456
 
457
+ ### Agent Not Finding Messages
458
+ If the agent shows its configuration instead of your message, make sure to quote messages that contain special characters:
356
459
  ```bash
357
- pi downloads # Check downloads on active pod
358
- pi downloads --live # Live monitoring (updates every 2 seconds)
359
- pi downloads --pod 8h200 # Check downloads on specific pod
360
- pi downloads --live --pod 8h200 # Live monitoring on specific pod
460
+ # Good
461
+ pi agent qwen "What is this file about?"
462
+
463
+ # Bad (shell might interpret special chars)
464
+ pi agent qwen What is this file about?
361
465
  ```
362
466
 
363
- The command shows:
364
- - Model name and current size
365
- - Download progress (files downloaded / total files)
366
- - Download status (⏬ Downloading or ⏸ Idle)
367
- - Estimated total size (if available from HuggingFace)
467
+ ## Advanced Usage
468
+
469
+ ### Working with Multiple Pods
470
+ ```bash
471
+ # Override active pod for any command
472
+ pi start model --name test --pod dev-pod
473
+ pi list --pod prod-pod
474
+ pi stop test --pod dev-pod
475
+ ```
368
476
 
369
- **Tip for large models**: When starting models like Qwen-480B that take time to download, run `pi start` in one terminal and `pi downloads --live` in another to monitor progress. This is especially helpful since the log output during downloads can be minimal.
477
+ ### Custom vLLM Arguments
478
+ ```bash
479
+ # Pass any vLLM argument after --vllm
480
+ pi start model --name custom --vllm \
481
+ --quantization awq \
482
+ --enable-prefix-caching \
483
+ --max-num-seqs 256 \
484
+ --gpu-memory-utilization 0.95
485
+ ```
370
486
 
371
- **Downloads stalled?** If downloads appear stuck (e.g., at 92%), you can safely stop and restart:
487
+ ### Monitoring
372
488
  ```bash
373
- pi stop <model-name> # Stop the current process
374
- pi downloads # Verify progress (e.g., 45/49 files)
375
- pi start <same-command> # Restart with the same command
489
+ # Watch GPU utilization
490
+ pi ssh "watch -n 1 nvidia-smi"
491
+
492
+ # Check model downloads
493
+ pi ssh "du -sh ~/.cache/huggingface/hub/*"
494
+
495
+ # View all logs
496
+ pi ssh "ls -la ~/.vllm_logs/"
497
+
498
+ # Check agent session history
499
+ ls -la ~/.pi/sessions/
376
500
  ```
377
- vLLM will automatically use the already-downloaded files and continue from where it left off. This often resolves network or CDN throttling issues.
378
501
 
379
- ## Troubleshooting
502
+ ## Environment Variables
503
+
504
+ - `HF_TOKEN` - HuggingFace token for model downloads
505
+ - `PI_API_KEY` - API key for vLLM endpoints
506
+ - `PI_CONFIG_DIR` - Config directory (default: `~/.pi`)
507
+ - `OPENAI_API_KEY` - Used by `pi-agent` when no `--api-key` provided
508
+
509
+ ## License
380
510
 
381
- - **OOM Errors**: Reduce gpu_fraction or use a smaller model
382
- - **Slow Inference**: Could be too many concurrent requests, try increasing gpu_fraction
383
- - **Connection Refused**: Check pod is running and port is correct
384
- - **HF Token Issues**: Ensure HF_TOKEN is set before running setup
385
- - **Access Denied**: Some models (like Llama, Mistral) require completing an access request on HuggingFace first. Visit the model page and click "Request access"
386
- - **Tool Calling Errors**: See the Tool Calling section above - consider disabling it or using a different model
387
- - **Model Won't Stop**: If `pi stop` fails, force kill all Python processes and verify GPU is free:
388
- ```bash
389
- pi ssh "killall -9 python3"
390
- pi ssh "nvidia-smi" # Should show no processes using GPU
391
- ```
392
- - **Model Deployment Fails**: Pi currently does not check GPU memory utilization before starting models. If deploying a model fails:
393
- 1. Check if GPUs are full with other models: `pi ssh "nvidia-smi"`
394
- 2. If memory is insufficient, make room by stopping running models: `pi stop <model_name>`
395
- 3. If the error persists with sufficient memory, copy the error output and feed it to an LLM for troubleshooting assistance
396
-
397
- ## Timing notes
398
- - 8x B200 on DataCrunch, Spot instance
399
- - pi setup
400
- - 1:27 min
401
- - pi start Qwen/Qwen3-Coder-30B-A3B-Instruct
402
- - (cold start incl. HF download, kernel warmup) 7:32m
403
- - (warm start, HF model already in cache) 1:02m
404
-
405
- - 8x H200 on DataCrunch, Spot instance
406
- - pi setup
407
- -2:04m
408
- - pi start Qwen/Qwen3-Coder-30B-A3B-Instruct
409
- - (cold start incl. HF download, kernel warmup) 9:30m
410
- - (warm start, HF model already in cache) 1:14m
411
- - pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 ...
412
- - (cold start incl. HF download, kernel warmup)
413
- - (warm start, HF model already in cache)
511
+ MIT