@mariozechner/pi 0.2.4 → 0.5.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +392 -294
- package/dist/cli.d.ts +3 -0
- package/dist/cli.d.ts.map +1 -0
- package/dist/cli.js +348 -0
- package/dist/cli.js.map +1 -0
- package/dist/commands/models.d.ts +39 -0
- package/dist/commands/models.d.ts.map +1 -0
- package/dist/commands/models.js +612 -0
- package/dist/commands/models.js.map +1 -0
- package/dist/commands/pods.d.ts +21 -0
- package/dist/commands/pods.d.ts.map +1 -0
- package/dist/commands/pods.js +175 -0
- package/dist/commands/pods.js.map +1 -0
- package/dist/commands/prompt.d.ts +7 -0
- package/dist/commands/prompt.d.ts.map +1 -0
- package/dist/commands/prompt.js +55 -0
- package/dist/commands/prompt.js.map +1 -0
- package/dist/config.d.ts +11 -0
- package/dist/config.d.ts.map +1 -0
- package/dist/config.js +74 -0
- package/dist/config.js.map +1 -0
- package/dist/index.d.ts +2 -0
- package/dist/index.d.ts.map +1 -0
- package/dist/index.js +3 -0
- package/dist/index.js.map +1 -0
- package/dist/model-configs.d.ts +22 -0
- package/dist/model-configs.d.ts.map +1 -0
- package/dist/model-configs.js +75 -0
- package/dist/model-configs.js.map +1 -0
- package/dist/models.json +305 -0
- package/dist/ssh.d.ts +24 -0
- package/dist/ssh.d.ts.map +1 -0
- package/dist/ssh.js +115 -0
- package/dist/ssh.js.map +1 -0
- package/dist/types.d.ts +23 -0
- package/dist/types.d.ts.map +1 -0
- package/dist/types.js +3 -0
- package/dist/types.js.map +1 -0
- package/package.json +38 -40
- package/LICENSE +0 -21
- package/pi.js +0 -1379
- package/pod_setup.sh +0 -74
- package/vllm_manager.py +0 -662
package/README.md
CHANGED
# pi

Deploy and manage LLMs on GPU pods with automatic vLLM configuration for agentic workloads.

## Installation

```bash
npm install -g @mariozechner/pi
```

## What is pi?

`pi` simplifies running large language models on remote GPU pods. It automatically:

- Sets up vLLM on fresh Ubuntu pods
- Configures tool calling for agentic models (Qwen, GPT-OSS, GLM, etc.)
- Manages multiple models on the same pod with "smart" GPU allocation
- Provides OpenAI-compatible API endpoints for each model
- Includes an interactive agent with file system tools for testing

## Quick Start

```bash
# Set required environment variables
export HF_TOKEN=your_huggingface_token   # Get from https://huggingface.co/settings/tokens
export PI_API_KEY=your_api_key           # Any string you want for API authentication

# Setup a DataCrunch pod with NFS storage (models path auto-extracted)
pi pods setup dc1 "ssh root@1.2.3.4" \
  --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"

# Start a model (automatic configuration for known models)
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen

# Send a single message to the model
pi agent qwen "What is the Fibonacci sequence?"

# Interactive chat mode with file system tools
pi agent qwen -i

# Use with any OpenAI-compatible client
export OPENAI_BASE_URL='http://1.2.3.4:8001/v1'
export OPENAI_API_KEY=$PI_API_KEY
```
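
For a quick end-to-end check from Python, the endpoint above can be queried with the standard OpenAI client. This is a minimal sketch: it assumes `pip install openai` and reuses the `OPENAI_BASE_URL` and `OPENAI_API_KEY` exports from the Quick Start, with the pod IP and port as placeholders.

```python
# Minimal sketch: talk to the vLLM endpoint started above via the OpenAI Python client.
# Assumes `pip install openai` and the OPENAI_BASE_URL / OPENAI_API_KEY exports above.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["OPENAI_BASE_URL"],  # e.g. http://1.2.3.4:8001/v1
    api_key=os.environ["OPENAI_API_KEY"],    # the value you chose for PI_API_KEY
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",  # the model served by `pi start` above
    messages=[{"role": "user", "content": "What is the Fibonacci sequence?"}],
)
print(response.choices[0].message.content)
```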

## Prerequisites

- Node.js 18+
- HuggingFace token (for model downloads)
- GPU pod with:
  - Ubuntu 22.04 or 24.04
  - SSH root access
  - NVIDIA drivers installed
  - Persistent storage for models

## Supported Providers

### Primary Support

**DataCrunch** - Best for shared model storage
- NFS volumes sharable across multiple pods in same region
- Models download once, use everywhere
- Ideal for teams or multiple experiments

**RunPod** - Good persistent storage
- Network volumes persist independently
- Cannot share between running pods simultaneously
- Good for single-pod workflows

### Also Works With
- Vast.ai (volumes locked to specific machine)
- Prime Intellect (no persistent storage)
- AWS EC2 (with EFS setup)
- Any Ubuntu machine with NVIDIA GPUs, CUDA driver, and SSH

## Commands

### Pod Management

```bash
pi pods setup <name> "<ssh>" [options]   # Setup new pod
  --mount "<mount_command>"              # Run mount command during setup
  --models-path <path>                   # Override extracted path (optional)
  --vllm release|nightly|gpt-oss         # vLLM version (default: release)

pi pods                                  # List all configured pods
pi pods active <name>                    # Switch active pod
pi pods remove <name>                    # Remove pod from local config
pi shell [<name>]                        # SSH into pod
pi ssh [<name>] "<command>"              # Run command on pod
```

**Note**: When using `--mount`, the models path is automatically extracted from the mount command's target directory. You only need `--models-path` if not using `--mount` or to override the extracted path.

#### vLLM Version Options

- `release` (default): Stable vLLM release, recommended for most users
- `nightly`: Latest vLLM features, needed for newest models like GLM-4.5
- `gpt-oss`: Special build for OpenAI's GPT-OSS models only

### Model Management

```bash
pi start <model> --name <name> [options]  # Start a model
  --memory <percent>                      # GPU memory: 30%, 50%, 90% (default: 90%)
  --context <size>                        # Context window: 4k, 8k, 16k, 32k, 64k, 128k
  --gpus <count>                          # Number of GPUs to use (predefined models only)
  --pod <name>                            # Target specific pod (overrides active)
  --vllm <args...>                        # Pass custom args directly to vLLM

pi stop [<name>]                          # Stop model (or all if no name given)
pi list                                   # List running models with status
pi logs <name>                            # Stream model logs (tail -f)
```

### Agent & Chat Interface

```bash
pi agent <name> "<message>"        # Single message to model
pi agent <name> "<msg1>" "<msg2>"  # Multiple messages in sequence
pi agent <name> -i                 # Interactive chat mode
pi agent <name> -i -c              # Continue previous session

# Standalone OpenAI-compatible agent (works with any API)
pi-agent --base-url http://localhost:8000/v1 --model llama-3.1 "Hello"
pi-agent --api-key sk-... "What is 2+2?"   # Uses OpenAI by default
pi-agent --json "What is 2+2?"             # Output event stream as JSONL
pi-agent -i                                # Interactive mode
```

The agent includes tools for file operations (read, list, bash, glob, rg) to test agentic capabilities, particularly useful for code navigation and analysis tasks.

## Predefined Model Configurations

`pi` includes predefined configurations for popular agentic models, so you do not have to specify `--vllm` arguments manually. `pi` will also check if the model you selected can actually run on your pod with respect to the number of GPUs and available VRAM. Run `pi start` without additional arguments to see a list of predefined models that can run on the active pod.
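
As a rough intuition for that fit check (this is not pi's actual logic, just a back-of-envelope sketch assuming ~2 bytes per parameter for FP16/BF16 weights, ~1 byte for FP8, and a flat margin for KV cache and runtime overhead):

```python
# Back-of-envelope VRAM estimate -- NOT pi's actual check, just the rough arithmetic.
# FP16/BF16 weights are ~2 bytes per parameter, FP8 ~1 byte; KV cache and runtime
# overhead vary with context and batch size, so a flat 20% margin is assumed here.
def roughly_fits(params_billion: float, gpus: int, vram_per_gpu_gb: float,
                 bytes_per_param: float = 2.0, overhead: float = 0.2) -> bool:
    weights_gb = params_billion * bytes_per_param        # e.g. 32B * 2 bytes ~= 64 GB
    return weights_gb * (1 + overhead) <= gpus * vram_per_gpu_gb

print(roughly_fits(32, gpus=1, vram_per_gpu_gb=80))                       # 32B FP16 on one H100: True
print(roughly_fits(480, gpus=8, vram_per_gpu_gb=141, bytes_per_param=1))  # 480B FP8 on 8x H200: True
```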

### Qwen Models

```bash
# Qwen2.5-Coder-32B - Excellent coding model, fits on single H100/H200
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen

# Qwen3-Coder-30B - Advanced reasoning with tool use
pi start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3

# Qwen3-Coder-480B - State-of-the-art on 8xH200 (data-parallel mode)
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-480b
```

### GPT-OSS Models

```bash
# Requires special vLLM build during setup
pi pods setup gpt-pod "ssh root@1.2.3.4" --models-path /workspace --vllm gpt-oss

# GPT-OSS-20B - Fits on 16GB+ VRAM
pi start openai/gpt-oss-20b --name gpt20

# GPT-OSS-120B - Needs 60GB+ VRAM
pi start openai/gpt-oss-120b --name gpt120
```

### GLM Models

```bash
# GLM-4.5 - Requires 8-16 GPUs, includes thinking mode
pi start zai-org/GLM-4.5 --name glm

# GLM-4.5-Air - Smaller version, 1-2 GPUs
pi start zai-org/GLM-4.5-Air --name glm-air
```

### Custom Models with --vllm

For models not in the predefined list, use `--vllm` to pass arguments directly to vLLM:

```bash
# DeepSeek with custom settings
pi start deepseek-ai/DeepSeek-V3 --name deepseek --vllm \
  --tensor-parallel-size 4 --trust-remote-code

# Mistral with pipeline parallelism
pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm \
  --tensor-parallel-size 8 --pipeline-parallel-size 2

# Any model with specific tool parser
pi start some/model --name mymodel --vllm \
  --tool-call-parser hermes --enable-auto-tool-choice
```

## DataCrunch Setup

DataCrunch offers the best experience with shared NFS storage across pods:

### 1. Create Shared Filesystem (SFS)
- Go to DataCrunch dashboard → Storage → Create SFS
- Choose size and datacenter
- Note the mount command (e.g., `sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/hf-models-fin02-8ac1bab7 /mnt/hf-models-fin02`)

### 2. Create GPU Instance
- Create instance in same datacenter as SFS
- Share the SFS with the instance
- Get SSH command from dashboard

### 3. Setup with pi
```bash
# Get mount command from DataCrunch dashboard
pi pods setup dc1 "ssh root@instance.datacrunch.io" \
  --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"

# Models automatically stored in /mnt/hf-models (extracted from mount command)
```

### 4. Benefits
- Models persist across instance restarts
- Share models between multiple instances in same datacenter
- Download once, use everywhere
- Pay only for storage, not compute time during downloads

## RunPod Setup

RunPod offers good persistent storage with network volumes:

### 1. Create Network Volume (optional)
- Go to RunPod dashboard → Storage → Create Network Volume
- Choose size and region

### 2. Create GPU Pod
- Select "Network Volume" during pod creation (if using)
- Attach your volume to `/runpod-volume`
- Get SSH command from pod details

### 3. Setup with pi
```bash
# With network volume
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /runpod-volume

# Or use workspace (persists with pod but not shareable)
pi pods setup runpod "ssh root@pod.runpod.io" --models-path /workspace
```

## Multi-GPU Support

### Automatic GPU Assignment
When running multiple models, pi automatically assigns them to different GPUs:
```bash
pi start model1 --name m1  # Auto-assigns to GPU 0
pi start model2 --name m2  # Auto-assigns to GPU 1
pi start model3 --name m3  # Auto-assigns to GPU 2
```

### Specify GPU Count for Predefined Models
For predefined models with multiple configurations, use `--gpus` to control GPU usage:
```bash
# Run Qwen on 1 GPU instead of all available
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen --gpus 1

# Run GLM-4.5 on 8 GPUs (if it has an 8-GPU config)
pi start zai-org/GLM-4.5 --name glm --gpus 8
```

If the model doesn't have a configuration for the requested GPU count, you'll see available options.

### Tensor Parallelism for Large Models
For models that don't fit on a single GPU:
```bash
# Use all available GPUs
pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --vllm \
  --tensor-parallel-size 4

# Specific GPU count
pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen480 --vllm \
  --data-parallel-size 8 --enable-expert-parallel
```

## API Integration

All models expose OpenAI-compatible endpoints:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://your-pod-ip:8001/v1",
    api_key="your-pi-api-key"
)

# Chat completion with tool calling
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",
    messages=[
        {"role": "user", "content": "Write a Python function to calculate fibonacci"}
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "execute_code",
            "description": "Execute Python code",
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string"}
                },
                "required": ["code"]
            }
        }
    }],
    tool_choice="auto"
)
```
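
If the model decides to use the tool, the response carries `tool_calls` rather than plain text. Below is a sketch of the second half of that loop, continuing the example above; it assumes the `messages` list and `tools` definition are bound to variables of those names, and `run_python` is a hypothetical stand-in for however you execute the generated code.

```python
import json

# Continues the example above; `messages` and `tools` are the same objects passed to
# the first request, and `run_python` is a hypothetical helper that executes code.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)   # {"code": "..."} per the schema above
    result = run_python(args["code"])            # hypothetical executor

    messages.append(message)                     # the assistant turn containing the tool call
    messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})

    followup = client.chat.completions.create(
        model="Qwen/Qwen2.5-Coder-32B-Instruct",
        messages=messages,
        tools=tools,
    )
    print(followup.choices[0].message.content)
```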

## Standalone Agent CLI

`pi` includes a standalone OpenAI-compatible agent that can work with any API:

```bash
# Install globally to get pi-agent command
npm install -g @mariozechner/pi

# Use with OpenAI
pi-agent --api-key sk-... "What is machine learning?"

# Use with local vLLM
pi-agent --base-url http://localhost:8000/v1 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key dummy \
  "Explain quantum computing"

# Interactive mode
pi-agent -i

# Continue previous session
pi-agent --continue "Follow up question"

# Custom system prompt
pi-agent --system-prompt "You are a Python expert" "Write a web scraper"

# Use responses API (for GPT-OSS models)
pi-agent --api responses --model openai/gpt-oss-20b "Hello"
```

The agent supports:
- Session persistence across conversations
- Interactive TUI mode with syntax highlighting
- File system tools (read, list, bash, glob, rg) for code navigation
- Both Chat Completions and Responses API formats
- Custom system prompts

## Tool Calling Support

`pi` automatically configures appropriate tool calling parsers for known models:

- **Qwen models**: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
- **GLM models**: `glm4_moe` parser with reasoning support
- **GPT-OSS models**: Uses `/v1/responses` endpoint, as tool calling (function calling in OpenAI parlance) is currently a [WIP with the `v1/chat/completions` endpoint](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use).
- **Custom models**: Specify with `--vllm --tool-call-parser <parser> --enable-auto-tool-choice`

To disable tool calling:
```bash
pi start model --name mymodel --vllm --disable-tool-call-parser
```
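
For the GPT-OSS case, here is a hedged sketch of what calling the `/v1/responses` endpoint looks like from Python; it assumes an `openai` SDK version that ships the Responses API and a pod serving `openai/gpt-oss-20b` behind the same base URL and key as in the API Integration example.

```python
# Sketch: GPT-OSS models are served via /v1/responses instead of /v1/chat/completions.
# Assumes an openai SDK version that includes the Responses API.
from openai import OpenAI

client = OpenAI(base_url="http://your-pod-ip:8001/v1", api_key="your-pi-api-key")

response = client.responses.create(
    model="openai/gpt-oss-20b",
    input="Summarize what vLLM's tool-call parsers do.",
)
print(response.output_text)
```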

## Memory and Context Management

### GPU Memory Allocation
Controls how much GPU memory vLLM pre-allocates:
- `--memory 30%`: High concurrency, limited context
- `--memory 50%`: Balanced (default)
- `--memory 90%`: Maximum context, low concurrency

### Context Window
Sets maximum input + output tokens:
- `--context 4k`: 4,096 tokens total
- `--context 32k`: 32,768 tokens total
- `--context 128k`: 131,072 tokens total

Example for coding workload:
```bash
# Large context for code analysis, moderate concurrency
pi start Qwen/Qwen2.5-Coder-32B-Instruct --name coder \
  --context 64k --memory 70%
```
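
Because `--context` is a combined input + output budget, API clients have to leave room for the completion. A small illustration of that arithmetic (the prompt size here is made up; in practice it comes from your tokenizer or the API's usage statistics):

```python
# Illustrative budget arithmetic for a model started with --context 32k.
CONTEXT_BUDGET = 32 * 1024   # 32,768 tokens shared by input and output
prompt_tokens = 6_000        # example prompt size (made up)
remaining = CONTEXT_BUDGET - prompt_tokens
print(remaining)             # 26,768 tokens left for the completion

# Cap each request's output within the remaining budget, e.g.:
# client.chat.completions.create(..., max_tokens=min(remaining, 4_096))
```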

**Note**: When using `--vllm`, the `--memory`, `--context`, and `--gpus` parameters are ignored. You'll see a warning if you try to use them together.

## Session Persistence

The interactive agent mode (`-i`) saves sessions for each project directory:

```bash
# Start new session
pi agent qwen -i

# Continue previous session (maintains chat history)
pi agent qwen -i -c
```

Sessions are stored in `~/.pi/sessions/`, organized by project path, and include:
- Complete conversation history
- Tool call results
- Token usage statistics

## Architecture & Event System

The agent uses a unified event-based architecture where all interactions flow through `AgentEvent` types. This enables:
- Consistent UI rendering across console and TUI modes
- Session recording and replay
- Clean separation between API calls and UI updates
- JSON output mode for programmatic integration

Events are automatically converted to the appropriate API format (Chat Completions or Responses) based on the model type.

### JSON Output Mode

Use the `--json` flag to output the event stream as JSONL (JSON Lines) for programmatic consumption:
```bash
pi-agent --api-key sk-... --json "What is 2+2?"
```

Each line is a complete JSON object representing an event:
```jsonl
{"type":"user_message","text":"What is 2+2?"}
{"type":"assistant_start"}
{"type":"assistant_message","text":"2 + 2 = 4"}
{"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}
```
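
A sketch of consuming that stream programmatically (it assumes `pi-agent` is on the PATH and relies only on the event shapes shown in the sample above):

```python
# Run pi-agent with --json and consume the JSONL event stream line by line.
# Only the event fields shown in the sample output above are used.
import json
import subprocess

proc = subprocess.Popen(
    ["pi-agent", "--json", "What is 2+2?"],
    stdout=subprocess.PIPE,
    text=True,
)
for line in proc.stdout:
    event = json.loads(line)
    if event["type"] == "assistant_message":
        print("assistant:", event["text"])
    elif event["type"] == "token_usage":
        print("total tokens:", event["totalTokens"])
proc.wait()
```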

## Troubleshooting

### OOM (Out of Memory) Errors
- Reduce `--memory` percentage
- Use a smaller model or quantized version (FP8)
- Reduce `--context` size

### Model Won't Start
```bash
# Check GPU usage
pi ssh "nvidia-smi"

# Check if port is in use
pi list

# Force stop all models
pi stop
```

### Tool Calling Issues
- Not all models support tool calling reliably
- Try different parser: `--vllm --tool-call-parser mistral`
- Or disable: `--vllm --disable-tool-call-parser`

### Access Denied for Models
Some models (Llama, Mistral) require HuggingFace access approval. Visit the model page and click "Request access".

### vLLM Build Issues
If using `--vllm nightly` fails, try:
- Use `--vllm release` for stable version
- Check CUDA compatibility with `pi ssh "nvidia-smi"`

### Agent Not Finding Messages
If the agent prints its configuration instead of responding to your message, make sure messages containing special characters are quoted:
```bash
# Good
pi agent qwen "What is this file about?"

# Bad (shell might interpret special chars)
pi agent qwen What is this file about?
```

## Advanced Usage

### Working with Multiple Pods
```bash
# Override active pod for any command
pi start model --name test --pod dev-pod
pi list --pod prod-pod
pi stop test --pod dev-pod
```

### Custom vLLM Arguments
```bash
# Pass any vLLM argument after --vllm
pi start model --name custom --vllm \
  --quantization awq \
  --enable-prefix-caching \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.95
```

### Monitoring
```bash
# Watch GPU utilization
pi ssh "watch -n 1 nvidia-smi"

# Check model downloads
pi ssh "du -sh ~/.cache/huggingface/hub/*"

# View all logs
pi ssh "ls -la ~/.vllm_logs/"

# Check agent session history
ls -la ~/.pi/sessions/
```

## Environment Variables

- `HF_TOKEN` - HuggingFace token for model downloads
- `PI_API_KEY` - API key for vLLM endpoints
- `PI_CONFIG_DIR` - Config directory (default: `~/.pi`)
- `OPENAI_API_KEY` - Used by `pi-agent` when no `--api-key` provided

## License

MIT