@hyperspaceng/neural-pods 0.60.0

package/README.md ADDED
@@ -0,0 +1,511 @@
1
+ # pi
2
+
3
+ Deploy and manage LLMs on GPU pods with automatic vLLM configuration for agentic workloads.
4
+
5
+ ## Installation
6
+
7
+ ```bash
8
+ npm install -g @mariozechner/pi
9
+ ```
10
+
11
+ ## What is pi?
12
+
13
+ `pi` simplifies running large language models on remote GPU pods. It automatically:
14
+ - Sets up vLLM on fresh Ubuntu pods
15
+ - Configures tool calling for agentic models (Qwen, GPT-OSS, GLM, etc.)
16
+ - Manages multiple models on the same pod with automatic GPU allocation
17
+ - Provides OpenAI-compatible API endpoints for each model
18
+ - Includes an interactive agent with file system tools for testing
19
+
20
+ ## Quick Start
21
+
22
+ ```bash
23
+ # Set required environment variables
24
+ export HF_TOKEN=your_huggingface_token # Get from https://huggingface.co/settings/tokens
25
+ export PI_API_KEY=your_api_key # Any string you want for API authentication
26
+
27
+ # Setup a DataCrunch pod with NFS storage (models path auto-extracted)
28
+ pi pods setup dc1 "ssh root@1.2.3.4" \
29
+ --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
30
+
31
+ # Start a model (automatic configuration for known models)
32
+ pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen
33
+
34
+ # Send a single message to the model
35
+ pi agent qwen "What is the Fibonacci sequence?"
36
+
37
+ # Interactive chat mode with file system tools
38
+ pi agent qwen -i
39
+
40
+ # Use with any OpenAI-compatible client
41
+ export OPENAI_BASE_URL='http://1.2.3.4:8001/v1'
42
+ export OPENAI_API_KEY=$PI_API_KEY
43
+ ```
44
+
45
+ ## Prerequisites
46
+
47
+ - Node.js 20+
48
+ - HuggingFace token (for model downloads)
49
+ - GPU pod with:
50
+ - Ubuntu 22.04 or 24.04
51
+ - SSH root access
52
+ - NVIDIA drivers installed
53
+ - Persistent storage for models
54
+
55
+ ## Supported Providers
56
+
57
+ ### Primary Support
58
+
59
+ **DataCrunch** - Best for shared model storage
60
+ - NFS volumes sharable across multiple pods in same region
61
+ - Models download once, use everywhere
62
+ - Ideal for teams or multiple experiments
63
+
64
+ **RunPod** - Good persistent storage
65
+ - Network volumes persist independently
66
+ - Cannot be shared between running pods simultaneously
67
+ - Good for single-pod workflows
68
+
69
+ ### Also Works With
70
+ - Vast.ai (volumes locked to specific machine)
71
+ - Prime Intellect (no persistent storage)
72
+ - AWS EC2 (with EFS setup)
73
+ - Any Ubuntu machine with NVIDIA GPUs, CUDA driver, and SSH
74
+
75
+ ## Commands
76
+
77
+ ### Pod Management
78
+
79
+ ```bash
80
+ pi pods setup <name> "<ssh>" [options] # Setup new pod
81
+ --mount "<mount_command>" # Run mount command during setup
82
+ --models-path <path> # Override extracted path (optional)
83
+ --vllm release|nightly|gpt-oss # vLLM version (default: release)
84
+
85
+ pi pods # List all configured pods
86
+ pi pods active <name> # Switch active pod
87
+ pi pods remove <name> # Remove pod from local config
88
+ pi shell [<name>] # SSH into pod
89
+ pi ssh [<name>] "<command>" # Run command on pod
90
+ ```
91
+
92
+ **Note**: When using `--mount`, the models path is automatically extracted from the mount command's target directory. You only need `--models-path` if not using `--mount` or to override the extracted path.
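As a hypothetical sketch of the extraction described in the note above, assuming the target directory is simply the last argument of the mount command (this is an illustration, not the CLI's actual implementation):

```python
import shlex

def extract_models_path(mount_command: str) -> str:
    """Return the mount target, assumed to be the last argument of the mount command."""
    tokens = shlex.split(mount_command)
    if len(tokens) < 2:
        raise ValueError("mount command too short to contain a target directory")
    return tokens[-1]

# Example from the setup command above:
cmd = "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
print(extract_models_path(cmd))  # /mnt/hf-models
```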
93
+
94
+ #### vLLM Version Options
95
+
96
+ - `release` (default): Stable vLLM release, recommended for most users
97
+ - `nightly`: Latest vLLM features, needed for newest models like GLM-4.5
98
+ - `gpt-oss`: Special build for OpenAI's GPT-OSS models only
99
+
100
+ ### Model Management
101
+
102
+ ```bash
103
+ pi start <model> --name <name> [options] # Start a model
104
+ --memory <percent> # GPU memory: 30%, 50%, 90% (default: 90%)
105
+ --context <size> # Context window: 4k, 8k, 16k, 32k, 64k, 128k
106
+ --gpus <count> # Number of GPUs to use (predefined models only)
107
+ --pod <name> # Target specific pod (overrides active)
108
+ --vllm <args...> # Pass custom args directly to vLLM
109
+
110
+ pi stop [<name>] # Stop model (or all if no name given)
111
+ pi list # List running models with status
112
+ pi logs <name> # Stream model logs (tail -f)
113
+ ```
114
+
115
+ ### Agent & Chat Interface
116
+
117
+ ```bash
118
+ pi agent <name> "<message>" # Single message to model
119
+ pi agent <name> "<msg1>" "<msg2>" # Multiple messages in sequence
120
+ pi agent <name> -i # Interactive chat mode
121
+ pi agent <name> -i -c # Continue previous session
122
+
123
+ # Standalone OpenAI-compatible agent (works with any API)
124
+ pi-agent --base-url http://localhost:8000/v1 --model llama-3.1 "Hello"
125
+ pi-agent --api-key sk-... "What is 2+2?" # Uses OpenAI by default
126
+ pi-agent --json "What is 2+2?" # Output event stream as JSONL
127
+ pi-agent -i # Interactive mode
128
+ ```
129
+
130
+ The agent includes tools for file operations (read, list, bash, glob, rg) to test agentic capabilities, particularly useful for code navigation and analysis tasks.
131
+
132
+ ## Predefined Model Configurations
133
+
134
+ `pi` includes predefined configurations for popular agentic models, so you do not have to specify `--vllm` arguments manually. `pi` will also check if the model you selected can actually run on your pod with respect to the number of GPUs and available VRAM. Run `pi start` without additional arguments to see a list of predefined models that can run on the active pod.
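The feasibility check described above can be sketched roughly as follows; the `ModelConfig` shape and the VRAM figures are illustrative assumptions, not pi's actual data model:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    name: str
    gpus: int       # GPUs required by this configuration
    vram_gb: int    # VRAM required per GPU (illustrative numbers)

def runnable_configs(configs, pod_gpus: int, pod_vram_gb: int):
    """Keep only configurations the pod can satisfy (illustration only)."""
    return [c for c in configs if c.gpus <= pod_gpus and c.vram_gb <= pod_vram_gb]

configs = [
    ModelConfig("Qwen/Qwen2.5-Coder-32B-Instruct", gpus=1, vram_gb=80),
    ModelConfig("zai-org/GLM-4.5", gpus=8, vram_gb=96),
]
# A single-GPU 80 GB pod can only run the first configuration:
print([c.name for c in runnable_configs(configs, pod_gpus=1, pod_vram_gb=80)])
```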
135
+
136
+ ### Qwen Models
137
+ ```bash
138
+ # Qwen2.5-Coder-32B - Excellent coding model, fits on single H100/H200
139
+ pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen
140
+
141
+ # Qwen3-Coder-30B - Advanced reasoning with tool use
142
+ pi start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3
143
+
144
+ # Qwen3-Coder-480B - State-of-the-art on 8xH200 (data-parallel mode)
145
+ pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-480b
146
+ ```
147
+
148
+ ### GPT-OSS Models
149
+ ```bash
150
+ # Requires special vLLM build during setup
151
+ pi pods setup gpt-pod "ssh root@1.2.3.4" --models-path /workspace --vllm gpt-oss
152
+
153
+ # GPT-OSS-20B - Fits on 16GB+ VRAM
154
+ pi start openai/gpt-oss-20b --name gpt20
155
+
156
+ # GPT-OSS-120B - Needs 60GB+ VRAM
157
+ pi start openai/gpt-oss-120b --name gpt120
158
+ ```
159
+
160
+ ### GLM Models
161
+ ```bash
162
+ # GLM-4.5 - Requires 8-16 GPUs, includes thinking mode
163
+ pi start zai-org/GLM-4.5 --name glm
164
+
165
+ # GLM-4.5-Air - Smaller version, 1-2 GPUs
166
+ pi start zai-org/GLM-4.5-Air --name glm-air
167
+ ```
168
+
169
+ ### Custom Models with --vllm
170
+
171
+ For models not in the predefined list, use `--vllm` to pass arguments directly to vLLM:
172
+
173
+ ```bash
174
+ # DeepSeek with custom settings
175
+ pi start deepseek-ai/DeepSeek-V3 --name deepseek --vllm \
176
+ --tensor-parallel-size 4 --trust-remote-code
177
+
178
+ # Mistral with pipeline parallelism
179
+ pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm \
180
+ --tensor-parallel-size 8 --pipeline-parallel-size 2
181
+
182
+ # Any model with specific tool parser
183
+ pi start some/model --name mymodel --vllm \
184
+ --tool-call-parser hermes --enable-auto-tool-choice
185
+ ```
186
+
187
+ ## DataCrunch Setup
188
+
189
+ DataCrunch offers the best experience with shared NFS storage across pods:
190
+
191
+ ### 1. Create Shared Filesystem (SFS)
192
+ - Go to DataCrunch dashboard → Storage → Create SFS
193
+ - Choose size and datacenter
194
+ - Note the mount command (e.g., `sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/hf-models-fin02-8ac1bab7 /mnt/hf-models-fin02`)
195
+
196
+ ### 2. Create GPU Instance
197
+ - Create instance in same datacenter as SFS
198
+ - Share the SFS with the instance
199
+ - Get SSH command from dashboard
200
+
201
+ ### 3. Setup with pi
202
+ ```bash
203
+ # Get mount command from DataCrunch dashboard
204
+ pi pods setup dc1 "ssh root@instance.datacrunch.io" \
205
+ --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
206
+
207
+ # Models automatically stored in /mnt/hf-models (extracted from mount command)
208
+ ```
209
+
210
+ ### 4. Benefits
211
+ - Models persist across instance restarts
212
+ - Share models between multiple instances in same datacenter
213
+ - Download once, use everywhere
214
+ - Pay only for storage, not compute time during downloads
215
+
216
+ ## RunPod Setup
217
+
218
+ RunPod offers good persistent storage with network volumes:
219
+
220
+ ### 1. Create Network Volume (optional)
221
+ - Go to RunPod dashboard → Storage → Create Network Volume
222
+ - Choose size and region
223
+
224
+ ### 2. Create GPU Pod
225
+ - Select "Network Volume" during pod creation (if using)
226
+ - Attach your volume to `/runpod-volume`
227
+ - Get SSH command from pod details
228
+
229
+ ### 3. Setup with pi
230
+ ```bash
231
+ # With network volume
232
+ pi pods setup runpod "ssh root@pod.runpod.io" --models-path /runpod-volume
233
+
234
+ # Or use workspace (persists with pod but not shareable)
235
+ pi pods setup runpod "ssh root@pod.runpod.io" --models-path /workspace
236
+ ```
237
+
238
+
239
+ ## Multi-GPU Support
240
+
241
+ ### Automatic GPU Assignment
242
+ When running multiple models, pi automatically assigns them to different GPUs:
243
+ ```bash
244
+ pi start model1 --name m1 # Auto-assigns to GPU 0
245
+ pi start model2 --name m2 # Auto-assigns to GPU 1
246
+ pi start model3 --name m3 # Auto-assigns to GPU 2
247
+ ```
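A minimal sketch of this first-free assignment behavior (an illustration of the sequence above, not pi's actual implementation):

```python
def assign_gpu(running: dict, total_gpus: int) -> int:
    """Pick the lowest-index GPU not already in use (illustrative sketch)."""
    in_use = set(running.values())
    for gpu in range(total_gpus):
        if gpu not in in_use:
            return gpu
    raise RuntimeError("no free GPU available")

running = {}
for name in ["m1", "m2", "m3"]:
    running[name] = assign_gpu(running, total_gpus=4)
print(running)  # {'m1': 0, 'm2': 1, 'm3': 2}
```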
248
+
249
+ ### Specify GPU Count for Predefined Models
250
+ For predefined models with multiple configurations, use `--gpus` to control GPU usage:
251
+ ```bash
252
+ # Run Qwen on 1 GPU instead of all available
253
+ pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen --gpus 1
254
+
255
+ # Run GLM-4.5 on 8 GPUs (if it has an 8-GPU config)
256
+ pi start zai-org/GLM-4.5 --name glm --gpus 8
257
+ ```
258
+
259
+ If the model doesn't have a configuration for the requested GPU count, you'll see available options.
260
+
261
+ ### Tensor Parallelism for Large Models
262
+ For models that don't fit on a single GPU:
263
+ ```bash
264
+ # Use all available GPUs
265
+ pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --vllm \
266
+ --tensor-parallel-size 4
267
+
268
+ # Specific GPU count
269
+ pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen480 --vllm \
270
+ --data-parallel-size 8 --enable-expert-parallel
271
+ ```
272
+
273
+ ## API Integration
274
+
275
+ All models expose OpenAI-compatible endpoints:
276
+
277
+ ```python
278
+ from openai import OpenAI
279
+
280
+ client = OpenAI(
281
+ base_url="http://your-pod-ip:8001/v1",
282
+ api_key="your-pi-api-key"
283
+ )
284
+
285
+ # Chat completion with tool calling
286
+ response = client.chat.completions.create(
287
+ model="Qwen/Qwen2.5-Coder-32B-Instruct",
288
+ messages=[
289
+ {"role": "user", "content": "Write a Python function to calculate fibonacci"}
290
+ ],
291
+ tools=[{
292
+ "type": "function",
293
+ "function": {
294
+ "name": "execute_code",
295
+ "description": "Execute Python code",
296
+ "parameters": {
297
+ "type": "object",
298
+ "properties": {
299
+ "code": {"type": "string"}
300
+ },
301
+ "required": ["code"]
302
+ }
303
+ }
304
+ }],
305
+ tool_choice="auto"
306
+ )
307
+ ```
308
+
309
+ ## Standalone Agent CLI
310
+
311
+ `pi` includes a standalone OpenAI-compatible agent that can work with any API:
312
+
313
+ ```bash
314
+ # Install globally to get pi-agent command
315
+ npm install -g @mariozechner/pi
316
+
317
+ # Use with OpenAI
318
+ pi-agent --api-key sk-... "What is machine learning?"
319
+
320
+ # Use with local vLLM
321
+ pi-agent --base-url http://localhost:8000/v1 \
322
+ --model meta-llama/Llama-3.1-8B-Instruct \
323
+ --api-key dummy \
324
+ "Explain quantum computing"
325
+
326
+ # Interactive mode
327
+ pi-agent -i
328
+
329
+ # Continue previous session
330
+ pi-agent --continue "Follow up question"
331
+
332
+ # Custom system prompt
333
+ pi-agent --system-prompt "You are a Python expert" "Write a web scraper"
334
+
335
+ # Use responses API (for GPT-OSS models)
336
+ pi-agent --api responses --model openai/gpt-oss-20b "Hello"
337
+ ```
338
+
339
+ The agent supports:
340
+ - Session persistence across conversations
341
+ - Interactive TUI mode with syntax highlighting
342
+ - File system tools (read, list, bash, glob, rg) for code navigation
343
+ - Both Chat Completions and Responses API formats
344
+ - Custom system prompts
345
+
346
+ ## Tool Calling Support
347
+
348
+ `pi` automatically configures appropriate tool calling parsers for known models:
349
+
350
+ - **Qwen models**: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
351
+ - **GLM models**: `glm4_moe` parser with reasoning support
352
+ - **GPT-OSS models**: Uses `/v1/responses` endpoint, as tool calling (function calling in OpenAI parlance) is currently a [WIP with the `v1/chat/completions` endpoint](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use).
353
+ - **Custom models**: Specify with `--vllm --tool-call-parser <parser> --enable-auto-tool-choice`
354
+
355
+ To disable tool calling:
356
+ ```bash
357
+ pi start model --name mymodel --vllm --disable-tool-call-parser
358
+ ```
359
+
360
+ ## Memory and Context Management
361
+
362
+ ### GPU Memory Allocation
363
+ Controls how much GPU memory vLLM pre-allocates:
364
+ - `--memory 30%`: High concurrency, limited context
365
+ - `--memory 50%`: Balanced
366
+ - `--memory 90%`: Maximum context, low concurrency (default)
367
+
368
+ ### Context Window
369
+ Sets maximum input + output tokens:
370
+ - `--context 4k`: 4,096 tokens total
371
+ - `--context 32k`: 32,768 tokens total
372
+ - `--context 128k`: 131,072 tokens total
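The k-suffixed sizes map to power-of-two token counts; a tiny helper illustrating the mapping (an illustration, not the CLI's parser):

```python
def parse_context(size: str) -> int:
    """Map a k-suffixed context size to its token count (e.g. '32k' -> 32768)."""
    if not size.endswith("k"):
        raise ValueError(f"expected a k-suffixed size, got {size!r}")
    return int(size[:-1]) * 1024

for s in ["4k", "32k", "128k"]:
    print(s, "=", parse_context(s))  # 4096, 32768, 131072
```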
373
+
374
+ Example for coding workload:
375
+ ```bash
376
+ # Large context for code analysis, moderate concurrency
377
+ pi start Qwen/Qwen2.5-Coder-32B-Instruct --name coder \
378
+ --context 64k --memory 70%
379
+ ```
380
+
381
+ **Note**: When using `--vllm`, the `--memory`, `--context`, and `--gpus` parameters are ignored. You'll see a warning if you try to use them together.
382
+
383
+ ## Session Persistence
384
+
385
+ The interactive agent mode (`-i`) saves sessions for each project directory:
386
+
387
+ ```bash
388
+ # Start new session
389
+ pi agent qwen -i
390
+
391
+ # Continue previous session (maintains chat history)
392
+ pi agent qwen -i -c
393
+ ```
394
+
395
+ Sessions are stored in `~/.pi/sessions/` organized by project path and include:
396
+ - Complete conversation history
397
+ - Tool call results
398
+ - Token usage statistics
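One way the per-project organization could work is to slugify the project path into a directory name; the slug scheme below is purely hypothetical, not pi's actual layout:

```python
from pathlib import Path

def session_dir(project_path: str, base: str = "~/.pi/sessions") -> Path:
    """Derive a per-project session directory (hypothetical slug scheme)."""
    slug = project_path.strip("/").replace("/", "--") or "root"
    return Path(base).expanduser() / slug

print(session_dir("/home/user/myproject").name)  # home--user--myproject
```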
399
+
400
+ ## Architecture & Event System
401
+
402
+ The agent uses a unified event-based architecture where all interactions flow through `AgentEvent` types. This enables:
403
+ - Consistent UI rendering across console and TUI modes
404
+ - Session recording and replay
405
+ - Clean separation between API calls and UI updates
406
+ - JSON output mode for programmatic integration
407
+
408
+ Events are automatically converted to the appropriate API format (Chat Completions or Responses) based on the model type.
409
+
410
+ ### JSON Output Mode
411
+
412
+ Use the `--json` flag to output the event stream as JSONL (JSON Lines) for programmatic consumption:
413
+ ```bash
414
+ pi-agent --api-key sk-... --json "What is 2+2?"
415
+ ```
416
+
417
+ Each line is a complete JSON object representing an event:
418
+ ```jsonl
419
+ {"type":"user_message","text":"What is 2+2?"}
420
+ {"type":"assistant_start"}
421
+ {"type":"assistant_message","text":"2 + 2 = 4"}
422
+ {"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}
423
+ ```
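A small sketch of consuming this stream programmatically, using the event shapes shown above:

```python
import json

def summarize_events(jsonl: str) -> dict:
    """Collect assistant text and token usage from a pi-agent --json event stream."""
    text, usage = [], {}
    for line in jsonl.strip().splitlines():
        event = json.loads(line)
        if event["type"] == "assistant_message":
            text.append(event["text"])
        elif event["type"] == "token_usage":
            usage = event
    return {"text": "".join(text), "total_tokens": usage.get("totalTokens", 0)}

stream = '''{"type":"user_message","text":"What is 2+2?"}
{"type":"assistant_start"}
{"type":"assistant_message","text":"2 + 2 = 4"}
{"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}'''
print(summarize_events(stream))  # {'text': '2 + 2 = 4', 'total_tokens': 15}
```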
424
+
425
+ ## Troubleshooting
426
+
427
+ ### OOM (Out of Memory) Errors
428
+ - Reduce `--memory` percentage
429
+ - Use smaller model or quantized version (FP8)
430
+ - Reduce `--context` size
431
+
432
+ ### Model Won't Start
433
+ ```bash
434
+ # Check GPU usage
435
+ pi ssh "nvidia-smi"
436
+
437
+ # Check if port is in use
438
+ pi list
439
+
440
+ # Force stop all models
441
+ pi stop
442
+ ```
443
+
444
+ ### Tool Calling Issues
445
+ - Not all models support tool calling reliably
446
+ - Try different parser: `--vllm --tool-call-parser mistral`
447
+ - Or disable: `--vllm --disable-tool-call-parser`
448
+
449
+ ### Access Denied for Models
450
+ Some models (Llama, Mistral) require HuggingFace access approval. Visit the model page and click "Request access".
451
+
452
+ ### vLLM Build Issues
453
+ If using `--vllm nightly` fails, try:
454
+ - Use `--vllm release` for stable version
455
+ - Check CUDA compatibility with `pi ssh "nvidia-smi"`
456
+
457
+ ### Agent Not Finding Messages
458
+ If the agent shows configuration instead of your message, wrap messages containing special characters in quotes:
459
+ ```bash
460
+ # Good
461
+ pi agent qwen "What is this file about?"
462
+
463
+ # Bad (shell might interpret special chars)
464
+ pi agent qwen What is this file about?
465
+ ```
466
+
467
+ ## Advanced Usage
468
+
469
+ ### Working with Multiple Pods
470
+ ```bash
471
+ # Override active pod for any command
472
+ pi start model --name test --pod dev-pod
473
+ pi list --pod prod-pod
474
+ pi stop test --pod dev-pod
475
+ ```
476
+
477
+ ### Custom vLLM Arguments
478
+ ```bash
479
+ # Pass any vLLM argument after --vllm
480
+ pi start model --name custom --vllm \
481
+ --quantization awq \
482
+ --enable-prefix-caching \
483
+ --max-num-seqs 256 \
484
+ --gpu-memory-utilization 0.95
485
+ ```
486
+
487
+ ### Monitoring
488
+ ```bash
489
+ # Watch GPU utilization
490
+ pi ssh "watch -n 1 nvidia-smi"
491
+
492
+ # Check model downloads
493
+ pi ssh "du -sh ~/.cache/huggingface/hub/*"
494
+
495
+ # View all logs
496
+ pi ssh "ls -la ~/.vllm_logs/"
497
+
498
+ # Check agent session history
499
+ ls -la ~/.pi/sessions/
500
+ ```
501
+
502
+ ## Environment Variables
503
+
504
+ - `HF_TOKEN` - HuggingFace token for model downloads
505
+ - `PI_API_KEY` - API key for vLLM endpoints
506
+ - `PI_CONFIG_DIR` - Config directory (default: `~/.pi`)
507
+ - `OPENAI_API_KEY` - Used by `pi-agent` when no `--api-key` provided
508
+
509
+ ## License
510
+
511
+ MIT
package/package.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "name": "@hyperspaceng/neural-pods",
3
+ "version": "0.60.0",
4
+ "description": "CLI tool for managing vLLM deployments on GPU pods",
5
+ "type": "module",
6
+ "bin": {
7
+ "pi-pods": "dist/cli.js"
8
+ },
9
+ "scripts": {
10
+ "clean": "shx rm -rf dist",
11
+ "build": "tsgo -p tsconfig.build.json && shx chmod +x dist/cli.js && shx cp src/models.json dist/ && shx cp -r scripts dist/",
12
+ "prepublishOnly": "npm run clean && npm run build"
13
+ },
14
+ "files": [
15
+ "dist",
16
+ "scripts"
17
+ ],
18
+ "keywords": [
19
+ "llm",
20
+ "vllm",
21
+ "gpu",
22
+ "ai",
23
+ "cli"
24
+ ],
25
+ "author": "Hyperspace Technologies <hyperspace@hyperspace.ng>",
26
+ "license": "MIT",
27
+ "repository": {
28
+ "type": "git",
29
+ "url": "git+https://github.com/badlogic/pi-mono.git",
30
+ "directory": "packages/pods"
31
+ },
32
+ "engines": {
33
+ "node": ">=20.0.0"
34
+ },
35
+ "dependencies": {
36
+ "@hyperspaceng/neural-agent-core": "^0.60.0",
37
+ "chalk": "^5.5.0"
38
+ },
39
+ "devDependencies": {}
40
+ }
@@ -0,0 +1,83 @@
1
+ #!/usr/bin/env bash
2
+ # Model runner script - runs sequentially, killed by pi stop
3
+ set -euo pipefail
4
+
5
+ # These values are replaced before upload by pi CLI
6
+ MODEL_ID="{{MODEL_ID}}"
7
+ NAME="{{NAME}}"
8
+ PORT="{{PORT}}"
9
+ VLLM_ARGS="{{VLLM_ARGS}}"
10
+
11
+ # Trap to ensure cleanup on exit and kill any child processes
12
+ cleanup() {
13
+ local exit_code=$?
14
+ echo "Model runner exiting with code $exit_code"
15
+ # Kill any child processes
16
+ pkill -P $$ 2>/dev/null || true
17
+ exit $exit_code
18
+ }
19
+ trap cleanup EXIT TERM INT
20
+
21
+ # Force colored output even when not a TTY
22
+ export FORCE_COLOR=1
23
+ export PYTHONUNBUFFERED=1
24
+ export TERM=xterm-256color
25
+ export RICH_FORCE_TERMINAL=1
26
+ export CLICOLOR_FORCE=1
27
+
28
+ # Source virtual environment
29
+ source /root/venv/bin/activate
30
+
31
+ echo "========================================="
32
+ echo "Model Run: $NAME"
33
+ echo "Model ID: $MODEL_ID"
34
+ echo "Port: $PORT"
35
+ if [ -n "$VLLM_ARGS" ]; then
36
+ echo "vLLM Args: $VLLM_ARGS"
37
+ fi
38
+ echo "========================================="
39
+ echo ""
40
+
41
+ # Download model (with color progress bars)
42
+ echo "Downloading model (will skip if cached)..."
43
+ HF_HUB_ENABLE_HF_TRANSFER=1 hf download "$MODEL_ID"
44
+
45
+ if [ $? -ne 0 ]; then
46
+ echo "❌ ERROR: Failed to download model" >&2
47
+ exit 1
48
+ fi
49
+
50
+ echo ""
51
+ echo "✅ Model download complete"
52
+ echo ""
53
+
54
+ # Build vLLM command
55
+ VLLM_CMD="vllm serve '$MODEL_ID' --port $PORT --api-key '$PI_API_KEY'"
56
+ if [ -n "$VLLM_ARGS" ]; then
57
+ VLLM_CMD="$VLLM_CMD $VLLM_ARGS"
58
+ fi
59
+
60
+ echo "Starting vLLM server..."
61
+ echo "Command: $VLLM_CMD"
62
+ echo "========================================="
63
+ echo ""
64
+
65
+ # Run vLLM in background so we can monitor it
66
+ echo "Starting vLLM process..."
67
+ bash -c "$VLLM_CMD" &
68
+ VLLM_PID=$!
69
+
70
+ # Monitor the vLLM process
71
+ echo "Monitoring vLLM process (PID: $VLLM_PID)..."
72
+ wait $VLLM_PID && VLLM_EXIT_CODE=0 || VLLM_EXIT_CODE=$?
73
+ # captured via || so set -e does not abort before the failure is reported
74
+
75
+ if [ $VLLM_EXIT_CODE -ne 0 ]; then
76
+ echo "❌ ERROR: vLLM exited with code $VLLM_EXIT_CODE" >&2
77
+ # Make sure to exit the script command too
78
+ kill -TERM $$ 2>/dev/null || true
79
+ exit $VLLM_EXIT_CODE
80
+ fi
81
+
82
+ echo "✅ vLLM exited normally"
83
+ exit 0
@@ -0,0 +1,336 @@
1
+ #!/usr/bin/env bash
2
+ # GPU pod bootstrap for vLLM deployment
3
+ set -euo pipefail
4
+
5
+ # Parse arguments passed from pi CLI
6
+ MOUNT_COMMAND=""
7
+ MODELS_PATH=""
8
+ HF_TOKEN=""
9
+ PI_API_KEY=""
10
+ VLLM_VERSION="release" # Default to release
11
+
12
+ while [[ $# -gt 0 ]]; do
13
+ case $1 in
14
+ --mount)
15
+ MOUNT_COMMAND="$2"
16
+ shift 2
17
+ ;;
18
+ --models-path)
19
+ MODELS_PATH="$2"
20
+ shift 2
21
+ ;;
22
+ --hf-token)
23
+ HF_TOKEN="$2"
24
+ shift 2
25
+ ;;
26
+ --vllm-api-key)
27
+ PI_API_KEY="$2"
28
+ shift 2
29
+ ;;
30
+ --vllm)
31
+ VLLM_VERSION="$2"
32
+ shift 2
33
+ ;;
34
+ *)
35
+ echo "ERROR: Unknown option: $1" >&2
36
+ exit 1
37
+ ;;
38
+ esac
39
+ done
40
+
41
+ # Validate required parameters
42
+ if [ -z "$HF_TOKEN" ]; then
43
+ echo "ERROR: HF_TOKEN is required" >&2
44
+ exit 1
45
+ fi
46
+
47
+ if [ -z "$PI_API_KEY" ]; then
48
+ echo "ERROR: PI_API_KEY is required" >&2
49
+ exit 1
50
+ fi
51
+
52
+ if [ -z "$MODELS_PATH" ]; then
53
+ echo "ERROR: MODELS_PATH is required" >&2
54
+ exit 1
55
+ fi
56
+
57
+ echo "=== Starting pod setup ==="
58
+
59
+ # Install system dependencies
60
+ apt update -y
61
+ apt install -y python3-pip python3-venv git build-essential cmake ninja-build curl wget lsb-release htop pkg-config
62
+
63
+ # --- Install matching CUDA toolkit -------------------------------------------
64
+ echo "Checking CUDA driver version..."
65
+ DRIVER_CUDA_VERSION=$(nvidia-smi | grep "CUDA Version" | awk '{print $9}')
66
+ echo "Driver supports CUDA: $DRIVER_CUDA_VERSION"
67
+
68
+ # Check if nvcc exists and its version
69
+ if command -v nvcc &> /dev/null; then
70
+ NVCC_VERSION=$(nvcc --version | grep "release" | awk '{print $6}' | cut -d, -f1)
71
+ echo "Current nvcc version: $NVCC_VERSION"
72
+ else
73
+ NVCC_VERSION="none"
74
+ echo "nvcc not found"
75
+ fi
76
+
77
+ # Install CUDA toolkit matching driver version if needed
78
+ if [[ "$NVCC_VERSION" != "$DRIVER_CUDA_VERSION" ]]; then
79
+ echo "Installing CUDA Toolkit $DRIVER_CUDA_VERSION to match driver..."
80
+
81
+ # Detect Ubuntu version
82
+ UBUNTU_VERSION=$(lsb_release -rs)
83
+ UBUNTU_CODENAME=$(lsb_release -cs)
84
+
85
+ echo "Detected Ubuntu $UBUNTU_VERSION ($UBUNTU_CODENAME)"
86
+
87
+ # Map Ubuntu version to NVIDIA repo path
88
+ if [[ "$UBUNTU_VERSION" == "24.04" ]]; then
89
+ REPO_PATH="ubuntu2404"
90
+ elif [[ "$UBUNTU_VERSION" == "22.04" ]]; then
91
+ REPO_PATH="ubuntu2204"
92
+ elif [[ "$UBUNTU_VERSION" == "20.04" ]]; then
93
+ REPO_PATH="ubuntu2004"
94
+ else
95
+ echo "Warning: Unsupported Ubuntu version $UBUNTU_VERSION, trying ubuntu2204"
96
+ REPO_PATH="ubuntu2204"
97
+ fi
98
+
99
+ # Add NVIDIA package repositories
100
+ wget https://developer.download.nvidia.com/compute/cuda/repos/${REPO_PATH}/x86_64/cuda-keyring_1.1-1_all.deb
101
+ dpkg -i cuda-keyring_1.1-1_all.deb
102
+ rm cuda-keyring_1.1-1_all.deb
103
+ apt-get update
104
+
105
+ # Install specific CUDA toolkit version
106
+ # Convert version format (12.9 -> 12-9)
107
+ CUDA_VERSION_APT=$(echo $DRIVER_CUDA_VERSION | sed 's/\./-/')
108
+ echo "Installing cuda-toolkit-${CUDA_VERSION_APT}..."
109
+ apt-get install -y cuda-toolkit-${CUDA_VERSION_APT}
110
+
111
+ # Add CUDA to PATH
112
+ export PATH=/usr/local/cuda-${DRIVER_CUDA_VERSION}/bin:$PATH
113
+ export LD_LIBRARY_PATH=/usr/local/cuda-${DRIVER_CUDA_VERSION}/lib64:${LD_LIBRARY_PATH:-}
114
+
115
+ # Verify installation
116
+ nvcc --version
117
+ else
118
+ echo "CUDA toolkit $NVCC_VERSION matches driver version"
119
+ export PATH=/usr/local/cuda-${DRIVER_CUDA_VERSION}/bin:$PATH
120
+ export LD_LIBRARY_PATH=/usr/local/cuda-${DRIVER_CUDA_VERSION}/lib64:${LD_LIBRARY_PATH:-}
121
+ fi
122
+
123
+ # --- Install uv (fast Python package manager) --------------------------------
124
+ curl -LsSf https://astral.sh/uv/install.sh | sh
125
+ export PATH="$HOME/.local/bin:$PATH"
126
+
127
+ # --- Install Python 3.12 if not available ------------------------------------
128
+ if ! command -v python3.12 &> /dev/null; then
129
+ echo "Python 3.12 not found. Installing via uv..."
130
+ uv python install 3.12
131
+ fi
132
+
133
+ # --- Clean up existing environments and caches -------------------------------
134
+ echo "Cleaning up existing environments and caches..."
135
+
136
+ # Remove existing venv for a clean installation
137
+ VENV="$HOME/venv"
138
+ if [ -d "$VENV" ]; then
139
+ echo "Removing existing virtual environment..."
140
+ rm -rf "$VENV"
141
+ fi
142
+
143
+ # Remove uv cache to ensure fresh installs
144
+ if [ -d "$HOME/.cache/uv" ]; then
145
+ echo "Clearing uv cache..."
146
+ rm -rf "$HOME/.cache/uv"
147
+ fi
148
+
149
+ # Remove vLLM cache to avoid conflicts
150
+ if [ -d "$HOME/.cache/vllm" ]; then
151
+ echo "Clearing vLLM cache..."
152
+ rm -rf "$HOME/.cache/vllm"
153
+ fi
154
+
155
+ # --- Create and activate venv ------------------------------------------------
156
+ echo "Creating fresh virtual environment..."
157
+ uv venv --python 3.12 --seed "$VENV"
158
+ source "$VENV/bin/activate"
159
+
160
+ # --- Install PyTorch and vLLM ------------------------------------------------
161
+ echo "Installing vLLM and dependencies (version: $VLLM_VERSION)..."
162
+ case "$VLLM_VERSION" in
163
+ release)
164
+ echo "Installing vLLM release with PyTorch..."
165
+ # Install vLLM with automatic PyTorch backend selection
166
+ # vLLM will automatically install the correct PyTorch version
167
+ uv pip install "vllm>=0.10.0" --torch-backend=auto || {
168
+ echo "ERROR: Failed to install vLLM"
169
+ exit 1
170
+ }
171
+ ;;
172
+ nightly)
173
+ echo "Installing vLLM nightly with PyTorch..."
174
+ echo "This will install the latest nightly build of vLLM..."
175
+
176
+ # Install vLLM nightly with PyTorch
177
+ uv pip install -U vllm \
178
+ --torch-backend=auto \
179
+ --extra-index-url https://wheels.vllm.ai/nightly || {
180
+ echo "ERROR: Failed to install vLLM nightly"
181
+ exit 1
182
+ }
183
+
184
+ echo "vLLM nightly successfully installed!"
185
+ ;;
186
+ gpt-oss)
187
+ echo "Installing GPT-OSS special build with PyTorch nightly..."
188
+ echo "WARNING: This build is ONLY for GPT-OSS models!"
189
+ echo "Installing PyTorch nightly and cutting-edge dependencies..."
190
+
191
+ # Convert CUDA version format for PyTorch (12.4 -> cu124)
192
+ PYTORCH_CUDA="cu$(echo $DRIVER_CUDA_VERSION | sed 's/\.//')"
193
+ echo "Using PyTorch nightly with ${PYTORCH_CUDA} (driver supports ${DRIVER_CUDA_VERSION})"
194
+
195
+ # The GPT-OSS build will pull PyTorch nightly and other dependencies
196
+ # via the extra index URLs. We don't pre-install torch here to avoid conflicts.
197
+ uv pip install --pre vllm==0.10.1+gptoss \
198
+ --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
199
+ --extra-index-url https://download.pytorch.org/whl/nightly/${PYTORCH_CUDA} \
200
+ --index-strategy unsafe-best-match || {
201
+ echo "ERROR: Failed to install GPT-OSS vLLM build"
202
+ echo "This automatically installs PyTorch nightly with ${PYTORCH_CUDA}, Triton nightly, and other dependencies"
203
+ exit 1
204
+ }
205
+
206
+ # Install gpt-oss library for tool support
207
+ uv pip install gpt-oss || {
208
+ echo "WARNING: Failed to install gpt-oss library (needed for tool use)"
209
+ }
210
+ ;;
211
+ *)
212
+ echo "ERROR: Unknown vLLM version: $VLLM_VERSION"
213
+ exit 1
214
+ ;;
215
+ esac
216
+
217
+ # --- Install additional packages ---------------------------------------------
218
+ echo "Installing additional packages..."
219
+ # Note: tensorrt removed temporarily due to CUDA 13.0 compatibility issues
220
+ # TensorRT still depends on deprecated nvidia-cuda-runtime-cu13 package
221
+ uv pip install huggingface-hub psutil hf_transfer
222
+
223
+ # --- FlashInfer installation (optional, improves performance) ----------------
224
+ echo "Attempting FlashInfer installation (optional)..."
225
+ if uv pip install flashinfer-python; then
226
+ echo "FlashInfer installed successfully"
227
+ else
228
+ echo "FlashInfer not available, using Flash Attention instead"
229
+ fi
230
+
231
+ # --- Mount storage if provided -----------------------------------------------
232
+ if [ -n "$MOUNT_COMMAND" ]; then
233
+ echo "Setting up mount..."
234
+
235
+ # Create mount point directory if it doesn't exist
236
+ mkdir -p "$MODELS_PATH"
237
+
238
+ # Execute the mount command
239
+ eval "$MOUNT_COMMAND" || {
240
+ echo "WARNING: Mount command failed, continuing without mount"
241
+ }
242
+
243
+ # Verify mount succeeded (optional, may not always be a mount point)
244
+ if mountpoint -q "$MODELS_PATH" 2>/dev/null; then
245
+ echo "Storage successfully mounted at $MODELS_PATH"
246
+ else
247
+ echo "Note: $MODELS_PATH is not a mount point (might be local storage)"
248
+ fi
249
+ fi
250
+
251
+ # --- Model storage setup ------------------------------------------------------
252
+ echo ""
253
+ echo "=== Setting up model storage ==="
254
+ echo "Storage path: $MODELS_PATH"
255
+
256
+ # Check if the path exists and is writable
257
+ if [ ! -d "$MODELS_PATH" ]; then
258
+ echo "Creating model storage directory: $MODELS_PATH"
259
+ mkdir -p "$MODELS_PATH"
260
+ fi
261
+
262
+ if [ ! -w "$MODELS_PATH" ]; then
263
+ echo "ERROR: Model storage path is not writable: $MODELS_PATH"
264
+ echo "Please check permissions"
265
+ exit 1
266
+ fi
267
+
268
+ # Create the huggingface cache directory structure in the models path
269
+ mkdir -p "${MODELS_PATH}/huggingface/hub"
270
+
271
+ # Remove any existing cache directory or symlink
272
+ if [ -e ~/.cache/huggingface ] || [ -L ~/.cache/huggingface ]; then
273
+ echo "Removing existing ~/.cache/huggingface..."
274
+ rm -rf ~/.cache/huggingface 2>/dev/null || true
275
+ fi
276
+
277
+ # Create parent directory if needed
278
+ mkdir -p ~/.cache
279
+
280
+ # Create symlink from ~/.cache/huggingface to the models path
281
+ ln -s "${MODELS_PATH}/huggingface" ~/.cache/huggingface
282
+ echo "Created symlink: ~/.cache/huggingface -> ${MODELS_PATH}/huggingface"
283
+
284
+ # Verify the symlink works
285
+ if [ -d ~/.cache/huggingface/hub ]; then
286
+ echo "✓ Model storage configured successfully"
287
+
288
+ # Check available space
289
+ AVAILABLE_SPACE=$(df -h "$MODELS_PATH" | awk 'NR==2 {print $4}')
290
+ echo "Available space: $AVAILABLE_SPACE"
291
+ else
292
+ echo "ERROR: Could not verify model storage setup"
293
+ echo "The symlink was created but the target directory is not accessible"
294
+ exit 1
295
+ fi
296
+
297
+ # --- Configure environment ----------------------------------------------------
298
+ mkdir -p ~/.config/vllm
299
+ touch ~/.config/vllm/do_not_track
300
+
301
+ # Write environment to .bashrc for persistence
302
+ cat >> ~/.bashrc << EOF
303
+
304
+ # Pi vLLM environment
305
+ [ -d "\$HOME/venv" ] && source "\$HOME/venv/bin/activate"
306
+ export PATH="/usr/local/cuda-${DRIVER_CUDA_VERSION}/bin:\$HOME/.local/bin:\$PATH"
307
+ export LD_LIBRARY_PATH="/usr/local/cuda-${DRIVER_CUDA_VERSION}/lib64:\${LD_LIBRARY_PATH:-}"
308
+ export HF_TOKEN="${HF_TOKEN}"
309
+ export PI_API_KEY="${PI_API_KEY}"
310
+ export HUGGING_FACE_HUB_TOKEN="${HF_TOKEN}"
311
+ export HF_HUB_ENABLE_HF_TRANSFER=1
312
+ export VLLM_NO_USAGE_STATS=1
313
+ export VLLM_DO_NOT_TRACK=1
314
+ export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
315
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
316
+ EOF
317
+
318
+ # Create log directory for vLLM
319
+ mkdir -p ~/.vllm_logs
320
+
321
+ # --- Output GPU info for pi CLI to parse -------------------------------------
322
+ echo ""
323
+ echo "===GPU_INFO_START==="
324
+ nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader | while IFS=, read -r id name memory; do
325
+ # Trim whitespace
326
+ id=$(echo "$id" | xargs)
327
+ name=$(echo "$name" | xargs)
328
+ memory=$(echo "$memory" | xargs)
329
+ echo "{\"id\": $id, \"name\": \"$name\", \"memory\": \"$memory\"}"
330
+ done
331
+ echo "===GPU_INFO_END==="
332
+
333
+ echo ""
334
+ echo "=== Setup complete ==="
335
+ echo "Pod is ready for vLLM deployments"
336
+ echo "Models will be cached at: $MODELS_PATH"