@mariozechner/pi 0.1.5 → 0.5.0

package/README.md CHANGED
@@ -1,6 +1,6 @@
1
- # GPU Pod Manager
1
+ # pi
2
2
 
3
- Quickly deploy LLMs on GPU pods from [Prime Intellect](https://www.primeintellect.ai/), [Vast.ai](https://vast.ai/), [DataCrunch](datacrunch.io), AWS, etc., for local coding agents and AI assistants.
3
+ Deploy and manage LLMs on GPU pods with automatic vLLM configuration for agentic workloads.
4
4
 
5
5
  ## Installation
6
6
 
@@ -8,310 +8,504 @@ Quickly deploy LLMs on GPU pods from [Prime Intellect](https://www.primeintellec
8
8
  npm install -g @mariozechner/pi
9
9
  ```
10
10
 
11
- Or run directly with npx:
11
+ ## What is pi?
12
+
13
+ `pi` simplifies running large language models on remote GPU pods. It automatically:
14
+ - Sets up vLLM on fresh Ubuntu pods
15
+ - Configures tool calling for agentic models (Qwen, GPT-OSS, GLM, etc.)
16
+ - Manages multiple models on the same pod with "smart" GPU allocation
17
+ - Provides OpenAI-compatible API endpoints for each model
18
+ - Includes an interactive agent with file system tools for testing
19
+
20
+ ## Quick Start
21
+
12
22
  ```bash
13
- npx @mariozechner/pi
23
+ # Set required environment variables
24
+ export HF_TOKEN=your_huggingface_token # Get from https://huggingface.co/settings/tokens
25
+ export PI_API_KEY=your_api_key # Any string you want for API authentication
26
+
27
+ # Set up a DataCrunch pod with NFS storage (models path auto-extracted)
28
+ pi pods setup dc1 "ssh root@1.2.3.4" \
29
+ --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
30
+
31
+ # Start a model (automatic configuration for known models)
32
+ pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen
33
+
34
+ # Send a single message to the model
35
+ pi agent qwen "What is the Fibonacci sequence?"
36
+
37
+ # Interactive chat mode with file system tools
38
+ pi agent qwen -i
39
+
40
+ # Use with any OpenAI-compatible client
41
+ export OPENAI_BASE_URL='http://1.2.3.4:8001/v1'
42
+ export OPENAI_API_KEY=$PI_API_KEY
14
43
  ```
15
44
 
16
- ## What This Is
45
+ ## Prerequisites
17
46
 
18
- A simple CLI tool that automatically sets up and manages vLLM deployments on GPU pods. Start from a clean Ubuntu pod and have multiple models running in minutes. A GPU pod is defined as an Ubuntu machine with root access, one or more GPUs, and Cuda drivers installed. It is aimed at individuals who are limited by local hardware and want to experiment with large open weight LLMs for their coding assistent workflows.
47
+ - Node.js 18+
48
+ - HuggingFace token (for model downloads)
49
+ - GPU pod with:
50
+ - Ubuntu 22.04 or 24.04
51
+ - SSH root access
52
+ - NVIDIA drivers installed
53
+ - Persistent storage for models
19
54
 
20
- **Key Features:**
21
- - **Zero to LLM in minutes** - Automatically installs vLLM and all dependencies on clean pods
22
- - **Multi-model management** - Run multiple models concurrently on a single pod
23
- - **Smart GPU allocation** - Round robin assigns models to available GPUs on multi-GPU pods
24
- - **Tensor parallelism** - Run large models across multiple GPUs with `--all-gpus`
25
- - **OpenAI-compatible API** - Drop-in replacement for OpenAI API clients with automatic tool/function calling support
26
- - **No complex setup** - Just SSH access, no Kubernetes or Docker required
27
- - **Privacy first** - vLLM telemetry disabled by default
55
+ ## Supported Providers
28
56
 
29
- **Limitations:**
30
- - OpenAI endpoints exposed to the public internet (yolo)
31
- - Requires manual pod creation via Prime Intellect, Vast.ai, AWS, etc.
32
- - Assumes Ubuntu 22 image when creating pods
57
+ ### Primary Support
33
58
 
34
- ## What this is not
35
- - A provisioning manager for pods. You need to provision the pods on the respective provider themselves.
36
- - Super optimized LLM deployment infrastructure for absolute best performance. This is for individuals who want to quickly spin up large open weights models for local LLM loads.
59
+ **DataCrunch** - Best for shared model storage
60
+ - NFS volumes shareable across multiple pods in the same region
61
+ - Models download once, use everywhere
62
+ - Ideal for teams or multiple experiments
37
63
 
38
- ## Requirements
64
+ **RunPod** - Good persistent storage
65
+ - Network volumes persist independently
66
+ - Cannot share between running pods simultaneously
67
+ - Good for single-pod workflows
39
68
 
40
- - **Node.js 14+** - To run the CLI tool on your machine
41
- - **HuggingFace Token** - Required for downloading models (get one at https://huggingface.co/settings/tokens)
42
- - **Prime Intellect Account** - Sign up at https://app.primeintellect.ai
43
- - **GPU Pod** - At least one running pod with:
44
- - Ubuntu 22+ image (selected when creating pod)
45
- - SSH access enabled
46
- - Clean state (no manual vLLM installation needed)
47
- - **Note**: B200 GPUs require PyTorch nightly with CUDA 12.8+ (automatically installed if detected). However, vLLM may need to be built from source for full compatibility.
69
+ ### Also Works With
70
+ - Vast.ai (volumes locked to specific machine)
71
+ - Prime Intellect (no persistent storage)
72
+ - AWS EC2 (with EFS setup)
73
+ - Any Ubuntu machine with NVIDIA GPUs, CUDA driver, and SSH
48
74
 
49
- ## Quick Start
75
+ ## Commands
76
+
77
+ ### Pod Management
50
78
 
51
79
  ```bash
52
- # 1. Get a GPU pod from Prime Intellect
53
- # Visit https://app.primeintellect.ai or https://vast.ai/ or https://datacrunch.io and create a pod (use Ubuntu 22+ image)
54
- # Providers usually give you an SSH command with which to log into the machine. Copy that command.
80
+ pi pods setup <name> "<ssh>" [options] # Setup new pod
81
+ --mount "<mount_command>" # Run mount command during setup
82
+ --models-path <path> # Override extracted path (optional)
83
+ --vllm release|nightly|gpt-oss # vLLM version (default: release)
84
+
85
+ pi pods # List all configured pods
86
+ pi pods active <name> # Switch active pod
87
+ pi pods remove <name> # Remove pod from local config
88
+ pi shell [<name>] # SSH into pod
89
+ pi ssh [<name>] "<command>" # Run command on pod
90
+ ```
55
91
 
56
- # 2. On your local machine, run the following to setup the remote pod. The Hugging Face token
57
- # is required for model download.
58
- export HF_TOKEN=your_huggingface_token
59
- pi setup my-pod-name "ssh root@135.181.71.41 -p 22"
92
+ **Note**: When using `--mount`, the models path is automatically extracted from the mount command's target directory. You only need `--models-path` if not using `--mount` or to override the extracted path.
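+
+ For example, the models path can come from the mount command or be given explicitly (hosts and paths below are placeholders):
+
+ ```bash
+ # Path extracted automatically from the mount target (/mnt/hf-models)
+ pi pods setup dc1 "ssh root@1.2.3.4" \
+     --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
+
+ # No mount command, so the models path is given explicitly
+ pi pods setup vast1 "ssh root@1.2.3.4 -p 2222" --models-path /workspace
+ ```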
60
93
 
61
- # 3. Start a model (automatically manages GPU assignment)
62
- pi start microsoft/Phi-3-mini-128k-instruct --name phi3 --memory 20%
94
+ #### vLLM Version Options
63
95
 
64
- # 4. Test the model with a prompt
65
- pi prompt phi3 "What is 2+2?"
66
- # Response: The answer is 4.
96
+ - `release` (default): Stable vLLM release, recommended for most users
97
+ - `nightly`: Latest vLLM features, needed for newest models like GLM-4.5
98
+ - `gpt-oss`: Special build for OpenAI's GPT-OSS models only
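+
+ The vLLM version is chosen once, at pod setup time. For example, a pod intended for GLM-4.5 might be set up with the nightly build (host and path are placeholders):
+
+ ```bash
+ pi pods setup glm-pod "ssh root@1.2.3.4" --models-path /mnt/hf-models --vllm nightly
+ ```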
67
99
 
68
- # 5. Start another model (automatically uses next available GPU on multi-GPU pods)
69
- pi start Qwen/Qwen2.5-7B-Instruct --name qwen --memory 30%
100
+ ### Model Management
70
101
 
71
- # 6. Check running models
72
- pi list
102
+ ```bash
103
+ pi start <model> --name <name> [options] # Start a model
104
+ --memory <percent> # GPU memory: 30%, 50%, 90% (default: 90%)
105
+ --context <size> # Context window: 4k, 8k, 16k, 32k, 64k, 128k
106
+ --gpus <count> # Number of GPUs to use (predefined models only)
107
+ --pod <name> # Target specific pod (overrides active)
108
+ --vllm <args...> # Pass custom args directly to vLLM
109
+
110
+ pi stop [<name>] # Stop model (or all if no name given)
111
+ pi list # List running models with status
112
+ pi logs <name> # Stream model logs (tail -f)
113
+ ```
73
114
 
74
- # 7. Use with your coding agent
75
- export OPENAI_BASE_URL='http://135.181.71.41:8001/v1' # For first model
76
- export OPENAI_API_KEY='dummy'
115
+ ### Agent & Chat Interface
116
+
117
+ ```bash
118
+ pi agent <name> "<message>" # Single message to model
119
+ pi agent <name> "<msg1>" "<msg2>" # Multiple messages in sequence
120
+ pi agent <name> -i # Interactive chat mode
121
+ pi agent <name> -i -c # Continue previous session
122
+
123
+ # Standalone OpenAI-compatible agent (works with any API)
124
+ pi-agent --base-url http://localhost:8000/v1 --model llama-3.1 "Hello"
125
+ pi-agent --api-key sk-... "What is 2+2?" # Uses OpenAI by default
126
+ pi-agent --json "What is 2+2?" # Output event stream as JSONL
127
+ pi-agent -i # Interactive mode
77
128
  ```
78
129
 
79
- ## How It Works
130
+ The agent includes file system tools (read, list, bash, glob, rg) for testing agentic capabilities; they are particularly useful for code navigation and analysis tasks.
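+
+ A couple of example prompts that exercise these tools (run from a project directory):
+
+ ```bash
+ # Prompt the model to inspect the current project with its file tools
+ pi agent qwen "List the files in src/ and summarize what index.ts does"
+
+ # Prompt the model to search the repository with rg
+ pi agent qwen "Find every TODO comment in this repository and group them by file"
+ ```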
80
131
 
81
- 1. **Automatic Setup**: When you run `pi setup`, it:
82
- - Connects to your clean Ubuntu pod
83
- - Installs Python, CUDA drivers, and vLLM
84
- - Configures HuggingFace tokens
85
- - Sets up the model manager
132
+ ## Predefined Model Configurations
86
133
 
87
- 2. **Model Management**: Each `pi start` command:
88
- - Automatically finds an available GPU (on multi-GPU systems)
89
- - Allocates the specified memory fraction
90
- - Starts a separate vLLM instance on a unique port accessible via the OpenAI API protocol
91
- - Manages logs and process lifecycle
134
+ `pi` includes predefined configurations for popular agentic models, so you do not have to specify `--vllm` arguments manually. `pi` also checks whether the selected model can actually run on your pod, given its GPU count and available VRAM. Run `pi start` without additional arguments to list the predefined models that can run on the active pod.
92
135
 
93
- 3. **Multi-GPU Support**: On pods with multiple GPUs:
94
- - Single models automatically distribute across available GPUs
95
- - Large models can use tensor parallelism with `--all-gpus`
96
- - View GPU assignments with `pi list`
136
+ ### Qwen Models
137
+ ```bash
138
+ # Qwen2.5-Coder-32B - Excellent coding model, fits on single H100/H200
139
+ pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen
97
140
 
141
+ # Qwen3-Coder-30B - Advanced reasoning with tool use
142
+ pi start Qwen/Qwen3-Coder-30B-A3B-Instruct --name qwen3
98
143
 
99
- ## Commands
144
+ # Qwen3-Coder-480B - State-of-the-art on 8xH200 (data-parallel mode)
145
+ pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-480b
146
+ ```
100
147
 
101
- ### Pod Management
148
+ ### GPT-OSS Models
149
+ ```bash
150
+ # Requires special vLLM build during setup
151
+ pi pods setup gpt-pod "ssh root@1.2.3.4" --models-path /workspace --vllm gpt-oss
152
+
153
+ # GPT-OSS-20B - Fits on 16GB+ VRAM
154
+ pi start openai/gpt-oss-20b --name gpt20
102
155
 
103
- The tool supports managing multiple Prime Intellect pods from a single machine. Each pod is identified by a name you choose (e.g., "prod", "dev", "h200"). While all your pods continue running independently, the tool operates on one "active" pod at a time - all model commands (start, stop, list, etc.) are directed to this active pod. You can easily switch which pod is active to manage models on different machines.
156
+ # GPT-OSS-120B - Needs 60GB+ VRAM
157
+ pi start openai/gpt-oss-120b --name gpt120
158
+ ```
104
159
 
160
+ ### GLM Models
105
161
  ```bash
106
- pi setup <pod-name> "<ssh_command>" # Configure and activate a pod
107
- pi pods # List all pods (active pod marked)
108
- pi pod <pod-name> # Switch active pod
109
- pi pod remove <pod-name> # Remove pod from config
110
- pi shell # SSH into active pod
162
+ # GLM-4.5 - Requires 8-16 GPUs, includes thinking mode
163
+ pi start zai-org/GLM-4.5 --name glm
164
+
165
+ # GLM-4.5-Air - Smaller version, 1-2 GPUs
166
+ pi start zai-org/GLM-4.5-Air --name glm-air
111
167
  ```
112
168
 
113
- ### Model Management
169
+ ### Custom Models with --vllm
114
170
 
115
- Each model runs as a separate vLLM instance with its own port and GPU allocation. The tool automatically manages GPU assignment on multi-GPU systems and ensures models don't conflict. Models are accessed by their short names (either auto-generated or specified with --name).
171
+ For models not in the predefined list, use `--vllm` to pass arguments directly to vLLM:
116
172
 
117
173
  ```bash
118
- pi list # List running models on active pod
119
- pi search <query> # Search HuggingFace models
120
- pi start <model> [options] # Start a model with options
121
- --name <name> # Short alias (default: auto-generated)
122
- --context <size> # Context window: 4k, 8k, 16k, 32k (default: model default)
123
- --memory <percent> # GPU memory: 30%, 50%, 90% (default: 90%)
124
- --all-gpus # Use tensor parallelism across all GPUs
125
- --vllm-args # Pass all remaining args directly to vLLM
126
- pi stop [name] # Stop a model (or all if no name)
127
- pi logs <name> # View logs with tail -f
128
- pi prompt <name> "message" # Quick test prompt
174
+ # DeepSeek with custom settings
175
+ pi start deepseek-ai/DeepSeek-V3 --name deepseek --vllm \
176
+ --tensor-parallel-size 4 --trust-remote-code
177
+
178
+ # Mistral with pipeline parallelism
179
+ pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm \
180
+ --tensor-parallel-size 8 --pipeline-parallel-size 2
181
+
182
+ # Any model with specific tool parser
183
+ pi start some/model --name mymodel --vllm \
184
+ --tool-call-parser hermes --enable-auto-tool-choice
129
185
  ```
130
186
 
131
- ## Examples
187
+ ## DataCrunch Setup
188
+
189
+ DataCrunch offers the best experience with shared NFS storage across pods:
190
+
191
+ ### 1. Create Shared Filesystem (SFS)
192
+ - Go to DataCrunch dashboard → Storage → Create SFS
193
+ - Choose size and datacenter
194
+ - Note the mount command (e.g., `sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/hf-models-fin02-8ac1bab7 /mnt/hf-models-fin02`)
195
+
196
+ ### 2. Create GPU Instance
197
+ - Create instance in same datacenter as SFS
198
+ - Share the SFS with the instance
199
+ - Get SSH command from dashboard
132
200
 
133
- ### Search for models
201
+ ### 3. Setup with pi
134
202
  ```bash
135
- pi search codellama
136
- pi search deepseek
137
- pi search qwen
203
+ # Get mount command from DataCrunch dashboard
204
+ pi pods setup dc1 "ssh root@instance.datacrunch.io" \
205
+ --mount "sudo mount -t nfs -o nconnect=16 nfs.fin-02.datacrunch.io:/your-pseudo /mnt/hf-models"
206
+
207
+ # Models automatically stored in /mnt/hf-models (extracted from mount command)
138
208
  ```
139
209
 
140
- **Note**: vLLM does not support formats like GGUF. Read the [docs](https://docs.vllm.ai/en/latest/)
210
+ ### 4. Benefits
211
+ - Models persist across instance restarts
212
+ - Share models between multiple instances in same datacenter
213
+ - Download once, use everywhere
214
+ - Pay only for storage, not compute time during downloads
141
215
 
142
- ### A100 80GB scenarios
143
- ```bash
144
- # Small model, high concurrency (~30-50 concurrent requests)
145
- pi start microsoft/Phi-3-mini-128k-instruct --name phi3 --memory 30%
216
+ ## RunPod Setup
217
+
218
+ RunPod offers good persistent storage with network volumes:
146
219
 
147
- # Medium model, balanced (~10-20 concurrent requests)
148
- pi start meta-llama/Llama-3.1-8B-Instruct --name llama8b --memory 50%
220
+ ### 1. Create Network Volume (optional)
221
+ - Go to RunPod dashboard → Storage → Create Network Volume
222
+ - Choose size and region
149
223
 
150
- # Large model, limited concurrency (~5-10 concurrent requests)
151
- pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --memory 90%
224
+ ### 2. Create GPU Pod
225
+ - Select "Network Volume" during pod creation (if using)
226
+ - Attach your volume to `/runpod-volume`
227
+ - Get SSH command from pod details
152
228
 
153
- # Run multiple small models
154
- pi start Qwen/Qwen2.5-Coder-1.5B --name coder1 --memory 15%
155
- pi start microsoft/Phi-3-mini-128k-instruct --name phi3 --memory 15%
229
+ ### 3. Setup with pi
230
+ ```bash
231
+ # With network volume
232
+ pi pods setup runpod "ssh root@pod.runpod.io" --models-path /runpod-volume
233
+
234
+ # Or use workspace (persists with pod but not shareable)
235
+ pi pods setup runpod "ssh root@pod.runpod.io" --models-path /workspace
156
236
  ```
157
237
 
158
- ## Understanding Context and Memory
159
238
 
160
- ### Context Window vs Output Tokens
161
- Models are loaded with their default context length. You can use the `context` parameter to specify a lower or higher context length. The `context` parameter sets the **total** token budget for input + output combined:
162
- - Starting a model with `context=8k` means 8,192 tokens total
163
- - If your prompt uses 6,000 tokens, you have 2,192 tokens left for the response
164
- - Each OpenAI API request to the model can specify `max_output_tokens` to control output length within this budget
239
+ ## Multi-GPU Support
165
240
 
166
- Example:
241
+ ### Automatic GPU Assignment
242
+ When running multiple models, `pi` automatically assigns them to different GPUs:
167
243
  ```bash
168
- # Start model with 32k total context
169
- pi start meta-llama/Llama-3.1-8B --name llama --context 32k --memory 50%
244
+ pi start model1 --name m1 # Auto-assigns to GPU 0
245
+ pi start model2 --name m2 # Auto-assigns to GPU 1
246
+ pi start model3 --name m3 # Auto-assigns to GPU 2
247
+ ```
248
+
249
+ ### Specify GPU Count for Predefined Models
250
+ For predefined models with multiple configurations, use `--gpus` to control GPU usage:
251
+ ```bash
252
+ # Run Qwen on 1 GPU instead of all available
253
+ pi start Qwen/Qwen2.5-Coder-32B-Instruct --name qwen --gpus 1
170
254
 
171
- # When calling the API, you control output length per request:
172
- # - Send 20k token prompt
173
- # - Request max_tokens=4000
174
- # - Total = 24k (fits within 32k context)
255
+ # Run GLM-4.5 on 8 GPUs (if it has an 8-GPU config)
256
+ pi start zai-org/GLM-4.5 --name glm --gpus 8
175
257
  ```
176
258
 
177
- ### GPU Memory and Concurrency
178
- vLLM pre-allocates GPU memory controlled by `gpu_fraction`. This matters for coding agents that spawn sub-agents, as each connection needs memory.
259
+ If the model doesn't have a configuration for the requested GPU count, you'll see available options.
260
+
261
+ ### Tensor Parallelism for Large Models
262
+ For models that don't fit on a single GPU:
263
+ ```bash
264
+ # Shard the model across 4 GPUs with tensor parallelism
265
+ pi start meta-llama/Llama-3.1-70B-Instruct --name llama70b --vllm \
266
+ --tensor-parallel-size 4
179
267
 
180
- Example: On an A100 80GB with a 7B model (FP16, ~14GB weights):
181
- - `gpu_fraction=0.3` (24GB): ~10GB for KV cache → ~30-50 concurrent requests
182
- - `gpu_fraction=0.5` (40GB): ~26GB for KV cache → ~50-80 concurrent requests
183
- - `gpu_fraction=0.9` (72GB): ~58GB for KV cache → ~100+ concurrent requests
268
+ # Specific GPU count
269
+ pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen480 --vllm \
270
+ --data-parallel-size 8 --enable-expert-parallel
271
+ ```
184
272
 
185
- Models load in their native precision from HuggingFace (usually FP16/BF16). Check the model card's "Files and versions" tab - look for file sizes: 7B models are ~14GB, 13B are ~26GB, 70B are ~140GB. Quantized models (AWQ, GPTQ) in the name use less memory but may have quality trade-offs.
273
+ ## API Integration
274
+
275
+ All models expose OpenAI-compatible endpoints:
276
+
277
+ ```python
278
+ from openai import OpenAI
279
+
280
+ client = OpenAI(
281
+ base_url="http://your-pod-ip:8001/v1",
282
+ api_key="your-pi-api-key"
283
+ )
284
+
285
+ # Chat completion with tool calling
286
+ response = client.chat.completions.create(
287
+ model="Qwen/Qwen2.5-Coder-32B-Instruct",
288
+ messages=[
289
+ {"role": "user", "content": "Write a Python function to calculate fibonacci"}
290
+ ],
291
+ tools=[{
292
+ "type": "function",
293
+ "function": {
294
+ "name": "execute_code",
295
+ "description": "Execute Python code",
296
+ "parameters": {
297
+ "type": "object",
298
+ "properties": {
299
+ "code": {"type": "string"}
300
+ },
301
+ "required": ["code"]
302
+ }
303
+ }
304
+ }],
305
+ tool_choice="auto"
306
+ )
307
+ ```
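+
+ The same endpoint can be called from the shell. A minimal curl sketch, assuming the Qwen model from the Quick Start is running on port 8001 (the pod IP is a placeholder):
+
+ ```bash
+ curl http://your-pod-ip:8001/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -H "Authorization: Bearer $PI_API_KEY" \
+   -d '{
+     "model": "Qwen/Qwen2.5-Coder-32B-Instruct",
+     "messages": [{"role": "user", "content": "Write a Python function to calculate fibonacci"}]
+   }'
+ ```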
186
308
 
187
- ## Multi-GPU Support
309
+ ## Standalone Agent CLI
188
310
 
189
- For pods with multiple GPUs, the tool automatically manages GPU assignment:
311
+ `pi` includes a standalone OpenAI-compatible agent that can work with any API:
190
312
 
191
- ### Automatic GPU assignment for multiple models
192
313
  ```bash
193
- # Each model automatically uses the next available GPU
194
- pi start microsoft/Phi-3-mini-128k-instruct --memory 20% # Auto-assigns to GPU 0
195
- pi start Qwen/Qwen2.5-7B-Instruct --memory 20% # Auto-assigns to GPU 1
196
- pi start meta-llama/Llama-3.1-8B --memory 20% # Auto-assigns to GPU 2
314
+ # Install globally to get pi-agent command
315
+ npm install -g @mariozechner/pi
197
316
 
198
- # Check which GPU each model is using
199
- pi list
317
+ # Use with OpenAI
318
+ pi-agent --api-key sk-... "What is machine learning?"
319
+
320
+ # Use with local vLLM
321
+ pi-agent --base-url http://localhost:8000/v1 \
322
+ --model meta-llama/Llama-3.1-8B-Instruct \
323
+ --api-key dummy \
324
+ "Explain quantum computing"
325
+
326
+ # Interactive mode
327
+ pi-agent -i
328
+
329
+ # Continue previous session
330
+ pi-agent --continue "Follow up question"
331
+
332
+ # Custom system prompt
333
+ pi-agent --system-prompt "You are a Python expert" "Write a web scraper"
334
+
335
+ # Use responses API (for GPT-OSS models)
336
+ pi-agent --api responses --model openai/gpt-oss-20b "Hello"
200
337
  ```
201
338
 
202
- ### Run large models across all GPUs
339
+ The agent supports:
340
+ - Session persistence across conversations
341
+ - Interactive TUI mode with syntax highlighting
342
+ - File system tools (read, list, bash, glob, rg) for code navigation
343
+ - Both Chat Completions and Responses API formats
344
+ - Custom system prompts
345
+
346
+ ## Tool Calling Support
347
+
348
+ `pi` automatically configures appropriate tool calling parsers for known models:
349
+
350
+ - **Qwen models**: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
351
+ - **GLM models**: `glm4_moe` parser with reasoning support
352
+ - **GPT-OSS models**: Uses `/v1/responses` endpoint, as tool calling (function calling in OpenAI parlance) is currently a [WIP with the `v1/chat/completions` endpoint](https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use).
353
+ - **Custom models**: Specify with `--vllm --tool-call-parser <parser> --enable-auto-tool-choice`
354
+
355
+ To disable tool calling:
203
356
  ```bash
204
- # Use --all-gpus for tensor parallelism across all available GPUs
205
- pi start meta-llama/Llama-3.1-70B-Instruct --all-gpus
206
- pi start Qwen/Qwen2.5-72B-Instruct --all-gpus --context 64k
357
+ pi start model --name mymodel --vllm --disable-tool-call-parser
207
358
  ```
208
359
 
209
- ### Advanced: Custom vLLM arguments
360
+ ## Memory and Context Management
361
+
362
+ ### GPU Memory Allocation
363
+ Controls how much GPU memory vLLM pre-allocates:
364
+ - `--memory 30%`: High concurrency, limited context
365
+ - `--memory 50%`: Balanced
366
+ - `--memory 90%`: Maximum context, low concurrency (default)
367
+
368
+ ### Context Window
369
+ Sets maximum input + output tokens:
370
+ - `--context 4k`: 4,096 tokens total
371
+ - `--context 32k`: 32,768 tokens total
372
+ - `--context 128k`: 131,072 tokens total
373
+
374
+ Example for coding workload:
210
375
  ```bash
211
- # Pass custom arguments directly to vLLM with --vllm-args
212
- # Everything after --vllm-args is passed to vLLM unchanged
376
+ # Large context for code analysis, moderate concurrency
377
+ pi start Qwen/Qwen2.5-Coder-32B-Instruct --name coder \
378
+ --context 64k --memory 70%
379
+ ```
213
380
 
214
- # Qwen3-Coder 480B on 8xH200 with expert parallelism
215
- pi start Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 --name qwen-coder --vllm-args \
216
- --data-parallel-size 8 --enable-expert-parallel \
217
- --tool-call-parser qwen3_coder --enable-auto-tool-choice --max-model-len 200000
381
+ **Note**: When using `--vllm`, the `--memory`, `--context`, and `--gpus` parameters are ignored. You'll see a warning if you try to use them together.
218
382
 
219
- # DeepSeek with custom quantization
220
- pi start deepseek-ai/DeepSeek-Coder-V2-Instruct --name deepseek --vllm-args \
221
- --tensor-parallel-size 4 --quantization fp8 --trust-remote-code
383
+ ## Session Persistence
222
384
 
223
- # Mixtral with pipeline parallelism
224
- pi start mistralai/Mixtral-8x22B-Instruct-v0.1 --name mixtral --vllm-args \
225
- --tensor-parallel-size 8 --pipeline-parallel-size 2
385
+ The interactive agent mode (`-i`) saves sessions for each project directory:
386
+
387
+ ```bash
388
+ # Start new session
389
+ pi agent qwen -i
390
+
391
+ # Continue previous session (maintains chat history)
392
+ pi agent qwen -i -c
226
393
  ```
227
394
 
228
- ### Check GPU usage
395
+ Sessions are stored in `~/.pi/sessions/` organized by project path and include:
396
+ - Complete conversation history
397
+ - Tool call results
398
+ - Token usage statistics
399
+
400
+ ## Architecture & Event System
401
+
402
+ The agent uses a unified event-based architecture where all interactions flow through `AgentEvent` types. This enables:
403
+ - Consistent UI rendering across console and TUI modes
404
+ - Session recording and replay
405
+ - Clean separation between API calls and UI updates
406
+ - JSON output mode for programmatic integration
407
+
408
+ Events are automatically converted to the appropriate API format (Chat Completions or Responses) based on the model type.
409
+
410
+ ### JSON Output Mode
411
+
412
+ Use the `--json` flag to output the event stream as JSONL (JSON Lines) for programmatic consumption:
229
413
  ```bash
230
- pi ssh "nvidia-smi"
414
+ pi-agent --api-key sk-... --json "What is 2+2?"
231
415
  ```
232
416
 
233
- ## Architecture Notes
417
+ Each line is a complete JSON object representing an event:
418
+ ```jsonl
419
+ {"type":"user_message","text":"What is 2+2?"}
420
+ {"type":"assistant_start"}
421
+ {"type":"assistant_message","text":"2 + 2 = 4"}
422
+ {"type":"token_usage","inputTokens":10,"outputTokens":5,"totalTokens":15,"cacheReadTokens":0,"cacheWriteTokens":0}
423
+ ```
234
424
 
235
- - **Multi-Pod Support**: The tool stores multiple pod configurations in `~/.pi_config` with one active pod at a time.
236
- - **Port Allocation**: Each model runs on a separate port (8001, 8002, etc.) allowing multiple models on one GPU.
237
- - **Memory Management**: vLLM uses PagedAttention for efficient memory use with less than 4% waste.
238
- - **Model Caching**: Models are downloaded once and cached on the pod.
239
- - **Tool Parser Auto-Detection**: The tool automatically selects the appropriate tool parser based on the model:
240
- - Qwen models: `hermes` (Qwen3-Coder: `qwen3_coder` if available)
241
- - Mistral models: `mistral` with optimized chat template
242
- - Llama models: `llama3_json` or `llama4_pythonic` based on version
243
- - InternLM models: `internlm`
244
- - Phi models: Tool calling disabled by default (no compatible tokens)
245
- - Override with `--vllm-args --tool-call-parser <parser> --enable-auto-tool-choice`
425
+ ## Troubleshooting
246
426
 
427
+ ### OOM (Out of Memory) Errors
428
+ - Reduce `--memory` percentage
429
+ - Use smaller model or quantized version (FP8)
430
+ - Reduce `--context` size
247
431
 
248
- ## Tool Calling (Function Calling)
432
+ ### Model Won't Start
433
+ ```bash
434
+ # Check GPU usage
435
+ pi ssh "nvidia-smi"
249
436
 
250
- Tool calling allows LLMs to request the use of external functions/APIs, but it's a complex feature with many caveats:
437
+ # Check if port is in use
438
+ pi list
251
439
 
252
- ### The Reality of Tool Calling
440
+ # Force stop all models
441
+ pi stop
442
+ ```
253
443
 
254
- 1. **Model Compatibility**: Not all models support tool calling, even if they claim to. Many models lack the special tokens or training needed for reliable tool parsing.
444
+ ### Tool Calling Issues
445
+ - Not all models support tool calling reliably
446
+ - Try different parser: `--vllm --tool-call-parser mistral`
447
+ - Or disable: `--vllm --disable-tool-call-parser`
255
448
 
256
- 2. **Parser Mismatches**: Different models use different tool calling formats:
257
- - Hermes format (XML-like)
258
- - Mistral format (specific JSON structure)
259
- - Llama format (JSON-based or pythonic)
260
- - Custom formats for each model family
449
+ ### Access Denied for Models
450
+ Some models (Llama, Mistral) require HuggingFace access approval. Visit the model page and click "Request access".
261
451
 
262
- 3. **Common Issues**:
263
- - "Could not locate tool call start/end tokens" - Model doesn't have required special tokens
264
- - Malformed JSON/XML output - Model wasn't trained for the parser format
265
- - Tool calls when you don't want them - Model overeager to use tools
266
- - No tool calls when you need them - Model doesn't understand when to use tools
452
+ ### vLLM Build Issues
453
+ If using `--vllm nightly` fails, try:
454
+ - Use `--vllm release` for stable version
455
+ - Check CUDA compatibility with `pi ssh "nvidia-smi"`
267
456
 
268
- ### How We Handle It
457
+ ### Agent Not Finding Messages
458
+ If the agent prints its configuration instead of responding to your message, make sure messages containing special characters are quoted:
459
+ ```bash
460
+ # Good
461
+ pi agent qwen "What is this file about?"
269
462
 
270
- The tool automatically detects the model and tries to use an appropriate parser:
271
- - **Qwen models**: `hermes` parser (Qwen3-Coder uses `qwen3_coder`)
272
- - **Mistral models**: `mistral` parser with custom template
273
- - **Llama models**: `llama3_json` or `llama4_pythonic` based on version
274
- - **Phi models**: Tool calling disabled (no compatible tokens)
463
+ # Bad (shell might interpret special chars)
464
+ pi agent qwen What is this file about?
465
+ ```
275
466
 
276
- ### Your Options
467
+ ## Advanced Usage
277
468
 
278
- 1. **Let auto-detection handle it** (default):
279
- ```bash
280
- pi start meta-llama/Llama-3.1-8B-Instruct --name llama
281
- ```
469
+ ### Working with Multiple Pods
470
+ ```bash
471
+ # Override active pod for any command
472
+ pi start model --name test --pod dev-pod
473
+ pi list --pod prod-pod
474
+ pi stop test --pod dev-pod
475
+ ```
282
476
 
283
- 2. **Force a specific parser** (if you know better):
284
- ```bash
285
- pi start model/name --name mymodel --vllm-args \
286
- --tool-call-parser mistral --enable-auto-tool-choice
287
- ```
477
+ ### Custom vLLM Arguments
478
+ ```bash
479
+ # Pass any vLLM argument after --vllm
480
+ pi start model --name custom --vllm \
481
+ --quantization awq \
482
+ --enable-prefix-caching \
483
+ --max-num-seqs 256 \
484
+ --gpu-memory-utilization 0.95
485
+ ```
288
486
 
289
- 3. **Disable tool calling entirely** (most reliable):
290
- ```bash
291
- pi start model/name --name mymodel --vllm-args \
292
- --disable-tool-call-parser
293
- ```
487
+ ### Monitoring
488
+ ```bash
489
+ # Watch GPU utilization
490
+ pi ssh "watch -n 1 nvidia-smi"
294
491
 
295
- 4. **Handle tools in your application** (recommended for production):
296
- - Send regular prompts asking the model to output JSON
297
- - Parse the response in your code
298
- - More control, more reliable
492
+ # Check model downloads
493
+ pi ssh "du -sh ~/.cache/huggingface/hub/*"
299
494
 
300
- ### Best Practices
495
+ # View all logs
496
+ pi ssh "ls -la ~/.vllm_logs/"
301
497
 
302
- - **Test first**: Try a simple tool call to see if it works with your model
303
- - **Have a fallback**: Be prepared for tool calling to fail
304
- - **Consider alternatives**: Sometimes a well-crafted prompt works better than tool calling
305
- - **Read the docs**: Check the model card for tool calling examples
306
- - **Monitor logs**: Check `~/.vllm_logs/` for parser errors
498
+ # Check agent session history
499
+ ls -la ~/.pi/sessions/
500
+ ```
307
501
 
308
- Remember: Tool calling is still an evolving feature in the LLM ecosystem. What works today might break tomorrow with a model update.
502
+ ## Environment Variables
309
503
 
310
- ## Troubleshooting
504
+ - `HF_TOKEN` - HuggingFace token for model downloads
505
+ - `PI_API_KEY` - API key for vLLM endpoints
506
+ - `PI_CONFIG_DIR` - Config directory (default: `~/.pi`)
507
+ - `OPENAI_API_KEY` - Used by `pi-agent` when no `--api-key` provided
508
+
509
+ ## License
311
510
 
312
- - **OOM Errors**: Reduce gpu_fraction or use a smaller model
313
- - **Slow Inference**: Could be too many concurrent requests, try increasing gpu_fraction
314
- - **Connection Refused**: Check pod is running and port is correct
315
- - **HF Token Issues**: Ensure HF_TOKEN is set before running setup
316
- - **Access Denied**: Some models (like Llama, Mistral) require completing an access request on HuggingFace first. Visit the model page and click "Request access"
317
- - **Tool Calling Errors**: See the Tool Calling section above - consider disabling it or using a different model
511
+ MIT