npm - @ssweens/pi-vertex - Versions diffs - 1.1.4 → 1.1.7 - Mend

@ssweens/pi-vertex 1.1.4 → 1.1.7

Files changed (9)

package/CHANGELOG.md CHANGED

@@ -2,6 +2,47 @@
 All notable changes to this project will be documented in this file.
+## [1.1.7] - 2026-05-16
+### Added
+- **Regional pricing for Claude models** — non-global Vertex endpoints (us-east5, europe-west1, asia-southeast1, us/eu multi-region) carry a 10% price premium per GCP's published rates. The streaming layer now automatically selects the correct cost tier based on the resolved endpoint at call time. No config change required — if your `GOOGLE_CLOUD_LOCATION` or config resolves to any non-`global` location, cost tracking reflects the regional rate.
+  - Claude Opus 4.7/4.6/4.5: global $5.00/$25.00 → regional $5.50/$27.50
+  - Claude Sonnet 4.6/4.5: global $3.00/$15.00 → regional $3.30/$16.50
+  - Claude Haiku 4.5: global $1.00/$5.00 → regional $1.10/$5.50
+  - Claude Opus 4.1, Opus 4, Sonnet 4: uniform pricing (no regional variant on GCP)
+- **`costRegional?: ModelCost` field on `VertexModelConfig`** — optional cost tier used when the resolved GCP location is non-global. Models without this field use `cost` for all regions.
+### Fixed
+- **Grok cache read pricing** — previously 0 for both xAI models; corrected to GCP official rates:
+  - `grok-4.20-reasoning`: cacheRead $0.20/1M
+  - `grok-4.1-fast-reasoning`: cacheRead $0.05/1M
+## [1.1.6] - 2026-05-16
+### Fixed
+- **`maxTokens / 2` halving removed** — both the Anthropic and OpenAI-compat MaaS streaming paths were silently capping requests at half the model's stated `maxTokens`. Requests now use the full `maxTokens` value unless the caller explicitly overrides it.
+- **Gemini cached token double-counting** — `promptTokenCount` includes cached tokens, so input cost was inflated. Input usage is now `promptTokenCount − cachedTokenCount`, matching the actual billable amount.
+- **`sanitizeText` corrupted emoji** — the previous regex replaced all surrogate code units including valid pairs (emoji are encoded as two surrogates). Now only unpaired/lone surrogates are stripped.
+- **Gemini Pro can't use `MINIMAL` thinking level** — `ThinkingLevel.MINIMAL` is only valid for Flash models. Pro requests with `minimal`/`low` effort now floor to `ThinkingLevel.LOW`.
+- **Reasoning models always get a minimum thinking config** — previously thinking was only configured when an explicit `reasoning` effort was passed. For reasoning-capable Gemini models, a minimum config (lowest budget/level) is now always set, matching pi-mono behavior and preventing silent thought suppression.
+- **`convertToGeminiMessages`: missing tool results injected** — if an assistant turn with tool calls has no matching `toolResult` message, a synthetic error result (`"No result provided"`) is flushed before the next turn. Prevents Gemini 400 errors from dangling tool calls.
+- **`convertToGeminiMessages`: image tool results supported** — `toolResult` messages containing image content are now forwarded correctly. Gemini 3+ models receive them as `functionResponse.parts`; older models get a separate user image turn.
+- **`convertToGeminiMessages`: tighter same-model guard** — thought signature replay now also requires `api === "google-generative-ai"` so signatures from non-Gemini providers (e.g. Claude) are never incorrectly forwarded.
+- **`convertToGeminiMessages`: removed `id` from `functionCall` parts** — the `requiresToolCallId` heuristic was wrong; Gemini does not use tool call IDs in `functionCall` parts.
+### Updated
+- `claude-opus-4-6`: `maxTokens` corrected to `128000` (was `32000`)
+- `claude-sonnet-4-6`: `maxTokens` corrected to `128000` (was `64000`)
+- `convertToolsForGemini` / `convertTools`: signatures tightened from `any[]` to typed `Tool[]`
+*Bug fixes co-discovered with [lhl/pi-vertex](https://github.com/lhl/pi-vertex), a respected community fork. Credit: @lhl.*
+## [1.1.5] - 2026-05-16
+### Added
+- **xAI Grok models** (new publisher on Vertex MaaS OpenAI-compat endpoint):
+  - `grok-4.20-reasoning` — flagship model, 200K context, text+image input, reasoning+tools, $1.25/$2.50 per 1M tokens
+  - `grok-4.1-fast-reasoning` — cost-effective model, 128K context, text+image input, reasoning+tools, $0.20/$0.50 per 1M tokens
+- **Claude Opus 4.7** (`claude-opus-4-7`) — 1M context, 128K max output tokens (up from 4.6's 32K), $5.00/$25.00 per 1M, same cache pricing as Opus 4.6
+- **Gemma 4 26B A4B IT** (`gemma-4-26b-a4b-it`) — Google's MoE instruction-tuned model via MaaS, 262K context, 128K max output, text+image input, $0.15/$0.60 per 1M tokens
 ## [1.1.4] - 2026-03-30
 ### Fixed
 - Removed error message override for `400 (no body)` responses from Vertex MaaS models. The original message now passes through to `isContextOverflow()` which already handles this pattern, enabling proper auto-compact instead of showing a raw error to the user.
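
The `sanitizeText` entry under 1.1.6 is easiest to check on concrete strings. The sketch below copies both regexes from the `package/utils.ts` hunk in this diff; the name `sanitizeTextOld` and the sample strings are illustrative and not part of the package.

```typescript
// 1.1.7 side of the utils.ts hunk: strip only unpaired surrogate code units.
function sanitizeText(text: string): string {
  return text.replace(
    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
    "",
  );
}

// 1.1.4 side, kept here only for comparison (hypothetical name):
// every surrogate code unit is replaced, including valid pairs.
function sanitizeTextOld(text: string): string {
  return text.replace(/[\uD800-\uDFFF]/g, "\uFFFD");
}

// U+1F600 is encoded in UTF-16 as the valid pair D83D DE00.
const withEmoji = "ok \uD83D\uDE00";
const loneHigh = "a\uD83Db"; // high surrogate with no low surrogate after it
const loneLow = "a\uDE00b"; // low surrogate with no high surrogate before it

const keptEmoji = sanitizeText(withEmoji); // pair survives unchanged
const strippedHigh = sanitizeText(loneHigh); // "ab"
const strippedLow = sanitizeText(loneLow); // "ab"
const corrupted = sanitizeTextOld(withEmoji); // "ok \uFFFD\uFFFD"
```

The lookahead and lookbehind are what separate the two versions: a high surrogate is removed only when no low surrogate follows it, and a low surrogate only when no high surrogate precedes it.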

package/README.md CHANGED

@@ -15,18 +15,18 @@ Set your GCP project and credentials. Vertex AI models (Gemini, Claude, Llama, D
 ## Features
-- **43 models** across 4 categories:
-  - **Gemini** (8): 3.1 Pro, 3 Pro, 3 Flash, 2.5 Pro, 2.5 Flash, 2.0 Flash, and more
-  - **Claude** (12): Opus 4.6, Sonnet 4.6, 4.5, 4.1, 4, 3.7 Sonnet, 3.5 Sonnet v2, 3.5 Sonnet, 3 Haiku
+- **48 models** across 4 categories:
+  - **Gemini** (9): 3.1 Pro, 3.1 Flash-Lite, 3 Flash, 2.5 Pro, 2.5 Flash, 2.5 Flash-Lite, 2.0 Flash, 2.0 Flash-Lite
+  - **Claude** (10): Opus 4.7, Opus 4.6, Sonnet 4.6, Opus/Sonnet/Haiku 4.5, Opus 4.1, Opus 4, Sonnet 4, 3.5 Sonnet v2
   - **Llama** (3): 4 Maverick, 4 Scout, 3.3 70B
-  - **Other MaaS** (20): AI21 Jamba, Mistral, DeepSeek, Qwen, OpenAI GPT-OSS, Kimi, MiniMax, GLM
+  - **Other MaaS** (26): Grok, Gemma, Mistral, DeepSeek, Qwen, OpenAI GPT-OSS, Kimi, MiniMax, GLM
 - **Unified streaming**: Single provider, multiple model families
 - **Full tool calling support**: All models with multi-turn tool use and proper tool result handling
 - **Thinking/reasoning**: Gemini 3 thinking levels, Gemini 2.5 thinking budgets, thought signature preservation
 - **Automatic auth**: Uses Google Application Default Credentials
 - **Region awareness**: Global endpoints where supported, regional where required
-- **Pricing tracking**: Built-in cost per token for all models (including thinking tokens)
+- **Pricing tracking**: Built-in cost per token for all models (including thinking tokens and regional endpoint premiums)
 ## Installation
@@ -128,17 +128,20 @@ alias pil="GOOGLE_CLOUD_PROJECT=your-project pi --provider vertex --model llama-
 ### Claude Models
-| Model | Context | Max Tokens | Input | Reasoning | Price (in/out) | Region |
-|-------|---------|------------|-------|-----------|----------------|--------|
-| claude-opus-4-6 | 1M | 32,000 | text, image | ✅ | $5.00/$25.00 | global |
-| claude-sonnet-4-6 | 1M | 64,000 | text, image | ✅ | $3.00/$15.00 | global |
-| claude-opus-4-5 | 200K | 32,000 | text, image | ✅ | $5.00/$25.00 | global |
-| claude-sonnet-4-5 | 200K | 64,000 | text, image | ✅ | $3.00/$15.00 | global |
-| claude-haiku-4-5 | 200K | 64,000 | text, image | ✅ | $1.00/$5.00 | global |
-| claude-opus-4-1 | 200K | 32,000 | text, image | ✅ | $15.00/$75.00 | global |
-| claude-opus-4 | 200K | 32,000 | text, image | ✅ | $15.00/$75.00 | global |
-| claude-sonnet-4 | 200K | 64,000 | text, image | ✅ | $3.00/$15.00 | global |
-| claude-3-5-sonnet-v2 | 200K | 8,192 | text, image | ❌ | $3.00/$15.00 | global |
+Prices shown are for the **global** endpoint. Non-global regions (us-east5, europe-west1, asia-southeast1, us/eu multi-region) carry a 10% premium — cost tracking adjusts automatically based on your configured `GOOGLE_CLOUD_LOCATION`.
+| Model | Context | Max Tokens | Input | Reasoning | Price global (in/out) | Price regional (in/out) |
+|-------|---------|------------|-------|-----------|----------------------|------------------------|
+| claude-opus-4-7 | 1M | 128,000 | text, image | ✅ | $5.00/$25.00 | $5.50/$27.50 |
+| claude-opus-4-6 | 1M | 128,000 | text, image | ✅ | $5.00/$25.00 | $5.50/$27.50 |
+| claude-sonnet-4-6 | 1M | 128,000 | text, image | ✅ | $3.00/$15.00 | $3.30/$16.50 |
+| claude-opus-4-5 | 200K | 32,000 | text, image | ✅ | $5.00/$25.00 | $5.50/$27.50 |
+| claude-sonnet-4-5 | 200K | 64,000 | text, image | ✅ | $3.00/$15.00 | $3.30/$16.50 |
+| claude-haiku-4-5 | 200K | 64,000 | text, image | ✅ | $1.00/$5.00 | $1.10/$5.50 |
+| claude-opus-4-1 | 200K | 32,000 | text, image | ✅ | $15.00/$75.00 | (uniform) |
+| claude-opus-4 | 200K | 32,000 | text, image | ✅ | $15.00/$75.00 | (uniform) |
+| claude-sonnet-4 | 200K | 64,000 | text, image | ✅ | $3.00/$15.00 | (uniform) |
+| claude-3-5-sonnet-v2 | 200K | 8,192 | text, image | ❌ | $3.00/$15.00 | (uniform) |
 ### Llama Models
@@ -170,6 +173,9 @@ alias pil="GOOGLE_CLOUD_PROJECT=your-project pi --provider vertex --model llama-
 | minimax-m2 | 196K | minimaxai | $0.30/$1.20 | global |
 | glm-5 | 200K | zai-org | $1.00/$3.20 | global |
 | glm-4.7 | 200K | zai-org | $0.60/$2.20 | global |
+| grok-4.20-reasoning | 200K | xai | $1.25/$2.50 | global |
+| grok-4.1-fast-reasoning | 128K | xai | $0.20/$0.50 | global |
+| gemma-4-26b-a4b-it | 262K | google | $0.15/$0.60 | global |
 ## Regional Endpoints
@@ -218,6 +224,10 @@ export GOOGLE_CLOUD_LOCATION=us-central1
 - `@mariozechner/pi-ai`: Peer dependency
 - `@mariozechner/pi-coding-agent`: Peer dependency
+## Acknowledgments
+[lhl](https://github.com/lhl) maintains [lhl/pi-vertex](https://github.com/lhl/pi-vertex), an independent fork that added comprehensive unit tests and CI, and identified several important bugs. Several fixes in v1.1.6 were co-discovered through review of that work, including the `maxTokens/2` halving bug, Gemini cached-token double-counting, `sanitizeText` emoji corruption, missing tool result flushing, and image tool result forwarding. Kudos.
 ## License
 MIT

package/models/claude.ts CHANGED

@@ -2,13 +2,42 @@
  * Claude model definitions for Vertex AI
  * Source: https://cloud.google.com/vertex-ai/generative-ai/docs/partner-models/use-partner-models
  * Pricing: https://cloud.google.com/vertex-ai/generative-ai/pricing#partner-models
- * All prices per 1M tokens (global endpoint, <= 200K input tokens)
+ * All prices per 1M tokens (<=200K input tokens)
+ * `cost` = global endpoint; `costRegional` = non-global (us-east5, europe-west1,
+ * asia-southeast1, us/eu multi-region) — uniformly 10% above global.
  * Cache write prices are for 5-minute TTL
  */
 import type { VertexModelConfig } from "../types.js";
 export const CLAUDE_MODELS: VertexModelConfig[] = [
+  // Claude 4.7 series
+  {
+    id: "claude-opus-4-7",
+    name: "Claude Opus 4.7",
+    apiId: "claude-opus-4-7",
+    publisher: "anthropic",
+    endpointType: "maas",
+    contextWindow: 1000000,
+    maxTokens: 128000,
+    input: ["text", "image"],
+    reasoning: true,
+    tools: true,
+    cost: {
+      input: 5.00,
+      output: 25.00,
+      cacheRead: 0.50,
+      cacheWrite: 6.25,
+    },
+    costRegional: {
+      input: 5.50,
+      output: 27.50,
+      cacheRead: 0.55,
+      cacheWrite: 6.875,
+    },
+    region: "global",
+  },
   // Claude 4.6 series
   {
     id: "claude-opus-4-6",
@@ -17,7 +46,7 @@ export const CLAUDE_MODELS: VertexModelConfig[] = [
     publisher: "anthropic",
     endpointType: "maas",
     contextWindow: 1000000,
-    maxTokens: 32000,
+    maxTokens: 128000,
     input: ["text", "image"],
     reasoning: true,
     tools: true,
@@ -27,6 +56,12 @@ export const CLAUDE_MODELS: VertexModelConfig[] = [
       cacheRead: 0.50,
       cacheWrite: 6.25,
     },
+    costRegional: {
+      input: 5.50,
+      output: 27.50,
+      cacheRead: 0.55,
+      cacheWrite: 6.875,
+    },
     region: "global",
   },
   {
@@ -36,7 +71,7 @@ export const CLAUDE_MODELS: VertexModelConfig[] = [
     publisher: "anthropic",
     endpointType: "maas",
     contextWindow: 1000000,
-    maxTokens: 64000,
+    maxTokens: 128000,
     input: ["text", "image"],
     reasoning: true,
     tools: true,
@@ -46,6 +81,12 @@ export const CLAUDE_MODELS: VertexModelConfig[] = [
       cacheRead: 0.30,
       cacheWrite: 3.75,
     },
+    costRegional: {
+      input: 3.30,
+      output: 16.50,
+      cacheRead: 0.33,
+      cacheWrite: 4.125,
+    },
     region: "global",
   },
@@ -67,6 +108,12 @@ export const CLAUDE_MODELS: VertexModelConfig[] = [
       cacheRead: 0.50,
       cacheWrite: 6.25,
     },
+    costRegional: {
+      input: 5.50,
+      output: 27.50,
+      cacheRead: 0.55,
+      cacheWrite: 6.875,
+    },
     region: "global",
   },
   {
@@ -86,6 +133,12 @@ export const CLAUDE_MODELS: VertexModelConfig[] = [
       cacheRead: 0.30,
       cacheWrite: 3.75,
     },
+    costRegional: {
+      input: 3.30,
+      output: 16.50,
+      cacheRead: 0.33,
+      cacheWrite: 4.125,
+    },
     region: "global",
   },
   {
@@ -105,6 +158,12 @@ export const CLAUDE_MODELS: VertexModelConfig[] = [
       cacheRead: 0.10,
       cacheWrite: 1.25,
     },
+    costRegional: {
+      input: 1.10,
+      output: 5.50,
+      cacheRead: 0.11,
+      cacheWrite: 1.375,
+    },
     region: "global",
   },
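
The `costRegional` blocks above can be cross-checked against the "uniformly 10% above global" note in the file header. The following is a minimal sketch of that arithmetic, not code from the package: `withRegionalPremium` is a hypothetical helper, and the local `ModelCost` interface only mirrors the four fields visible in this diff.

```typescript
// Mirrors the cost fields shown in the hunks above (USD per 1M tokens).
interface ModelCost {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

// Hypothetical helper: apply a flat premium and round to 6 decimals
// so binary floating-point noise does not leak into the result.
function withRegionalPremium(cost: ModelCost, premium = 0.1): ModelCost {
  const bump = (usd: number): number => Math.round(usd * (1 + premium) * 1e6) / 1e6;
  return {
    input: bump(cost.input),
    output: bump(cost.output),
    cacheRead: bump(cost.cacheRead),
    cacheWrite: bump(cost.cacheWrite),
  };
}

// Global prices copied from the diff.
const opusGlobal: ModelCost = { input: 5.0, output: 25.0, cacheRead: 0.5, cacheWrite: 6.25 };
const sonnetGlobal: ModelCost = { input: 3.0, output: 15.0, cacheRead: 0.3, cacheWrite: 3.75 };
const haikuGlobal: ModelCost = { input: 1.0, output: 5.0, cacheRead: 0.1, cacheWrite: 1.25 };

const opusRegional = withRegionalPremium(opusGlobal);
const sonnetRegional = withRegionalPremium(sonnetGlobal);
const haikuRegional = withRegionalPremium(haikuGlobal);
```

Under these assumptions the computed values match the hard-coded `costRegional` entries, for example 6.25 × 1.1 = 6.875 for Opus cache writes. The package itself stores the regional numbers as literals rather than deriving them.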

package/models/maas.ts CHANGED

@@ -8,6 +8,46 @@
 import type { VertexModelConfig } from "../types.js";
 export const MAAS_MODELS: VertexModelConfig[] = [
+  // --- xAI Grok ---
+  {
+    id: "grok-4.20-reasoning",
+    name: "Grok 4.20 Reasoning",
+    apiId: "grok-4.20-reasoning",
+    publisher: "xai",
+    endpointType: "maas",
+    contextWindow: 200000,
+    maxTokens: 32000,
+    input: ["text", "image"],
+    reasoning: true,
+    tools: true,
+    cost: {
+      input: 1.25,
+      output: 2.50,
+      cacheRead: 0.20,
+      cacheWrite: 0,
+    },
+    region: "global",
+  },
+  {
+    id: "grok-4.1-fast-reasoning",
+    name: "Grok 4.1 Fast Reasoning",
+    apiId: "grok-4.1-fast-reasoning",
+    publisher: "xai",
+    endpointType: "maas",
+    contextWindow: 128000,
+    maxTokens: 32000,
+    input: ["text", "image"],
+    reasoning: true,
+    tools: true,
+    cost: {
+      input: 0.20,
+      output: 0.50,
+      cacheRead: 0.05,
+      cacheWrite: 0,
+    },
+    region: "global",
+  },
   // --- Meta Llama ---
   {
     id: "llama-4-maverick",
@@ -383,6 +423,27 @@ export const MAAS_MODELS: VertexModelConfig[] = [
     region: "global",
   },
+  // --- Google Gemma (MaaS) ---
+  {
+    id: "gemma-4-26b-a4b-it",
+    name: "Gemma 4 26B A4B IT",
+    apiId: "gemma-4-26b-a4b-it-maas",
+    publisher: "google",
+    endpointType: "maas",
+    contextWindow: 262144,
+    maxTokens: 128000,
+    input: ["text", "image"],
+    reasoning: false,
+    tools: false,
+    cost: {
+      input: 0.15,
+      output: 0.60,
+      cacheRead: 0,
+      cacheWrite: 0,
+    },
+    region: "global",
+  },
   // --- GLM (Zhipu AI) ---
   {
     id: "glm-5",

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "@ssweens/pi-vertex",
-  "version": "1.1.4",
+  "version": "1.1.7",
   "description": "Google Vertex AI provider for Pi coding agent - supports Gemini, Claude, and all MaaS models",
   "type": "module",
   "main": "index.ts",

package/streaming/gemini.ts CHANGED

@@ -24,14 +24,53 @@ const THINKING_LEVEL_MAP: Record<string, ThinkingLevel> = {
   high: ThinkingLevel.HIGH,
 };
+interface GeminiThinkingConfig {
+  includeThoughts?: boolean;
+  thinkingBudget?: number;
+  thinkingLevel?: ThinkingLevel;
+}
+function isGemini3ProModel(modelId: string): boolean {
+  return /gemini-3(?:\.\d+)?-pro/.test(modelId.toLowerCase());
+}
+function isGemini3FlashModel(modelId: string): boolean {
+  return /gemini-3(?:\.\d+)?-flash/.test(modelId.toLowerCase());
+}
+function isGemini25ProModel(modelId: string): boolean {
+  return /gemini-2\.5-pro/.test(modelId.toLowerCase());
+}
+function getGemini3ThinkingLevel(effort: string, modelId: string): ThinkingLevel {
+  if (isGemini3ProModel(modelId)) {
+    // Pro only supports LOW/MEDIUM/HIGH — floor minimal/low to LOW
+    if (effort === "minimal" || effort === "low") return ThinkingLevel.LOW;
+    if (effort === "medium") return ThinkingLevel.MEDIUM;
+    return ThinkingLevel.HIGH;
+  }
+  return THINKING_LEVEL_MAP[effort];
+}
+function getLowestThinkingConfig(modelId: string): GeminiThinkingConfig {
+  if (isGemini3ProModel(modelId)) {
+    return { thinkingLevel: ThinkingLevel.LOW };
+  }
+  if (isGemini3FlashModel(modelId)) {
+    return { thinkingLevel: ThinkingLevel.MINIMAL };
+  }
+  if (isGemini25ProModel(modelId)) {
+    return { thinkingBudget: 128 };
+  }
+  return { thinkingBudget: 0 };
+}
 function mapGeminiStopReason(reason: string): "stop" | "length" | "toolUse" | "error" {
   switch (reason) {
     case FinishReason.STOP:
       return "stop";
     case FinishReason.MAX_TOKENS:
       return "length";
-    case FinishReason.SAFETY:
-    case FinishReason.RECITATION:
     default:
       return "error";
   }
@@ -79,9 +118,11 @@ export function streamGemini(
       // Convert messages with model ID for proper thinking/tool handling
       const contents = convertToGeminiMessages(context.messages, model.apiId);
-      // Build config — only set temperature when explicitly provided
-      const config: any = {
-        maxOutputTokens: options?.maxTokens || Math.floor(model.maxTokens / 2),
+      // Build config — only set temperature when explicitly provided.
+      // The Vertex Gemini config shape is sprawling; use Record to avoid
+      // fighting the SDK's incomplete typings.
+      const config: Record<string, unknown> = {
+        maxOutputTokens: options?.maxTokens || model.maxTokens,
         ...(options?.temperature !== undefined && { temperature: options.temperature }),
       };
@@ -95,28 +136,33 @@ export function streamGemini(
         config.tools = convertToolsForGemini(context.tools);
       }
-      // Add thinking configuration (matches pi-mono's buildParams logic)
-      if (model.reasoning && options?.reasoning) {
-        const effort = options.reasoning === "xhigh" ? "high" : options.reasoning;
-        const isGemini3 = model.apiId.startsWith("gemini-3");
-        const thinkingConfig: any = { includeThoughts: true };
+      // Add thinking configuration (matches pi-mono's buildParams logic).
+      // For reasoning models: always set a minimum thinking config so the model
+      // doesn't silently suppress thoughts when no effort level is specified.
+      if (model.reasoning) {
+        if (options?.reasoning) {
+          const effort = options.reasoning === "xhigh" ? "high" : options.reasoning;
+          const isGemini3 = model.apiId.startsWith("gemini-3");
+          const thinkingConfig: GeminiThinkingConfig = { includeThoughts: true };
+          if (isGemini3) {
+            // Gemini 3 Pro doesn't support MINIMAL; Flash models do.
+            thinkingConfig.thinkingLevel = getGemini3ThinkingLevel(effort, model.apiId);
+          } else {
+            // Gemini 2.5 models use thinking budgets (token counts)
+            const budgets: Record<string, number> = {
+              minimal: 128,
+              low: 2048,
+              medium: 8192,
+              high: model.apiId.includes("2.5-pro") ? 32768 : 24576,
+            };
+            thinkingConfig.thinkingBudget = budgets[effort] ?? 8192;
+          }
-        if (isGemini3) {
-          // Gemini 3 models use thinking levels (MINIMAL/LOW/MEDIUM/HIGH)
-          thinkingConfig.thinkingLevel = THINKING_LEVEL_MAP[effort];
+          config.thinkingConfig = thinkingConfig;
         } else {
-          // Gemini 2.5 models use thinking budgets (token counts)
-          const budgets: Record<string, number> = {
-            minimal: 128,
-            low: 2048,
-            medium: 8192,
-            high: model.apiId.includes("2.5-pro") ? 32768 : 24576,
-          };
-          thinkingConfig.thinkingBudget = budgets[effort] ?? 8192;
+          config.thinkingConfig = getLowestThinkingConfig(model.apiId);
         }
-        config.thinkingConfig = thinkingConfig;
       }
       // Pass abort signal to SDK for in-flight cancellation
@@ -136,8 +182,10 @@ export function streamGemini(
         config,
       });
-      // Track current content block for thinking/text transitions
-      let currentBlock: any = null;
+      // Track current content block for thinking/text transitions.
+      type StreamingTextBlock = { type: "text"; text: string; textSignature?: string };
+      type StreamingThinkingBlock = { type: "thinking"; thinking: string; thinkingSignature?: string };
+      let currentBlock: StreamingTextBlock | StreamingThinkingBlock | null = null;
       let currentBlockType: "text" | "thinking" | null = null;
       for await (const chunk of response) {
@@ -152,13 +200,11 @@ export function streamGemini(
               // Check if we need to transition to a new block
               if (currentBlockType !== targetType) {
-                // End previous block
-                if (currentBlock && currentBlockType) {
-                  if (currentBlockType === "text") {
-                    stream.push({ type: "text_end", contentIndex: output.content.length - 1, content: currentBlock.text, partial: output });
-                  } else {
-                    stream.push({ type: "thinking_end", contentIndex: output.content.length - 1, content: currentBlock.thinking, partial: output });
-                  }
+                // End previous block (narrow on type for correct field access)
+                if (currentBlock?.type === "text") {
+                  stream.push({ type: "text_end", contentIndex: output.content.length - 1, content: currentBlock.text, partial: output });
+                } else if (currentBlock?.type === "thinking") {
+                  stream.push({ type: "thinking_end", contentIndex: output.content.length - 1, content: currentBlock.thinking, partial: output });
                 }
                 // Start new block
@@ -174,12 +220,12 @@ export function streamGemini(
                 currentBlockType = targetType;
               }
-              // Accumulate content
-              if (currentBlockType === "thinking") {
+              // Accumulate content (narrow on discriminant for type safety)
+              if (currentBlock?.type === "thinking") {
                 currentBlock.thinking += part.text;
                 currentBlock.thinkingSignature = retainThoughtSignature(currentBlock.thinkingSignature, part.thoughtSignature);
                 stream.push({ type: "thinking_delta", contentIndex: output.content.length - 1, delta: part.text, partial: output });
-              } else {
+              } else if (currentBlock?.type === "text") {
                 currentBlock.text += part.text;
                 currentBlock.textSignature = retainThoughtSignature(currentBlock.textSignature, part.thoughtSignature);
                 stream.push({ type: "text_delta", contentIndex: output.content.length - 1, delta: part.text, partial: output });
@@ -188,12 +234,12 @@ export function streamGemini(
             if (part.functionCall) {
               // End current text/thinking block before tool call
-              if (currentBlock && currentBlockType) {
-                if (currentBlockType === "text") {
-                  stream.push({ type: "text_end", contentIndex: output.content.length - 1, content: currentBlock.text, partial: output });
-                } else {
-                  stream.push({ type: "thinking_end", contentIndex: output.content.length - 1, content: currentBlock.thinking, partial: output });
-                }
+              if (currentBlock?.type === "text") {
+                stream.push({ type: "text_end", contentIndex: output.content.length - 1, content: currentBlock.text, partial: output });
+              } else if (currentBlock?.type === "thinking") {
+                stream.push({ type: "thinking_end", contentIndex: output.content.length - 1, content: currentBlock.thinking, partial: output });
+              }
+              if (currentBlock) {
                 currentBlock = null;
                 currentBlockType = null;
               }
@@ -210,7 +256,7 @@ export function streamGemini(
                 type: "toolCall" as const,
                 id: toolCallId,
                 name: part.functionCall.name || "",
-                arguments: (part.functionCall.args as Record<string, any>) ?? {},
+                arguments: (part.functionCall.args as Record<string, unknown>) ?? {},
                 ...(part.thoughtSignature && { thoughtSignature: part.thoughtSignature }),
               };
@@ -230,18 +276,26 @@ export function streamGemini(
             output.errorMessage = "Content blocked by safety filters";
           }
           // Override to toolUse if any tool calls are present (matches pi-mono)
-          if (output.content.some((b: any) => b.type === "toolCall")) {
+          if (output.content.some((b) => b.type === "toolCall")) {
             output.stopReason = "toolUse";
           }
         }
-        // Update usage — include thoughtsTokenCount in output (matches pi-mono)
+        // Update usage — include thoughtsTokenCount in output (matches pi-mono).
+        // Subtract cached tokens from prompt to avoid double-counting in input cost.
         if (chunk.usageMetadata) {
-          const meta = chunk.usageMetadata as any;
+          const meta = chunk.usageMetadata as {
+            cachedContentTokenCount?: number;
+            promptTokenCount?: number;
+            candidatesTokenCount?: number;
+            thoughtsTokenCount?: number;
+            totalTokenCount?: number;
+          };
+          const cachedTokens = meta.cachedContentTokenCount || 0;
           output.usage = {
-            input: meta.promptTokenCount || 0,
+            input: Math.max(0, (meta.promptTokenCount || 0) - cachedTokens),
             output: (meta.candidatesTokenCount || 0) + (meta.thoughtsTokenCount || 0),
-            cacheRead: meta.cachedContentTokenCount || 0,
+            cacheRead: cachedTokens,
             cacheWrite: 0,
             totalTokens: meta.totalTokenCount || 0,
             cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0, total: 0 },
@@ -251,15 +305,17 @@ export function streamGemini(
       }
       // End final block
-      if (currentBlock && currentBlockType) {
-        if (currentBlockType === "text") {
-          stream.push({ type: "text_end", contentIndex: output.content.length - 1, content: currentBlock.text, partial: output });
-        } else {
-          stream.push({ type: "thinking_end", contentIndex: output.content.length - 1, content: currentBlock.thinking, partial: output });
-        }
+      if (currentBlock?.type === "text") {
+        stream.push({ type: "text_end", contentIndex: output.content.length - 1, content: currentBlock.text, partial: output });
+      } else if (currentBlock?.type === "thinking") {
+        stream.push({ type: "thinking_end", contentIndex: output.content.length - 1, content: currentBlock.thinking, partial: output });
+      }
+      if (options?.signal?.aborted) {
+        throw new Error("Request was aborted");
       }
-      stream.push({ type: "done", reason: output.stopReason as any, message: output });
+      stream.push({ type: "done", reason: output.stopReason, message: output });
       stream.end();
     } catch (error) {
       output.stopReason = options?.signal?.aborted ? "aborted" : "error";
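
The usage change in the hunk above (subtracting cached tokens from the prompt count) can be isolated into a small function. This is a sketch under stated assumptions: `toUsage` and the `Usage` shape are hypothetical names, and the metadata fields are the ones the diff casts `chunk.usageMetadata` to.

```typescript
// Field names taken from the type assertion in the hunk above.
interface GeminiUsageMetadata {
  cachedContentTokenCount?: number;
  promptTokenCount?: number;
  candidatesTokenCount?: number;
  thoughtsTokenCount?: number;
  totalTokenCount?: number;
}

// Hypothetical, trimmed-down usage record (the package's record also carries a cost object).
interface Usage {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
}

// promptTokenCount already includes cached tokens, so billable input is the
// difference, clamped at zero in case the counts are ever inconsistent.
function toUsage(meta: GeminiUsageMetadata): Usage {
  const cachedTokens = meta.cachedContentTokenCount || 0;
  return {
    input: Math.max(0, (meta.promptTokenCount || 0) - cachedTokens),
    output: (meta.candidatesTokenCount || 0) + (meta.thoughtsTokenCount || 0),
    cacheRead: cachedTokens,
    cacheWrite: 0,
    totalTokens: meta.totalTokenCount || 0,
  };
}

// Illustrative numbers: 10,000 prompt tokens of which 8,000 came from cache.
const usage = toUsage({
  promptTokenCount: 10_000,
  cachedContentTokenCount: 8_000,
  candidatesTokenCount: 500,
  thoughtsTokenCount: 200,
  totalTokenCount: 10_700,
});
// usage.input is 2000, usage.cacheRead is 8000, usage.output is 700
```

Before 1.1.6 the same metadata would have reported 10,000 input tokens plus 8,000 cache reads, so the cached portion was charged at both rates.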

package/streaming/maas.ts CHANGED

@@ -53,6 +53,11 @@ async function streamAnthropic(
   const location = resolveLocation(model.region);
   const auth = getAuthConfig(location);
+  // Use regional pricing when the resolved endpoint is not the global one.
+  // Models without costRegional (e.g. Opus 4.1, Sonnet 4) have uniform pricing.
+  const effectiveCost =
+    auth.location !== "global" && model.costRegional ? model.costRegional : model.cost;
   const client = new AnthropicVertex({
     projectId: auth.projectId,
     region: auth.location,
@@ -218,7 +223,7 @@ async function streamAnthropic(
   const params: any = {
     model: model.apiId,
-    max_tokens: options?.maxTokens || Math.floor(model.maxTokens / 2),
+    max_tokens: options?.maxTokens || model.maxTokens,
     messages,
     ...(context.systemPrompt ? { system: context.systemPrompt } : {}),
     ...(tools && tools.length > 0 ? { tools } : {}),
@@ -314,7 +319,7 @@ async function streamAnthropic(
   }
   output.usage.totalTokens = output.usage.input + output.usage.output + output.usage.cacheRead + output.usage.cacheWrite;
-  calculateCost(model as any, output.usage);
+  calculateCost({ ...model, cost: effectiveCost } as any, output.usage);
   if (output.content.some((b: any) => b.type === "toolCall")) {
     output.stopReason = "toolUse";
@@ -371,7 +376,7 @@ export function streamMaaS(
       const innerStream = streamSimpleOpenAICompletions(modelForPi, context as any, {
         ...options,
         apiKey: accessToken,
-        maxTokens: options?.maxTokens || Math.floor(model.maxTokens / 2),
+        maxTokens: options?.maxTokens || model.maxTokens,
         temperature: options?.temperature,
       });
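
To make the `effectiveCost` selection in `streamAnthropic` concrete, here is a minimal sketch of how the tier choice changes a bill. `pickCost` restates the ternary from the hunk above; `estimateUsd` is a hypothetical stand-in for the cost calculation (the real `calculateCost` is not shown in this diff), and the token counts are made up for illustration.

```typescript
// Prices are USD per 1M tokens, as in the model definitions.
interface ModelCost {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

interface PricedModel {
  cost: ModelCost;
  costRegional?: ModelCost;
}

interface TokenUsage {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

// Same condition as the diff: regional tier only when the resolved
// location is not "global" and the model defines one.
function pickCost(model: PricedModel, location: string): ModelCost {
  return location !== "global" && model.costRegional ? model.costRegional : model.cost;
}

// Hypothetical cost estimate: per-million price times token count.
function estimateUsd(cost: ModelCost, usage: TokenUsage): number {
  const micros =
    cost.input * usage.input +
    cost.output * usage.output +
    cost.cacheRead * usage.cacheRead +
    cost.cacheWrite * usage.cacheWrite;
  return micros / 1_000_000;
}

// claude-sonnet-4-6 prices from this diff.
const sonnet46: PricedModel = {
  cost: { input: 3.0, output: 15.0, cacheRead: 0.3, cacheWrite: 3.75 },
  costRegional: { input: 3.3, output: 16.5, cacheRead: 0.33, cacheWrite: 4.125 },
};

// A model with uniform pricing simply omits costRegional (Opus 4.1 prices from the README table).
const opus41: PricedModel = {
  cost: { input: 15.0, output: 75.0, cacheRead: 0, cacheWrite: 0 },
};

const usage: TokenUsage = { input: 1_000_000, output: 100_000, cacheRead: 0, cacheWrite: 0 };

const globalUsd = estimateUsd(pickCost(sonnet46, "global"), usage); // about 4.50
const regionalUsd = estimateUsd(pickCost(sonnet46, "us-east5"), usage); // about 4.95
const uniformUsd = estimateUsd(pickCost(opus41, "europe-west1"), usage); // about 22.50, same as global
```

Note that in this diff only the Anthropic path applies the selection; the OpenAI-compatible path in `streamMaaS` shows no equivalent change.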

package/types.ts CHANGED

@@ -47,7 +47,17 @@ export interface VertexModelConfig {
   input: ModelInputType[];
   reasoning: boolean;
   tools: boolean;
+  /** Pricing for the global endpoint (default). */
   cost: ModelCost;
+  /**
+   * Pricing for non-global regional endpoints (us-east5, europe-west1,
+   * asia-southeast1, us/eu multi-region, etc.).
+   *
+   * When the resolved GCP location is not "global" and this field is set,
+   * the streaming layer uses these costs instead of `cost`.
+   * Omit for models whose pricing is uniform across all regions.
+   */
+  costRegional?: ModelCost;
   region: string;
 }

package/utils.ts CHANGED

@@ -7,18 +7,24 @@
 import type {
   AssistantMessage,
+  ImageContent,
   Message,
   TextContent,
   ThinkingContent,
+  Tool,
   ToolCall,
   ToolResultMessage,
 } from "./types.js";
 /**
- * Sanitize text by removing invalid surrogate pairs
+ * Sanitize text by removing unpaired surrogate code units.
+ * Valid surrogate pairs (emoji) are preserved.
  */
 export function sanitizeText(text: string): string {
-  return text.replace(/[\uD800-\uDFFF]/g, "\uFFFD");
+  return text.replace(
+    /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g,
+    "",
+  );
 }
 // --- Thought signature helpers (matching pi-mono google-shared.ts) ---
@@ -50,12 +56,20 @@ export function retainThoughtSignature(
   return existing;
 }
-/**
- * Whether a model requires explicit tool call IDs in functionCall parts.
- * Claude and GPT-OSS models on Vertex require them; native Gemini models don't.
- */
-function requiresToolCallId(modelId: string): boolean {
-  return modelId.startsWith("claude-") || modelId.startsWith("gpt-oss-");
+type GeminiContent = {
+  role: "user" | "model";
+  parts: Array<Record<string, unknown>>;
+};
+function getGeminiMajorVersion(modelId: string): number | undefined {
+  const match = modelId.toLowerCase().match(/^gemini(?:-live)?-(\d+)/);
+  return match ? Number.parseInt(match[1], 10) : undefined;
+}
+function supportsMultimodalFunctionResponse(modelId: string): boolean {
+  const majorVersion = getGeminiMajorVersion(modelId);
+  if (majorVersion !== undefined) return majorVersion >= 3;
+  return true;
 }
 /**
@@ -64,12 +78,77 @@ function requiresToolCallId(modelId: string): boolean {
  * Handles the full pi-ai Message union: UserMessage, AssistantMessage (with
  * TextContent, ThinkingContent, ToolCall blocks), and ToolResultMessage.
  */
-export function convertToGeminiMessages(messages: Message[], modelId: string): any[] {
-  const result: any[] = [];
+export function convertToGeminiMessages(messages: Message[], modelId: string): GeminiContent[] {
+  const result: GeminiContent[] = [];
   const isGemini3 = modelId.startsWith("gemini-3");
+  let pendingToolCalls: ToolCall[] = [];
+  let existingToolResultIds = new Set<string>();
+  const pushToolResult = (
+    toolCallId: string,
+    toolName: string,
+    content: ToolResultMessage["content"],
+    isError: boolean,
+  ) => {
+    const textContent = content.filter((c): c is TextContent => c.type === "text");
+    const textResult = textContent.map((c) => c.text).join("\n");
+    const imageContent = content.filter((c): c is ImageContent => c.type === "image");
+    const hasText = textResult.length > 0;
+    const hasImages = imageContent.length > 0;
+    const responseValue = hasText
+      ? sanitizeText(textResult)
+      : hasImages
+        ? "(see attached image)"
+        : "";
+    const imageParts = imageContent.map((img) => ({
+      inlineData: { mimeType: img.mimeType, data: img.data },
+    }));
+    const functionResponsePart: Record<string, unknown> = {
+      functionResponse: {
+        name: toolName,
+        response: isError ? { error: responseValue } : { output: responseValue },
+        ...(hasImages && supportsMultimodalFunctionResponse(modelId) ? { parts: imageParts } : {}),
+      },
+    };
+    // Merge consecutive tool results into a single user turn (required by Gemini API)
+    const lastContent = result[result.length - 1];
+    if (lastContent?.role === "user" && lastContent.parts?.some((p) => "functionResponse" in p)) {
+      lastContent.parts.push(functionResponsePart);
+    } else {
+      result.push({ role: "user", parts: [functionResponsePart] });
+    }
+    // Gemini < 3: carry image tool results as a separate user image turn
+    if (hasImages && !supportsMultimodalFunctionResponse(modelId)) {
+      result.push({
+        role: "user",
+        parts: [{ text: "Tool result image:" }, ...imageParts],
+      });
+    }
+  };
+  const flushMissingToolResults = () => {
+    if (pendingToolCalls.length === 0) return;
+    for (const toolCall of pendingToolCalls) {
+      if (!existingToolResultIds.has(toolCall.id)) {
+        pushToolResult(
+          toolCall.id,
+          toolCall.name,
+          [{ type: "text", text: "No result provided" }],
+          true,
+        );
+      }
+    }
+    pendingToolCalls = [];
+    existingToolResultIds = new Set<string>();
+  };
   for (const msg of messages) {
     if (msg.role === "user") {
+      flushMissingToolResults();
       if (typeof msg.content === "string") {
         if (msg.content.trim()) {
           result.push({
@@ -78,33 +157,34 @@ export function convertToGeminiMessages(messages: Message[], modelId: string): a
           });
         }
       } else {
-        const parts = msg.content.map((item) => {
-          if (item.type === "text") {
-            return { text: sanitizeText(item.text) };
-          } else {
-            return {
-              inlineData: {
-                mimeType: item.mimeType,
-                data: item.data,
-              },
-            };
-          }
-        });
+        const parts: Array<Record<string, unknown>> = msg.content.map(
+          (item: TextContent | ImageContent) => {
+            if (item.type === "text") {
+              return { text: sanitizeText(item.text) };
+            }
+            return { inlineData: { mimeType: item.mimeType, data: item.data } };
+          },
+        );
         if (parts.length > 0) {
           result.push({ role: "user", parts });
         }
       }
     } else if (msg.role === "assistant") {
       const assistantMsg = msg as AssistantMessage;
+      flushMissingToolResults();
       // Skip errored/aborted messages — they're incomplete turns
       if (assistantMsg.stopReason === "error" || assistantMsg.stopReason === "aborted") {
         continue;
       }
+      // Also require api match so cross-provider thought signatures aren't replayed
       const isSameProviderAndModel =
-        assistantMsg.provider === "vertex" && assistantMsg.model === modelId;
-      const parts: any[] = [];
+        assistantMsg.provider === "vertex" &&
+        assistantMsg.api === "google-generative-ai" &&
+        assistantMsg.model === modelId;
+      const parts: Array<Record<string, unknown>> = [];
+      const toolCalls: ToolCall[] = [];
       for (const block of assistantMsg.content) {
         if (block.type === "text") {
@@ -134,13 +214,13 @@ export function convertToGeminiMessages(messages: Message[], modelId: string): a
           }
         } else if (block.type === "toolCall") {
           const toolCallBlock = block as ToolCall;
+          toolCalls.push(toolCallBlock);
           const thoughtSig = resolveThoughtSignature(isSameProviderAndModel, toolCallBlock.thoughtSignature);
-          const part: any = {
+          const part: Record<string, unknown> = {
             functionCall: {
               name: toolCallBlock.name,
               args: toolCallBlock.arguments ?? {},
-              ...(requiresToolCallId(modelId) ? { id: toolCallBlock.id } : {}),
             },
           };
           if (thoughtSig) {
@@ -159,31 +239,24 @@ export function convertToGeminiMessages(messages: Message[], modelId: string): a
       if (parts.length > 0) {
         result.push({ role: "model", parts });
       }
+      if (toolCalls.length > 0) {
+        pendingToolCalls = toolCalls;
+        existingToolResultIds = new Set<string>();
+      }
     } else if (msg.role === "toolResult") {
       const toolResultMsg = msg as ToolResultMessage;
-      const textContent = toolResultMsg.content.filter((c) => c.type === "text") as TextContent[];
-      const textResult = textContent.map((c) => c.text).join("\n");
-      const responseValue = textResult || "";
-      const includeId = requiresToolCallId(modelId);
-      const functionResponsePart: any = {
-        functionResponse: {
-          name: toolResultMsg.toolName,
-          response: toolResultMsg.isError ? { error: responseValue } : { output: responseValue },
-          ...(includeId ? { id: toolResultMsg.toolCallId } : {}),
-        },
-      };
-      // Merge consecutive tool results into a single user turn (required by Gemini API)
-      const lastContent = result[result.length - 1];
-      if (lastContent?.role === "user" && lastContent.parts?.some((p: any) => p.functionResponse)) {
-        lastContent.parts.push(functionResponsePart);
-      } else {
-        result.push({ role: "user", parts: [functionResponsePart] });
-      }
+      existingToolResultIds.add(toolResultMsg.toolCallId);
+      pushToolResult(
+        toolResultMsg.toolCallId,
+        toolResultMsg.toolName,
+        toolResultMsg.content,
+        toolResultMsg.isError,
+      );
     }
   }
+  flushMissingToolResults();
   return result;
 }
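
Note on `flushMissingToolResults`: the new code tracks the tool calls from the last assistant turn and, before the next user or assistant message (and once more at the end), emits an error result reading "No result provided" for every call that never received one. The sketch below reproduces that bookkeeping on a reduced message shape. `SketchMessage`, `SketchToolCall`, and `fillMissingToolResults` are hypothetical names, and the Gemini-specific parts (`functionResponse` wrapping, merging into one user turn, image handling) are deliberately left out.

```typescript
// Hypothetical, reduced message types; the real pi-ai Message union is richer.
type SketchToolCall = { id: string; name: string };
type SketchMessage =
  | { role: "user"; text: string }
  | { role: "assistant"; toolCalls: SketchToolCall[] }
  | { role: "toolResult"; toolCallId: string; toolName: string; text: string; isError: boolean };

function fillMissingToolResults(messages: SketchMessage[]): SketchMessage[] {
  const out: SketchMessage[] = [];
  let pending: SketchToolCall[] = [];
  let answered = new Set<string>();

  // Emit a placeholder error result for each pending call without a result.
  const flush = (): void => {
    for (const call of pending) {
      if (!answered.has(call.id)) {
        out.push({
          role: "toolResult",
          toolCallId: call.id,
          toolName: call.name,
          text: "No result provided",
          isError: true,
        });
      }
    }
    pending = [];
    answered = new Set<string>();
  };

  for (const msg of messages) {
    if (msg.role === "toolResult") {
      answered.add(msg.toolCallId);
      out.push(msg);
      continue;
    }
    // A user or assistant message closes the previous tool-call round.
    flush();
    out.push(msg);
    if (msg.role === "assistant" && msg.toolCalls.length > 0) {
      pending = msg.toolCalls;
    }
  }
  flush(); // cover tool calls left unanswered at the end of the history
  return out;
}

const sketchHistory: SketchMessage[] = [
  { role: "assistant", toolCalls: [{ id: "a", name: "read" }, { id: "b", name: "grep" }] },
  { role: "toolResult", toolCallId: "a", toolName: "read", text: "file contents", isError: false },
  { role: "user", text: "continue" },
];

const sketchFilled = fillMissingToolResults(sketchHistory);
console.log(sketchFilled.map((m) => m.role));
```

In this history, call `b` has no result, so a placeholder error result is inserted between the result for `a` and the user message, keeping every tool call paired with a response.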
@@ -191,7 +264,9 @@ export function convertToGeminiMessages(messages: Message[], modelId: string): a
  * Convert tools to Gemini format using parametersJsonSchema (full JSON Schema support).
  * This differs from OpenAI format — Gemini uses functionDeclarations wrapped in an array.
  */
-export function convertToolsForGemini(tools: any[]): any[] | undefined {
+export function convertToolsForGemini(
+  tools: Tool[],
+): Array<{ functionDeclarations: Array<Record<string, unknown>> }> | undefined {
   if (!tools || tools.length === 0) return undefined;
   return [
     {
@@ -207,7 +282,7 @@ export function convertToolsForGemini(tools: any[]): any[] | undefined {
 /**
  * Convert tools to OpenAI format (for Claude and MaaS models)
  */
-export function convertTools(tools: any[]): any[] {
+export function convertTools(tools: Tool[]): Array<Record<string, unknown>> {
   return tools.map((tool) => ({
     type: "function",
     function: {