npm - copilot-custom-endpoint - Versions diffs - 1.3.4 → 1.3.6 - Mend

copilot-custom-endpoint 1.3.4 → 1.3.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (5) hide show

package/README.md CHANGED Viewed

@@ -29,7 +29,6 @@ That's it. No code, no servers to manage (unless the model specifically needs th
 | **Qwen 3.7 Max**            | DashScope | Optional               | ❌           | [Setup](docs/models/qwen.md)                                                                       |
 | **MiniMax M3**              | MiniMax   | No                     | ✅           | [Setup](docs/models/minimax.md)                                                                    |
 | **GLM 5.1**                 | Z.ai      | No                     | ❌           | [Setup](docs/models/glm.md)                                                                        |
-| **GLM 4.7 Flash (free)**    | Z.ai      | No                     | ❌           | [Setup](docs/models/glm.md)                                                                        |
 | **GLM 5V Turbo**            | Z.ai      | No                     | ✅           | [Setup](docs/models/glm.md)                                                                        |
 | **DeepSeek V4 Pro / Flash** | DeepSeek  | No (uses an extension) | ✅ via proxy | [Marketplace](https://marketplace.visualstudio.com/items?itemName=Vizards.deepseek-v4-for-copilot) |
@@ -117,7 +116,7 @@ VS Code's built-in `view_image` tool only accepts **static images** (PNG, JPG, G
 **Video Context MCP** is a small MCP server that bridges that gap. It works with **GitHub Copilot, Cursor, and Claude Code** out of the box, and:
 - **Extracts frames** from local files or remote URLs (no `ffmpeg` gymnastics required).
-- **Routes them through a multi-provider fallback chain** — `Gemini → GLM-4.6V-flash → Qwen3.7-plus → Kimi K2.6 → MiMo-V2.5` — so a single `GLM 5V Turbo` rate-limit hiccup doesn't kill your session.
+- **Routes them through a multi-provider fallback chain** — `Gemini → GLM 5V Turbo → Qwen3.7-plus → Kimi K2.6 → MiMo-V2.5` — so a single `GLM 5V Turbo` rate-limit hiccup doesn't kill your session.
 - **Answers natural-language questions** about the video grounded in actual frames: "what does the speaker click in the last 30 seconds?", "summarize the demo", "find the frame where the error appears".
 - **Extras:** timestamp search, audio transcription with speaker diarization, and video metadata (resolution, duration, codec).

package/docs/example-config.md CHANGED Viewed

@@ -155,21 +155,7 @@ Here's a complete, real-world `chatLanguageModels.json` that combines **all the
           "top_p": 0.95
         }
       },
-      {
-        "id": "glm-4.7-flash",
-        "name": "GLM 4.7 Flash (free)",
-        "url": "https://api.z.ai/api/paas/v4/chat/completions",
-        "toolCalling": true,
-        "vision": false,
-        "streaming": true,
-        "maxInputTokens": 204800,
-        "maxOutputTokens": 131072,
-        "requestBody": {
-          "thinking": { "type": "enabled" },
-          "temperature": 1,
-          "top_p": 0.95
-        }
-      },
       {
         "id": "glm-5v-turbo",
         "name": "GLM 5V Turbo (vision flagship)",
@@ -198,6 +184,6 @@ If you only need one provider, jump straight to its setup guide:
 - [Qwen 3.7 Plus / 3.7 Max](qwen.md)
 - [Xiaomi MiMo (V2.5 / V2.5 Pro / V2 Flash)](mimo.md)
 - [MiniMax M3](minimax.md)
-- [GLM (5.1 / 4.7 Flash / 5V Turbo)](glm.md)
+- [GLM (5.1 / 5V Turbo)](glm.md)
 > **DeepSeek V4 Pro / V4 Flash** use the [DeepSeek V4 for Copilot Chat](https://marketplace.visualstudio.com/items?itemName=Vizards.deepseek-v4-for-copilot) extension — they don't appear in `chatLanguageModels.json`.

package/docs/models/glm.md CHANGED Viewed

@@ -2,42 +2,28 @@
 > **TL;DR:** GLM works directly with VS Code's custom-endpoint provider — **no proxy needed**. The API is OpenAI Chat Completions compatible at `https://api.z.ai/api/paas/v4/chat/completions`, and Z.ai's default `thinking.clear_thinking: true` quietly strips `reasoning_content` from prior turns, which makes multi-turn tool loops stable even when VS Code doesn't preserve reasoning blocks. The **GLM Coding Plan** endpoint is **not** usable here — it is locked to a curated list of coding tools (Claude Code, Cline, OpenCode, etc.).
-> ⚠️ **Free-tier rate-limit warning.** Z.ai's free models — `glm-4.7-flash` and `glm-4.6v-flash` — are aggressively throttled. In practice, expect to see HTTP `{"code":"1302","message":"Rate limit reached for requests"}` (surfaced in VS Code as `Reason: Rate limit exceeded` / `ChatRateLimited`) **on a significant fraction of requests**, especially:
->
-> - During long chat sessions or after several tool turns (context > 8K is throttled to **1%** of the standard concurrency cap).
-> - When `thinking: { type: "enabled" }` is set — thinking tokens still hold the in-flight slot, so the model occupies the throttle window longer.
-> - During peak hours, when many users are sharing the same free pool.
->
-> **This is the free tier behaving as designed, not a bug.** For uninterrupted work, use a paid model: `glm-4.6v` is the cheapest paid vision option ($0.30/$0.90 per 1M), `glm-4.7` is the best cost/quality balance for text-only ($0.60/$2.20 per 1M), and `glm-5.1` is the flagship ($1.40/$4.40 per 1M). See [Rate limits](#rate-limits) for the full breakdown.
 ## At a Glance
-| Field                  | Value                                                              |
-| ---------------------- | ------------------------------------------------------------------ |
-| Mode                   | **Direct** (no proxy)                                              |
-| Vision                 | ✅ Yes (`glm-5v-turbo` only)                                       |
-| Tool calling           | ✅ Yes (native multimodal tool use on `glm-4.6v` / `glm-5v-turbo`) |
-| Context (flagship)     | 200K (`glm-5.1` / `glm-4.7` / `glm-4.7-flash` / `glm-5v-turbo`)    |
-| Max output (flagship)  | 128K                                                               |
-| Required `requestBody` | `thinking: { type: "enabled" }` (recommended)                      |
-| Endpoint (intl)        | `https://api.z.ai/api/paas/v4/chat/completions`                    |
-| Endpoint (China)       | `https://open.bigmodel.cn/api/paas/v4/chat/completions`            |
-| Auth                   | `Authorization: Bearer $ZAI_API_KEY`                               |
+| Field                  | Value                                                   |
+| ---------------------- | ------------------------------------------------------- |
+| Mode                   | **Direct** (no proxy)                                   |
+| Vision                 | ✅ Yes (`glm-5v-turbo` only)                            |
+| Tool calling           | ✅ Yes (native multimodal tool use on `glm-5v-turbo`)   |
+| Context (flagship)     | 200K (`glm-5.1` / `glm-5v-turbo`)                       |
+| Max output (flagship)  | 128K                                                    |
+| Required `requestBody` | `thinking: { type: "enabled" }` (recommended)           |
+| Endpoint (intl)        | `https://api.z.ai/api/paas/v4/chat/completions`         |
+| Endpoint (China)       | `https://open.bigmodel.cn/api/paas/v4/chat/completions` |
+| Auth                   | `Authorization: Bearer $ZAI_API_KEY`                    |
 ### Models at a glance
-| Model            | Vision | Context | Max output | Thinking      | Cost (in / out per 1M) | Role                                                      |
-| ---------------- | ------ | ------- | ---------- | ------------- | ---------------------- | --------------------------------------------------------- |
-| `glm-5.1`        | ❌     | 200K    | 128K       | `enabled`     | $1.40 / $4.40          | Current flagship — long-horizon / 8h autonomous work      |
-| `glm-4.7`        | ❌     | 200K    | 128K       | `enabled`     | $0.60 / $2.20          | Flagship 4.x — strong coding/agent                        |
-| `glm-4.7-flash`  | ❌     | 200K    | 128K       | `enabled`     | Free ¹                 | **Free** — newest 4.x tier at no cost                     |
-| `glm-5v-turbo`   | ✅     | 200K    | 128K       | `enabled`     | $1.20 / $4.00          | Multimodal **coding** model — vision-based agentic coding |
-| `glm-4.6v`       | ✅     | 128K    | 32K        | hybrid (auto) | $0.30 / $0.90          | Vision + **native multimodal tool calls**                 |
-| `glm-4.6v-flash` | ✅     | 128K    | 32K        | hybrid (auto) | Free ¹                 | **Free** vision tier                                      |
+| Model          | Vision | Context | Max output | Thinking  | Cost (in / out per 1M) | Role                                                      |
+| -------------- | ------ | ------- | ---------- | --------- | ---------------------- | --------------------------------------------------------- |
+| `glm-5.1`      | ❌     | 200K    | 128K       | `enabled` | $1.40 / $4.40          | Current flagship — long-horizon / 8h autonomous work      |
+| `glm-5v-turbo` | ✅     | 200K    | 128K       | `enabled` | $1.20 / $4.00          | Multimodal **coding** model — vision-based agentic coding |
-> ¹ **Free-tier caveat:** the two `*flash` free models are heavily rate-limited — see the [warning at the top of this document](#glm-zai--zhipu-ai--vs-code-custom-endpoint-setup-guide) and the [Rate limits](#rate-limits) section. Expect frequent HTTP `1302 / ChatRateLimited` errors, especially on context > 8K or with thinking enabled. For reliable use, prefer `glm-4.6v` (cheapest paid vision) or `glm-4.7` (best cost/quality, text-only).
-> Other GLM models — `glm-5`, `glm-5-turbo`, `glm-4.7`, `glm-4.6`, `glm-4.6v`, `glm-4.6v-flashx`, `glm-4.6v-flash`, `glm-4.5`, `glm-4.5-air`, `glm-4.5-flash`, `glm-4.5-x`, `glm-4.5-airx`, `glm-4-32b-0414-128k` — are callable on the same endpoint but are intentionally **not** added to the default `chatLanguageModels.json` block below. Add them in the same shape if you need them. Note: `glm-4.6v-flashx` was previously in the default block but has been **removed** because live testing showed it is not reliable for tool calling.
+> Other GLM models — `glm-5`, `glm-5-turbo`, `glm-4.6v-flashx`, `glm-4.5`, `glm-4.5-air`, `glm-4.5-flash`, `glm-4.5-x`, `glm-4.5-airx`, `glm-4-32b-0414-128k` — are callable on the same endpoint but are intentionally **not** added to the default `chatLanguageModels.json` block below. Add them in the same shape if you need them. Note: `glm-4.6v-flashx` was previously in the default block but has been **removed** because live testing showed it is not reliable for tool calling.
 ## Quick Start
@@ -84,21 +70,6 @@ Config file location:
         "top_p": 0.95
       }
     },
-    {
-      "id": "glm-4.7-flash",
-      "name": "GLM 4.7 Flash (free)",
-      "url": "https://api.z.ai/api/paas/v4/chat/completions",
-      "toolCalling": true,
-      "vision": false,
-      "streaming": true,
-      "maxInputTokens": 204800,
-      "maxOutputTokens": 131072,
-      "requestBody": {
-        "thinking": { "type": "enabled" },
-        "temperature": 1,
-        "top_p": 0.95
-      }
-    },
     {
       "id": "glm-5v-turbo",
       "name": "GLM 5V Turbo (vision flagship)",
@@ -142,11 +113,11 @@ Config file location:
 ### Sampling parameters
-| Parameter     | Range (hard cap) | Default                                       |
-| ------------- | ---------------- | --------------------------------------------- |
-| `temperature` | `[0.0, 1.0]`     | `1.0` for GLM-4.6 / 4.7 / 5.x · `0.6` for 4.5 |
-| `top_p`       | `[0.01, 1.0]`    | `0.95` for 4.5/4.6/4.7/5.x · `0.9` for 4-32B  |
-| `do_sample`   | bool             | `true` — set `false` to bypass sampling       |
+| Parameter     | Range (hard cap) | Default                                  |
+| ------------- | ---------------- | ---------------------------------------- |
+| `temperature` | `[0.0, 1.0]`     | `1.0` for GLM-4.6 / 5.x · `0.6` for 4.5  |
+| `top_p`       | `[0.01, 1.0]`    | `0.95` for 4.5/4.6/5.x · `0.9` for 4-32B |
+| `do_sample`   | bool             | `true` — set `false` to bypass sampling  |
 > **Important:** Z.ai's `temperature` is capped at `1.0` server-side. Sending a value like `1.2` will be rejected. VS Code's defaults (typically `0`–`1`) are within range, but the explicit `requestBody` values above are the recommended ones for coding/agent work.
@@ -154,10 +125,10 @@ Config file location:
 `thinking` is a GLM-specific object. It only applies to **GLM-4.5 and above**.
-| Field                     | Values                 | Default                                                       | Meaning                                                                                                   |
-| ------------------------- | ---------------------- | ------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
-| `thinking.type`           | `enabled` / `disabled` | `enabled` for 5.1/5/5-Turbo/4.7/4.5V; hybrid for 4.6/4.6V/4.5 | Force-enable or force-disable chain-of-thought. Hybrid lets the model decide.                             |
-| `thinking.clear_thinking` | bool                   | **`true`**                                                    | When `true`, Z.ai **strips historical `reasoning_content` from prior turns** before sending to the model. |
+| Field                     | Values                 | Default                                              | Meaning                                                                                                   |
+| ------------------------- | ---------------------- | ---------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
+| `thinking.type`           | `enabled` / `disabled` | `enabled` for 5.1/5/5-Turbo/4.5V; hybrid for 4.6/4.5 | Force-enable or force-disable chain-of-thought. Hybrid lets the model decide.                             |
+| `thinking.clear_thinking` | bool                   | **`true`**                                           | When `true`, Z.ai **strips historical `reasoning_content` from prior turns** before sending to the model. |
 > **Why the default is good for VS Code:** with `clear_thinking: true` (the server default), Z.ai doesn't require the client to forward `reasoning_content` between turns. VS Code's custom-endpoint provider doesn't preserve that field across tool turns — but for GLM it doesn't need to, because the server strips it. This avoids the same class of `reasoning_content` 400 errors that bite on MiMo.
 >
@@ -175,86 +146,43 @@ Config file location:
 - **Streaming** via SSE, terminated with `data: [DONE]`. (Same as OpenAI.)
 - **Tool calling** with the standard `tools` array. `tool_choice` accepts only `auto`.
 - **Max 128 functions** per request.
-- **Tool stream** (`tool_stream: true`) is supported on the `glm-4.6v` family and above for streaming tool-call deltas.
-- **Vision** on `glm-4.6v`, `glm-4.6v-flash`, and `glm-5v-turbo` using the OpenAI `image_url` content-part format. External URLs and base64 data URIs both work.
-- **Video input** on `glm-5v-turbo` — the model natively accepts video (Input Modality: **Video / Image / Text / File**). Use a public video URL in an `image_url` content part via direct API call; VS Code's chat UI does not currently forward video attachments to the model. For a turnkey VS Code integration that bridges the gap (extracts frames, routes them to GLM or a fallback provider, and answers natural-language questions about the video), see [**Video Context MCP**](https://www.videocontextmcp.com/) — an MCP server that gives Copilot/Cursor/Claude Code video understanding via the `glm-4.6v` provider and a multi-provider fallback chain (Gemini → GLM-4.6V → Qwen 3.7 Plus → Kimi K2.6 → MiMo-V2.5).
-- **Native multimodal tool calling** on `glm-4.6v` (and inherited by `glm-5v-turbo`) — images, screenshots, and document pages can be passed directly as tool parameters and tool results can be consumed visually.
+- **Tool stream** (`tool_stream: true`) is supported on the `glm-5v-turbo` family and above for streaming tool-call deltas.
+- **Vision** on `glm-5v-turbo` using the OpenAI `image_url` content-part format. External URLs and base64 data URIs both work.
+- **Video input** on `glm-5v-turbo` — the model natively accepts video (Input Modality: **Video / Image / Text / File**). Use a public video URL in an `image_url` content part via direct API call; VS Code's chat UI does not currently forward video attachments to the model. For a turnkey VS Code integration that bridges the gap (extracts frames, routes them to GLM or a fallback provider, and answers natural-language questions about the video), see [**Video Context MCP**](https://www.videocontextmcp.com/) — an MCP server that gives Copilot/Cursor/Claude Code video understanding via the `glm-5v-turbo` provider and a multi-provider fallback chain (Gemini → GLM 5V Turbo → Qwen 3.7 Plus → Kimi K2.6 → MiMo-V2.5).
+- **Native multimodal tool calling** on `glm-5v-turbo` — images, screenshots, and document pages can be passed directly as tool parameters and tool results can be consumed visually.
 - **Built-in web search** is exposed as a tool type `web_search` (different from `function`).
 - **Context caching** is automatic — the API returns `usage.prompt_tokens_details.cached_tokens` on cache hits; cache writes are currently free of charge.
 - **Response shape extras:** `choices[].message.reasoning_content` (when thinking is on), `web_search[]` (when web search is used), `usage` with `cached_tokens`.
 ### Rate limits
-Z.ai throttles **by in-flight concurrency**, not classic RPM/TPM. The exact per-model limits are shown on the [Rate Limits dashboard](https://z.ai/manage-apikey/rate-limits) once you are signed in. **Pay-as-you-go models share a generous pool; free-tier models (`glm-4.7-flash`, `glm-4.6v-flash`) are on a separate, much tighter bucket.**
+Z.ai throttles **by in-flight concurrency**, not classic RPM/TPM. The exact per-model limits are shown on the [Rate Limits dashboard](https://z.ai/manage-apikey/rate-limits) once you are signed in. **All paid models share a generous pool sized to your prepaid balance.**
-#### What "rate limited" looks like in VS Code
-When a free-tier model is throttled, Z.ai returns:
-```json
-{ "code": "1302", "message": "Rate limit reached for requests" }
-```
-VS Code surfaces this as a chat-side error:
-```
-Sorry, your request failed. Please try again.
-Client Request Id: <uuid>
-Reason: Rate limit exceeded
-ChatRateLimited: Rate limit exceeded
-{"code":"1302","message":"Rate limit reached for requests"} ...
-```
-> **This is normal for free models.** It is not a configuration problem, a bad API key, or a VS Code bug — it is the free tier protecting itself from abuse. Wait ~30 seconds and retry, or switch to a paid model.
-#### Free-tier specifics
-| Constraint                                                                                    | Impact                                                                              |
-| --------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
-| Requests with **context > 8K tokens** are throttled to **1% of the standard concurrency cap** | Long chats and any session past a couple of tool turns fall into this bucket.       |
-| `thinking: { type: "enabled" }` keeps the in-flight slot held for the full reasoning duration | A thinking-on request counts as in-flight ~2–5× longer than a thinking-off request. |
-| Free quota is **per Z.ai account, shared across all free models**                             | Running `glm-4.7-flash` and `glm-4.6v-flash` concurrently drains the same bucket.   |
-| Peak-hour contention is significant                                                           | US/EU business hours see noticeably more `1302` errors than off-peak.               |
-#### Paid-tier specifics
-- Paid models (`glm-5.1`, `glm-4.7`, `glm-4.6v`, `glm-5v-turbo`) share a much larger concurrency pool sized to your prepaid balance. (Note: `glm-4.6v-flashx` was previously listed here as the cheapest paid option, but it has been removed from the recommended set because live testing showed it is not reliable for tool calling.)
-- For the cheapest reliable paid option, use `glm-4.6v` ($0.30 / $0.90 per 1M) if you need vision, or `glm-4.7` ($0.60 / $2.20 per 1M) for text-only work.
-- `glm-4.7` ($0.60 / $2.20 per 1M) is the recommended default for agent/coding work — strong quality at a low price.
-- `glm-5.1` ($1.40 / $4.40 per 1M) is the flagship and only worth it for long-horizon autonomous tasks.
-#### Reducing rate-limit pressure (still on the free tier)
-If you want to keep using `glm-4.7-flash` despite the limits:
-1. **Disable thinking** for tool-heavy sessions: set `thinking: { type: "disabled" }` in `requestBody`. Shorter responses free the slot faster.
-2. **Start new chats** instead of long-running ones — each new chat resets the per-conversation context length back under 8K.
-3. **Stagger agent runs** — don't run two Copilot agent sessions against the same Z.ai key in parallel; they share the same in-flight counter.
-4. **Retry with backoff** in your headspace: one `1302` is not a permanent block; it's a "right now is full, try again in 30s".
+For vision-capable work, use the `glm-5v-turbo` model ($1.20 / $4.00 per 1M). For text-only use, `glm-5.1` ($1.40 / $4.40 per 1M) is the flagship and recommended default for agent/coding work — strong quality for long-horizon autonomous tasks.
 > The **GLM Coding Plan** has separate (much higher) concurrency limits but is **not available via custom endpoints** — see [Why the Coding Plan is not an option](#why-the-glm-coding-plan-is-not-an-option-for-vs-code) below.
 ## Troubleshooting
-| Symptom                                                    | Likely cause                                                          | Fix                                                                                                        |
-| ---------------------------------------------------------- | --------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
-| Model not in picker                                        | Config not reloaded, or JSON syntax error                             | Restart VS Code; validate JSON                                                                             |
-| HTTP 400 on the first turn                                 | `requestBody` removed `do_sample` semantics or invalid `temperature`  | Ensure `temperature ≤ 1.0` and `top_p ∈ [0.01, 1.0]`                                                       |
-| `invalid temperature: only values ≤ 1.0 are allowed`       | Set `temperature > 1.0`                                               | Lower it to `1.0` or below                                                                                 |
-| Tool call succeeds but follow-up turn degrades             | `clear_thinking: false` set in `requestBody`                          | Remove the `clear_thinking` key and let the server default to `true`                                       |
-| `tool_choice: required` rejected                           | GLM only supports `auto`                                              | Don't override `tool_choice` (VS Code's default is `auto`)                                                 |
-| `Failed to download multimodal content` on a vision call   | Z.ai's servers couldn't reach the image URL                           | Use a base64 `data:image/...` URI instead                                                                  |
-| 401 Unauthorized                                           | Region mismatch (international key used on China URL, or vice versa)  | Match your key to the regional endpoint                                                                    |
-| Upstream complains about `reasoning_content is missing`    | You set `clear_thinking: false` from a client that doesn't forward it | Drop `clear_thinking` from `requestBody`                                                                   |
-| 429 / "concurrency limit exceeded"                         | Too many in-flight requests                                           | Reduce concurrent agent sessions, or upgrade your Z.ai plan                                                |
-| `1302` / `ChatRateLimited` on a free-tier model (`*flash`) | Expected behavior — free tier is heavily throttled                    | Wait ~30s and retry, disable `thinking`, start a new chat, or switch to `glm-4.7`                          |
-| Long Chinese responses when the prompt is English          | Missing `Accept-Language: en-US,en` (Z.ai default)                    | Optional — VS Code's custom-endpoint provider doesn't set custom headers; usually the prompt language wins |
+| Symptom                                                  | Likely cause                                                          | Fix                                                                  |
+| -------------------------------------------------------- | --------------------------------------------------------------------- | -------------------------------------------------------------------- |
+| Model not in picker                                      | Config not reloaded, or JSON syntax error                             | Restart VS Code; validate JSON                                       |
+| HTTP 400 on the first turn                               | `requestBody` removed `do_sample` semantics or invalid `temperature`  | Ensure `temperature ≤ 1.0` and `top_p ∈ [0.01, 1.0]`                 |
+| `invalid temperature: only values ≤ 1.0 are allowed`     | Set `temperature > 1.0`                                               | Lower it to `1.0` or below                                           |
+| Tool call succeeds but follow-up turn degrades           | `clear_thinking: false` set in `requestBody`                          | Remove the `clear_thinking` key and let the server default to `true` |
+| `tool_choice: required` rejected                         | GLM only supports `auto`                                              | Don't override `tool_choice` (VS Code's default is `auto`)           |
+| `Failed to download multimodal content` on a vision call | Z.ai's servers couldn't reach the image URL                           | Use a base64 `data:image/...` URI instead                            |
+| 401 Unauthorized                                         | Region mismatch (international key used on China URL, or vice versa)  | Match your key to the regional endpoint                              |
+| Upstream complains about `reasoning_content is missing`  | You set `clear_thinking: false` from a client that doesn't forward it | Drop `clear_thinking` from `requestBody`                             |
+| 429 / "concurrency limit exceeded"                       | Too many in-flight requests                                           | Reduce concurrent agent sessions, or upgrade your Z.ai plan          |
+| Long Chinese responses when the prompt is English | Missing `Accept-Language: en-US,en` (Z.ai default) | Optional — VS Code's custom-endpoint provider doesn't set custom headers; usually the prompt language wins |
 ## Pricing
 All prices are **USD per 1M tokens** (cache miss) on the Z.ai international platform. Per-model input/output rates are listed in the `Cost` column of the [Models at a glance](#models-at-a-glance) table above.
-> **Cache writes** are currently **Limited-time Free** for all models. Cached-input pricing is roughly 18% of the input price (e.g. `$0.60` input → `$0.11` cached for `glm-4.7`). China platform (`bigmodel.cn`) prices in CNY; see the [China pricing page](https://bigmodel.cn/pricing). For the cross-provider comparison, see [docs/pricing.md](../pricing.md).
+> **Cache writes** are currently **Limited-time Free** for all models. Cached-input pricing is roughly 18% of the input price (e.g. `$1.40` input → `$0.25` cached for `glm-5.1`). China platform (`bigmodel.cn`) prices in CNY; see the [China pricing page](https://bigmodel.cn/pricing). For the cross-provider comparison, see [docs/pricing.md](../pricing.md).
 ---
@@ -275,15 +203,15 @@ That makes VS Code's `chat-completions` provider the obvious starting point —
 ### What differs from other providers in this repo
-| Concern                           | Z.ai / GLM behaviour                                                                                                                                                                                                             | Why it matters for VS Code                                                                                                                  |
-| --------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
-| Thinking default                  | Hybrid for `glm-4.6v` / `glm-4.6v-flash`; always-on for `glm-5.1` / `glm-4.7` / `glm-4.7-flash` / `glm-5v-turbo`.                                                                                                                | VS Code can simply set `thinking: { type: "enabled" }` in `requestBody` to make thinking deterministic on every model.                      |
-| `reasoning_content` on tool turns | Z.ai defaults to `clear_thinking: true`, **silently stripping historical `reasoning_content`**.                                                                                                                                  | This is a near-perfect match for VS Code, which does **not** preserve `reasoning_content` between turns. Loops work without extra plumbing. |
-| `tool_choice`                     | Only `auto` is accepted.                                                                                                                                                                                                         | VS Code's default behaviour is `auto`, so no override needed.                                                                               |
-| `temperature` hard cap            | `[0.0, 1.0]` — strictly enforced server-side.                                                                                                                                                                                    | Use `1.0` for coding/agent work; never go above.                                                                                            |
-| `do_sample`                       | Default `true`. When `false`, `temperature` and `top_p` are ignored.                                                                                                                                                             | Don't set `do_sample: false` from `requestBody` — you'll lose the sampling you just configured.                                             |
-| Coding Plan endpoint              | A separate endpoint at `https://api.z.ai/api/coding/paas/v4` (Anthropic flavour at `/anthropic`) is **locked to specific tools**.                                                                                                | Cannot be used for VS Code custom endpoints — see below.                                                                                    |
-| Vision (image input + tool use)   | OpenAI `image_url` content-part format (external URLs and base64 data URIs both work). `glm-4.6v` introduced **native multimodal tool use** (images as tool args, tool results consumed visually); `glm-5v-turbo` inherits this. | Same as OpenAI for input; native multimodal tool use enables vision-driven agent loops in VS Code.                                          |
+| Concern                           | Z.ai / GLM behaviour                                                                                                                                                                                 | Why it matters for VS Code                                                                                                                  |
+| --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
+| Thinking default                  | Always-on for `glm-5.1` / `glm-5v-turbo`.                                                                                                                                                            | VS Code can simply set `thinking: { type: "enabled" }` in `requestBody` to make thinking deterministic on every model.                      |
+| `reasoning_content` on tool turns | Z.ai defaults to `clear_thinking: true`, **silently stripping historical `reasoning_content`**.                                                                                                      | This is a near-perfect match for VS Code, which does **not** preserve `reasoning_content` between turns. Loops work without extra plumbing. |
+| `tool_choice`                     | Only `auto` is accepted.                                                                                                                                                                             | VS Code's default behaviour is `auto`, so no override needed.                                                                               |
+| `temperature` hard cap            | `[0.0, 1.0]` — strictly enforced server-side.                                                                                                                                                        | Use `1.0` for coding/agent work; never go above.                                                                                            |
+| `do_sample`                       | Default `true`. When `false`, `temperature` and `top_p` are ignored.                                                                                                                                 | Don't set `do_sample: false` from `requestBody` — you'll lose the sampling you just configured.                                             |
+| Coding Plan endpoint              | A separate endpoint at `https://api.z.ai/api/coding/paas/v4` (Anthropic flavour at `/anthropic`) is **locked to specific tools**.                                                                    | Cannot be used for VS Code custom endpoints — see below.                                                                                    |
+| Vision (image input + tool use)   | OpenAI `image_url` content-part format (external URLs and base64 data URIs both work). `glm-5v-turbo` supports **native multimodal tool use** (images as tool args, tool results consumed visually). | Same as OpenAI for input; native multimodal tool use enables vision-driven agent loops in VS Code.                                          |
 ### Why the GLM Coding Plan is **not** an option for VS Code
@@ -293,7 +221,7 @@ The Coding Plan ($18–$160/mo) is a subscription that gates access to a small s
 - _"If the system detects usage through unauthorized or unsupported tools (such as SDK-based access or other third-party integrations), some subscription benefits may be restricted."_
 - The Coding endpoint base URL (`/api/coding/paas/v4`) is the only one that consumes the subscription quota — the general `/api/paas/v4` endpoint **always** charges against pay-as-you-go balance, even if you have an active Coding Plan.
-VS Code's custom-endpoint provider is not on the supported list, so the Coding Plan endpoint is the wrong target. The general `/api/paas/v4/chat/completions` endpoint is the correct one, and it bills against your prepaid Z.ai balance (or, for `glm-4.7-flash` / `glm-4.6v-flash`, is free).
+VS Code's custom-endpoint provider is not on the supported list, so the Coding Plan endpoint is the wrong target. The general `/api/paas/v4/chat/completions` endpoint is the correct one, and it bills against your prepaid Z.ai balance.
 ### Plan for this repo
@@ -304,7 +232,7 @@ This file is the **research record and the user-facing setup guide**. The implem
 3. **No new proxy.** The direct path is sufficient. If a future provider quirk surfaces (e.g. a sampling cap that VS Code sends above `1.0`, or a `clear_thinking` semantics change), the existing `lib/create-proxy.mjs` factory makes it cheap to add a `proxy/glm-proxy.mjs` mirroring `proxy/qwen-proxy.mjs`.
 4. **No CLI / npm-script changes.** `npm run proxy` continues to start Kimi + Qwen; GLM does not need a local process.
 5. **No test changes.** Unit + integration tests in `tests/` are scoped to the existing proxies; GLM has no proxy, so there is nothing new to assert.
-6. ~~**Live validation pending.**~~ ✅ **Live validation complete for `glm-5v-turbo` and `glm-5.1` (text-only).** See [Validation results](#validation-results) for the full pass/fail table. Remaining `curl`-based checks (`glm-4.7-flash`, `glm-4.7`, `glm-4.6v`) and `glm-4.6v` vision test are still pending.
+6. ~~**Live validation pending.**~~ ✅ **Live validation complete for `glm-5v-turbo` and `glm-5.1` (text-only).** See [Validation results](#validation-results) for the full pass/fail table.
 ### Validation results
@@ -327,23 +255,18 @@ This file is the **research record and the user-facing setup guide**. The implem
 | 6   | VS Code: `glm-5.1` appears in picker                                      | "Agent \| GLM 5.1 (flagship)"                        | ✅     |
 | 7   | VS Code: plain chat, streaming chat on `glm-5.1`                          | Streaming output visible                             | ✅     |
 | 8   | VS Code: agent mode — tool calling (browser open) on `glm-5.1`            | Multi-turn tool loop succeeds                        | ✅     |
-| 9   | `curl` non-streaming `glm-4.7-flash` against Z.ai                         | HTTP 200, assistant message in `content`             | ⏳     |
-| 10  | `curl` streaming `glm-4.7-flash`                                          | HTTP 200, SSE chunks with `data: [DONE]` terminator  | ⏳     |
-| 11  | `curl` tool call `glm-4.7` (function-call tool)                           | HTTP 200, `finish_reason: "tool_calls"`              | ⏳     |
-| 12  | `curl` vision `glm-4.6v` with base64 image                                | HTTP 200, image content described                    | ⏳     |
-| 13  | `curl` tool call when `thinking: { type: "enabled" }` is set on `glm-5.1` | HTTP 200, `reasoning_content` + `tool_calls`         | ⏳     |
-| 14  | `curl` tool-call follow-up turn (proves `clear_thinking`)                 | HTTP 200, prior `reasoning_content` is auto-stripped | ⏳     |
-| 15  | VS Code: vision (image attached) on `glm-4.6v`                            | Image content described in response                  | ⏳     |
+| 9   | `curl` tool call when `thinking: { type: "enabled" }` is set on `glm-5.1` | HTTP 200, `reasoning_content` + `tool_calls`         | ⏳     |
+| 10  | `curl` tool-call follow-up turn (proves `clear_thinking`)                 | HTTP 200, prior `reasoning_content` is auto-stripped | ⏳     |
 > **`glm-5v-turbo` fully validated** ✅ for VS Code custom-endpoint use: plain chat ✅, streaming ✅, tool calling ✅ (tested with `open_browser_page` opening Google), vision ✅ (accurately described a daily.dev screenshot including post titles, tags, sidebar navigation, browser tabs, and ad content).
 >
-> **Video input: GLM-5V-Turbo supports it natively, but VS Code's tool pipeline blocks it.** Z.ai's official docs state GLM-5V-Turbo's **Input Modality is "Video / Image / Text / File"**, and the Chat Completion API accepts **video** alongside images, audio, and files. There is even an official **"Video Object Tracking"** skill/example for `glm-5v-turbo`. However, VS Code's `view_image` tool only accepts static image formats (`png`, `jpg`, `jpeg`, `gif`, `webp`) and **rejects video files at the tool layer before they reach the model**. To test video input with GLM-5V-Turbo, use a direct API call (e.g., `curl`) with a public video URL in an `image_url` content part, or extract frames as images first (e.g., `ffmpeg -i video.mp4 -vframes 1 frame.png`). For a turnkey bridge that does this automatically inside VS Code, see [**Video Context MCP**](https://www.videocontextmcp.com/) — an MCP server that extracts frames from a video and routes them to GLM-4.6V (or one of four other providers in a fallback chain) so you can ask natural-language questions about any video. See [GLM-5V-Turbo docs](https://docs.z.ai/guides/vlm/glm-5v-turbo) for the official video input examples.
+> **Video input: GLM-5V-Turbo supports it natively, but VS Code's tool pipeline blocks it.** Z.ai's official docs state GLM-5V-Turbo's **Input Modality is "Video / Image / Text / File"**, and the Chat Completion API accepts **video** alongside images, audio, and files. There is even an official **"Video Object Tracking"** skill/example for `glm-5v-turbo`. However, VS Code's `view_image` tool only accepts static image formats (`png`, `jpg`, `jpeg`, `gif`, `webp`) and **rejects video files at the tool layer before they reach the model**. To test video input with GLM-5V-Turbo, use a direct API call (e.g., `curl`) with a public video URL in an `image_url` content part, or extract frames as images first (e.g., `ffmpeg -i video.mp4 -vframes 1 frame.png`). For a turnkey bridge that does this automatically inside VS Code, see [**Video Context MCP**](https://www.videocontextmcp.com/) — an MCP server that extracts frames from a video and routes them to a model provider (e.g. GLM 5V Turbo) in a fallback chain so you can ask natural-language questions about any video. See [GLM-5V-Turbo docs](https://docs.z.ai/guides/vlm/glm-5v-turbo) for the official video input examples.
 >
-> **`glm-5.1` partially validated** ✅ for text-only use: plain chat ✅, streaming ✅, tool calling ✅. The remaining `curl`-based checks and `glm-4.6v` vision tests are pending.
+> **`glm-5.1` partially validated** ✅ for text-only use: plain chat ✅, streaming ✅, tool calling ✅. The remaining `curl`-based checks are pending.
 ## Companion tools
-- [**Video Context MCP**](https://www.videocontextmcp.com/) — an MCP server that gives AI coding assistants (GitHub Copilot, Cursor, Claude Code) the ability to **understand video content** via natural language. Extracts frames from local or remote videos, routes them through a multi-provider fallback chain (**Gemini → GLM-4.6V-flash → Qwen 3.7 Plus → Kimi K2.6 → MiMo-V2.5**), and returns answers grounded in actual video frames. Also handles summarization, timestamp search, audio transcription with speaker diarization, and video metadata. Works around the limitation that VS Code's built-in `view_image` tool only accepts static images — so it lets `glm-5v-turbo`'s native video support actually be exercised end-to-end from inside VS Code.
+- [**Video Context MCP**](https://www.videocontextmcp.com/) — an MCP server that gives AI coding assistants (GitHub Copilot, Cursor, Claude Code) the ability to **understand video content** via natural language. Extracts frames from local or remote videos, routes them through a multi-provider fallback chain (**Gemini → GLM 5V Turbo → Qwen 3.7 Plus → Kimi K2.6 → MiMo-V2.5**), and returns answers grounded in actual video frames. Also handles summarization, timestamp search, audio transcription with speaker diarization, and video metadata. Works around the limitation that VS Code's built-in `view_image` tool only accepts static images — so it lets `glm-5v-turbo`'s native video support actually be exercised end-to-end from inside VS Code.
 ## References
@@ -352,10 +275,7 @@ This file is the **research record and the user-facing setup guide**. The implem
 - Z.ai pricing: `https://docs.z.ai/guides/overview/pricing`
 - Z.ai chat-completion reference: `https://docs.z.ai/api-reference/llm/chat-completion`
 - Z.ai thinking mode: `https://docs.z.ai/guides/capabilities/thinking-mode`
-- GLM-4.7: `https://docs.z.ai/guides/llm/glm-4.7`
-- GLM-4.6: `https://docs.z.ai/guides/llm/glm-4.6`
 - GLM-4.5: `https://docs.z.ai/guides/llm/glm-4.5`
-- GLM-4.6V (vision): `https://docs.z.ai/guides/vlm/glm-4.6v`
 - GLM-5V-Turbo (vision): `https://docs.z.ai/guides/vlm/glm-5v-turbo`
 - Z.ai Coding Plan overview: `https://docs.z.ai/devpack/overview`
 - Z.ai Coding Plan tool integration: `https://docs.z.ai/devpack/tool/others`

package/docs/pricing.md CHANGED Viewed

@@ -22,44 +22,43 @@ All prices below are in **USD per 1M tokens** (non-cached). To convert to AI cre
 These are the models available through GitHub Copilot's model roster as of June 1, 2026.
-| Model                 | Provider  | Tier        | Input (per 1M) | Cached input | Output (per 1M) | Context |
-| --------------------- | --------- | ----------- | -------------- | ------------ | --------------- | ------- |
-| **GPT-5.5**           | OpenAI    | Powerful    | $5.00          | $0.50        | $30.00          | —       |
-| **Claude Opus 4.8**   | Anthropic | Powerful    | $5.00          | $0.50        | $25.00          | 1M      |
-| **Claude Opus 4.7**   | Anthropic | Powerful    | $5.00          | $0.50        | $25.00          | 1M      |
-| **GPT-5.4**           | OpenAI    | Versatile   | $2.50          | $0.25        | $15.00          | —       |
-| **GPT-5.3-Codex**     | OpenAI    | Powerful    | $1.75          | $0.175       | $14.00          | —       |
-| **Claude Sonnet 4.6** | Anthropic | Versatile   | $3.00          | $0.30        | $15.00          | 1M      |
-| **Gemini 3.1 Pro**    | Google    | Powerful    | $2.00¹         | $0.20        | $12.00¹         | 1M      |
-| **Claude Haiku 4.5**  | Anthropic | Versatile   | $1.00          | $0.10        | $5.00           | 1M      |
-| **Gemini 3.5 Flash**  | Google    | Lightweight | $1.50          | $0.15        | $9.00           | 1M      |
-| **Gemini 2.5 Pro**    | Google    | Powerful    | $1.25¹         | $0.125       | $10.00¹         | 1M      |
-| **GPT-5.4 mini**      | OpenAI    | Lightweight | $0.75          | $0.075       | $4.50           | —       |
-| **Gemini 3 Flash**    | Google    | Lightweight | $0.50          | $0.05        | $3.00           | 1M      |
-| **Raptor mini**       | GitHub    | Versatile   | $0.25          | $0.025       | $2.00           | —       |
+| Model                 | Provider  | Tier        | Input (per 1M) | Cached input | Output (per 1M) | Context window |
+| --------------------- | --------- | ----------- | -------------- | ------------ | --------------- | -------------- |
+| **Raptor mini**       | GitHub    | Versatile   | $0.25          | $0.025       | $2.00           | 264K           |
+| **Gemini 3 Flash**    | Google    | Lightweight | $0.50          | $0.05        | $3.00           | 173K           |
+| **GPT-5.4 mini**      | OpenAI    | Lightweight | $0.75          | $0.075       | $4.50           | 400K           |
+| **Claude Haiku 4.5**  | Anthropic | Versatile   | $1.00          | $0.10        | $5.00           | 160K           |
+| **Gemini 2.5 Pro**    | Google    | Powerful    | $1.25¹         | $0.125       | $10.00¹         | 173K           |
+| **Gemini 3.5 Flash**  | Google    | Lightweight | $1.50          | $0.15        | $9.00           | 1M             |
+| **GPT-5.3-Codex**     | OpenAI    | Powerful    | $1.75          | $0.175       | $14.00          | 400K           |
+| **Gemini 3.1 Pro**    | Google    | Powerful    | $2.00¹         | $0.20        | $12.00¹         | 1M             |
+| **GPT-5.4**           | OpenAI    | Versatile   | $2.50          | $0.25        | $15.00          | 1M             |
+| **Claude Sonnet 4.6** | Anthropic | Versatile   | $3.00          | $0.30        | $15.00          | 1M             |
+| **Claude Opus 4.8**   | Anthropic | Powerful    | $5.00          | $0.50        | $25.00          | 1M             |
+| **Claude Opus 4.7**   | Anthropic | Powerful    | $5.00          | $0.50        | $25.00          | 1M             |
+| **GPT-5.5**           | OpenAI    | Powerful    | $5.00          | $0.50        | $30.00          | 1M             |
 ¹ Gemini 3.1 Pro and 2.5 Pro pricing applies to prompts ≤200K tokens.
 ## Custom-endpoint alternatives
-| Model                 | Provider  | Input (per 1M)                | Output (per 1M)                         | Context window |
-| --------------------- | --------- | ----------------------------- | --------------------------------------- | -------------- |
-| **DeepSeek V4 Flash** | DeepSeek  | $0.14                         | $0.28                                   | 1M             |
-| **MiMo V2 Flash** 🏆  | Xiaomi    | $0.10                         | $0.30                                   | 256K           |
-| **Kimi K2.6**         | Moonshot  | $0.16                         | $0.95 (non-thinking) / $4.00 (thinking) | 256K           |
-| **DeepSeek V4 Pro**   | DeepSeek  | $1.74                         | $3.48                                   | 1M             |
-| **MiMo V2.5**         | Xiaomi    | $0.40                         | $2.00                                   | 1M             |
-| **MiMo V2.5 Pro**     | Xiaomi    | $1.00                         | $3.00                                   | 1M             |
-| **Qwen 3.7 Plus**     | DashScope | $0.40 (≤256K) / $1.20 (>256K) | $1.60 (≤256K) / $4.80 (>256K)           | 1M             |
-| **Qwen 3.7 Max**      | DashScope | $2.50 (≤1M)                   | $7.50 (≤1M)                             | 1M             |
-| **MiniMax M3**        | MiniMax   | $0.60 (≤512K) / $1.20 (>512K) | $2.40 (≤512K) / $4.80 (>512K)           | 1M             |
-| **GLM 4.7 Flash**     | Z.ai      | Free (rate-limited ¹)         | Free (rate-limited ¹)                   | 200K           |
-| **GLM 5V Turbo**      | Z.ai      | $1.20                         | $4.00                                   | 200K           |
-| **GLM 5.1**           | Z.ai      | $1.40                         | $4.40                                   | 200K           |
+| Model                 | Provider  | Input (per 1M)                | Cached input                  | Output (per 1M)                         | Context window |
+| --------------------- | --------- | ----------------------------- | ----------------------------- | --------------------------------------- | -------------- |
+| **MiMo V2 Flash**     | Xiaomi    | $0.10                         | $0.01                         | $0.30                                   | 256K           |
+| **DeepSeek V4 Flash** | DeepSeek  | $0.14                         | $0.0028                       | $0.28                                   | 1M             |
+| **Kimi K2.6**         | Moonshot  | $0.16                         | —                             | $0.95 (non-thinking) / $4.00 (thinking) | 256K           |
+| **Qwen 3.7 Plus**     | DashScope | $0.40 (≤256K) / $1.20 (>256K) | —                             | $1.60 (≤256K) / $4.80 (>256K)           | 1M             |
+| **MiMo V2.5**         | Xiaomi    | $0.40                         | $0.08                         | $2.00                                   | 1M             |
+| **DeepSeek V4 Pro**   | DeepSeek  | $0.435                        | $0.003625                     | $0.87                                   | 1M             |
+| **MiniMax M3**        | MiniMax   | $0.60 (≤512K) / $1.20 (>512K) | $0.12 (≤512K) / $0.24 (>512K) | $2.40 (≤512K) / $4.80 (>512K)           | 1M             |
+| **MiMo V2.5 Pro**     | Xiaomi    | $1.00                         | $0.20                         | $3.00                                   | 1M             |
+| **GLM 5V Turbo**      | Z.ai      | $1.20                         | $0.24                         | $4.00                                   | 200K           |
+| **GLM 5.1**           | Z.ai      | $1.40                         | $0.26                         | $4.40                                   | 200K           |
+| **Qwen 3.7 Max**      | DashScope | $2.50 (≤1M)                   | —                             | $7.50 (≤1M)                             | 1M             |
 > **Notes:**
 >
-> - **DeepSeek V4** input pricing shown is the **cache miss** price. Cache hits are significantly cheaper ($0.0028/M for Flash, $0.0145/M for Pro).
+> - **DeepSeek V4** input pricing shown is the **cache miss** price. Cache hits are significantly cheaper ($0.0028/M for Flash, $0.003625/M for Pro).
 > - **MiMo** input pricing shown is the **cache miss** price. Cache hits are 5× cheaper for V2.5 Pro ($0.20/M) and V2.5 ($0.08/M), and 10× cheaper for V2 Flash ($0.01/M).
 > - **Gemini 3 Flash** is priced at $0.50/MTok input (text/image/video) and $1.00/MTok input for audio.
 > - **Anthropic (Claude)** models also have a cache write cost ($6.25/MTok for Opus, $3.75/MTok for Sonnet, $1.25/MTok for Haiku). Opus 4.7+ use a new tokenizer that may use up to 35% more tokens for the same text.
@@ -67,8 +66,8 @@ These are the models available through GitHub Copilot's model roster as of June
 > - **Qwen** models use **tiered pricing** — determined by total input tokens per request. Prices above are for non-thinking mode.
 > - **Kimi K2.6** pricing is from the **Moonshot platform** (direct). Via DashScope: $0.89 input / $3.71 output.
 > - **DashScope** offers a **free quota** of 1M input + 1M output tokens per model, valid for 90 days.
-> - **MiniMax M3** uses **tiered pricing** — input price doubles above 512K input tokens. A 7-day 50% off promotion is available for new accounts.
-> - **GLM** free-tier models (`glm-4.7-flash`) are aggressively rate-limited (HTTP `1302 / ChatRateLimited`), especially on context > 8K or with thinking enabled. Paid GLM models share a much larger concurrency pool.
+> - **MiniMax M3** uses **tiered pricing** — input price doubles above 512K input tokens. Cache hits are priced at 20% of the input rate ($0.12/M ≤512K, $0.24/M >512K). A 7-day 50% off promotion is available for new accounts.
+> - **GLM** models support prompt caching — cache hits are priced at $0.24/M for 5V Turbo and $0.26/M for 5.1.
 > - **MiMo** offers a **Token Plan** subscription model with discounted rates and a free cache-writing promotion.
 > - For typical Copilot chat usage (short-to-medium prompts), you'll almost always fall in the lowest pricing tier.
@@ -76,32 +75,32 @@ These are the models available through GitHub Copilot's model roster as of June
 For a typical coding session (~10K input + ~2K output tokens per turn, 50 turns):
-| Model                    | Estimated session cost | Copilot Pro+ credits |
-| ------------------------ | ---------------------- | -------------------- |
-| MiMo V2 Flash 🏆         | ~$0.08                 | —                    |
-| DeepSeek V4 Flash 🏆     | ~$0.10                 | —                    |
-| Kimi K2.6 (non-thinking) | ~$0.18                 | —                    |
-| MiMo V2.5                | ~$0.40                 | —                    |
-| Kimi K2.6 (thinking)     | ~$0.48                 | —                    |
-| Qwen 3.7 Plus            | ~$0.36                 | —                    |
-| Gemini 3 Flash           | ~$0.55                 | ~55                  |
-| MiniMax M3               | ~$0.54                 | —                    |
-| MiMo V2.5 Pro            | ~$0.80                 | —                    |
-| GLM 4.7 Flash (free)     | ~$0.00 ¹               | —                    |
-| GPT-5.4 mini             | ~$0.83                 | ~83                  |
-| Claude Haiku 4.5         | ~$1.00                 | ~100                 |
-| DeepSeek V4 Pro          | ~$1.22                 | —                    |
-| Qwen 3.7 Max             | ~$1.33                 | —                    |
-| Gemini 2.5 Pro           | ~$1.63                 | ~163                 |
-| Gemini 3.5 Flash         | ~$1.65                 | ~165                 |
-| Gemini 3.1 Pro           | ~$2.20                 | ~220                 |
-| GPT-5.3-Codex            | ~$2.28                 | ~228                 |
-| GPT-5.4                  | ~$2.75                 | ~275                 |
-| Claude Sonnet 4.6        | ~$3.00                 | ~300                 |
-| Claude Opus 4.8 / 4.7    | ~$5.00                 | ~500                 |
-| GPT-5.5                  | ~$5.50                 | ~550                 |
+| Model                    | Estimated session cost |
+| ------------------------ | ---------------------- |
+| MiMo V2 Flash            | ~$0.08                 |
+| DeepSeek V4 Flash        | ~$0.10                 |
+| Kimi K2.6 (non-thinking) | ~$0.18                 |
+| DeepSeek V4 Pro          | ~$0.30                 |
+| Raptor mini              | ~$0.33                 |
+| Qwen 3.7 Plus            | ~$0.36                 |
+| MiMo V2.5                | ~$0.40                 |
+| Kimi K2.6 (thinking)     | ~$0.48                 |
+| MiniMax M3               | ~$0.54                 |
+| Gemini 3 Flash           | ~$0.55                 |
+| MiMo V2.5 Pro            | ~$0.80                 |
+| GPT-5.4 mini             | ~$0.83                 |
+| Claude Haiku 4.5         | ~$1.00                 |
+| Qwen 3.7 Max             | ~$1.33                 |
+| Gemini 2.5 Pro           | ~$1.63                 |
+| Gemini 3.5 Flash         | ~$1.65                 |
+| Gemini 3.1 Pro           | ~$2.20                 |
+| GPT-5.3-Codex            | ~$2.28                 |
+| GPT-5.4                  | ~$2.75                 |
+| Claude Sonnet 4.6        | ~$3.00                 |
+| Claude Opus 4.8 / 4.7    | ~$5.00                 |
+| GPT-5.5                  | ~$5.50                 |
-> **How long does 7,000 credits last?** A Pro+ subscriber running 50-turn sessions could afford roughly **13 GPT-5.5 sessions**, **23 Opus sessions**, or **212 Raptor mini sessions** per month — or mix and match.
+> **How long does 7,000 credits last?** A Pro+ subscriber running 50-turn sessions could afford roughly **13 GPT-5.5 sessions**, **23 Opus sessions**, or **212 Raptor mini sessions** per month — or mix and match. (Multiply session cost by 100 to convert to AI credits.)
 > Prices last verified: June 1, 2026. Always check the official pages for the latest rates:
 >

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "copilot-custom-endpoint",
-  "version": "1.3.4",
+  "version": "1.3.6",
   "description": "Local proxies for VS Code Copilot custom endpoints — Kimi K2 & Qwen 3.x",
   "license": "MIT",
   "type": "module",
@@ -55,4 +55,4 @@
   "dependencies": {
     "dotenv": "^17.4.2"
   }
-}
+}