npm - copilot-custom-endpoint - Versions diffs - 1.2.4 → 1.3.1 - Mend

copilot-custom-endpoint 1.2.4 → 1.3.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

package/README.md CHANGED

@@ -1,6 +1,6 @@
 # GitHub Copilot Custom Endpoints
-> **TL;DR** — GitHub Copilot switched to usage-based billing on **June 1, 2026**. Every chat and agent session now burns AI credits — fast. This repo shows you how to plug **cheaper non-GitHub models** (DeepSeek, Kimi, Qwen, MiMo, MiniMax) into VS Code's Copilot chat — often **5–55× cheaper** than the built-ins — while keeping agent mode, tools, streaming, and vision.
+> **TL;DR** — GitHub Copilot switched to usage-based billing on **June 1, 2026**. Every chat and agent session now burns AI credits — fast. This repo shows you how to plug **cheaper non-GitHub models** (DeepSeek, Kimi, Qwen, MiMo, MiniMax, GLM) into VS Code's Copilot chat — often **5–55× cheaper** than the built-ins — while keeping agent mode, tools, streaming, and vision.
 ## What is this?
@@ -24,10 +24,13 @@ That's it. No code, no servers to manage (unless the model specifically needs th
 | **MiMo V2 Flash**           | Xiaomi    | No                     | ❌           | [Setup](docs/models/mimo.md)                                                                       |
 | **MiMo V2.5**               | Xiaomi    | No                     | ✅           | [Setup](docs/models/mimo.md)                                                                       |
 | **MiMo V2.5 Pro**           | Xiaomi    | No                     | ❌           | [Setup](docs/models/mimo.md)                                                                       |
-| **Kimi K2.6**               | Moonshot  | **Yes**                | ✅           | [Setup](docs/models/kimi-k2.6.md)                                                                  |
+| **Kimi K2.6**               | Moonshot  | **Yes**                | ✅           | [Setup](docs/models/kimi.md)                                                                       |
 | **Qwen 3.6 Plus**           | DashScope | Optional               | ✅           | [Setup](docs/models/qwen.md)                                                                       |
 | **Qwen 3.7 Max**            | DashScope | Optional               | ❌           | [Setup](docs/models/qwen.md)                                                                       |
 | **MiniMax M3**              | MiniMax   | No                     | ✅           | [Setup](docs/models/minimax.md)                                                                    |
+| **GLM 5.1**                 | Z.ai      | No                     | ❌           | [Setup](docs/models/glm.md)                                                                        |
+| **GLM 4.7 Flash (free)**    | Z.ai      | No                     | ❌           | [Setup](docs/models/glm.md)                                                                        |
+| **GLM 5V Turbo**            | Z.ai      | No                     | ✅           | [Setup](docs/models/glm.md)                                                                        |
 | **DeepSeek V4 Pro / Flash** | DeepSeek  | No (uses an extension) | ✅ via proxy | [Marketplace](https://marketplace.visualstudio.com/items?itemName=Vizards.deepseek-v4-for-copilot) |
 ## Setup
@@ -97,10 +100,27 @@ All prices are **USD per 1M tokens** (cache miss). 1 AI credit = $0.01.
 | **Qwen 3.6 Plus**            | $0.50 | $3.00  | 1M      |
 | **MiniMax M3**               | $0.60 | $2.40  | 1M      |
 | **MiMo V2.5 Pro**            | $1.00 | $3.00  | 1M      |
+| **GLM 5V Turbo**             | $1.20 | $4.00  | 200K    |
+| **GLM 5.1**                  | $1.40 | $4.40  | 200K    |
 | **Qwen 3.7 Max**             | $2.50 | $7.50  | 1M      |
 For the full pricing comparison (cached rates, full Copilot roster, footnotes, sources) see [docs/pricing.md](docs/pricing.md). For a copy-paste config containing **all providers at once**, see [docs/example-config.md](docs/example-config.md).
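As a rough illustration of the arithmetic behind the pricing table, the sketch below converts a token count into USD and AI credits. The prices are copied from the rows above; the function names and the token counts in the usage note are made up for the example, and cached-input discounts are ignored.

```python
# Illustrative only: (input, output) prices in USD per 1M tokens, cache miss,
# copied from the README table. Names below are hypothetical helpers.
PRICES = {
    "GLM 5V Turbo": (1.20, 4.00),
    "GLM 5.1": (1.40, 4.40),
    "Qwen 3.7 Max": (2.50, 7.50),
}
USD_PER_CREDIT = 0.01  # "1 AI credit = $0.01" per the README


def session_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one session in USD, ignoring cache hits."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def as_credits(usd: float) -> float:
    """Express a USD amount in AI credits."""
    return usd / USD_PER_CREDIT
```

For example, `session_cost_usd("GLM 5.1", 200_000, 20_000)` works out to about $0.37, or roughly 37 credits.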
+## Companion tools
+These work alongside the providers above and fill gaps that VS Code's built-in tool surface doesn't cover natively.
+### 🎬 [Video Context MCP](https://www.videocontextmcp.com/) — _video understanding for AI coding assistants_
+VS Code's built-in `view_image` tool only accepts **static images** (PNG, JPG, GIF, WebP). That's a hard wall if you want to ask an AI assistant about a screen recording, a screencast, a product demo, or any other video. Several vision-capable models in this repo actually accept video natively — but VS Code's tool pipeline never gets the chance to forward it.
+**Video Context MCP** is a small MCP server that bridges that gap. It works with **GitHub Copilot, Cursor, and Claude Code** out of the box, and:
+- **Extracts frames** from local files or remote URLs (no `ffmpeg` gymnastics required).
+- **Routes them through a multi-provider fallback chain** (`Gemini → GLM-4.6V → Qwen3.6 → Kimi K2.6 → MiMo-V2.5`), so a rate-limit hiccup on any single provider doesn't kill your session.
+- **Answers natural-language questions** about the video grounded in actual frames: "what does the speaker click in the last 30 seconds?", "summarize the demo", "find the frame where the error appears".
+- **Extras:** timestamp search, audio transcription with speaker diarization, and video metadata (resolution, duration, codec).
 ## Need help?
 - **Per-model issues:** check the troubleshooting section at the bottom of each model's doc.

package/docs/example-config.md ADDED

@@ -0,0 +1,203 @@
+# Full example config
+Here's a complete, real-world `chatLanguageModels.json` that combines **all the providers documented in this repo**. Copy what you need, leave the rest out.
+> **Note:** The `apiKey` fields are left as empty strings — set them via the **Chat: Manage Language Models** UI (Command Palette → right-click provider group → **Update API Key**). After you set a key via the UI, VS Code replaces the empty string with a `${input:chat.lm.secret.<id>}` secret reference.
+```json
+[
+  {
+    "name": "Qwen",
+    "vendor": "customendpoint",
+    "apiKey": "",
+    "apiType": "chat-completions",
+    "models": [
+      {
+        "id": "qwen3.7-max",
+        "name": "Qwen 3.7 Max",
+        "url": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions",
+        "toolCalling": true,
+        "vision": false,
+        "streaming": true,
+        "requestBody": {
+          "enable_thinking": false
+        }
+      },
+      {
+        "id": "qwen3.6-plus",
+        "name": "Qwen 3.6 Plus",
+        "url": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1/chat/completions",
+        "toolCalling": true,
+        "vision": true,
+        "streaming": true,
+        "requestBody": {
+          "enable_thinking": false
+        }
+      }
+    ]
+  },
+  {
+    "name": "Kimi",
+    "vendor": "customendpoint",
+    "apiKey": "",
+    "apiType": "chat-completions",
+    "models": [
+      {
+        "id": "kimi-k2.6",
+        "name": "Kimi K2.6",
+        "url": "http://127.0.0.1:3457/v1/chat/completions",
+        "requestBody": {
+          "temperature": 1
+        },
+        "toolCalling": true,
+        "vision": true,
+        "streaming": true,
+        "maxInputTokens": 262144,
+        "maxOutputTokens": 32768
+      }
+    ]
+  },
+  {
+    "name": "MiMo",
+    "vendor": "customendpoint",
+    "apiKey": "",
+    "apiType": "chat-completions",
+    "models": [
+      {
+        "id": "mimo-v2.5-pro",
+        "name": "MiMo V2.5 Pro",
+        "url": "https://api.xiaomimimo.com/v1/chat/completions",
+        "toolCalling": true,
+        "vision": false,
+        "streaming": true,
+        "maxInputTokens": 1048576,
+        "maxOutputTokens": 131072,
+        "requestBody": {
+          "thinking": { "type": "disabled" },
+          "temperature": 1,
+          "top_p": 0.95
+        }
+      },
+      {
+        "id": "mimo-v2.5",
+        "name": "MiMo V2.5",
+        "url": "https://api.xiaomimimo.com/v1/chat/completions",
+        "toolCalling": true,
+        "vision": true,
+        "streaming": true,
+        "maxInputTokens": 1048576,
+        "maxOutputTokens": 32768,
+        "requestBody": {
+          "thinking": { "type": "disabled" },
+          "temperature": 1,
+          "top_p": 0.95
+        }
+      },
+      {
+        "id": "mimo-v2-flash",
+        "name": "MiMo V2 Flash",
+        "url": "https://api.xiaomimimo.com/v1/chat/completions",
+        "toolCalling": true,
+        "vision": false,
+        "streaming": true,
+        "maxInputTokens": 262144,
+        "maxOutputTokens": 65536,
+        "requestBody": {
+          "thinking": { "type": "disabled" },
+          "temperature": 0.3,
+          "top_p": 0.95
+        }
+      }
+    ]
+  },
+  {
+    "name": "MiniMax",
+    "vendor": "customendpoint",
+    "apiKey": "",
+    "apiType": "chat-completions",
+    "models": [
+      {
+        "id": "MiniMax-M3",
+        "name": "MiniMax M3",
+        "url": "https://api.minimax.io/v1/chat/completions",
+        "toolCalling": true,
+        "vision": true,
+        "streaming": true,
+        "maxInputTokens": 1048576,
+        "maxOutputTokens": 131072,
+        "requestBody": {
+          "thinking": { "type": "adaptive" },
+          "reasoning_split": true,
+          "temperature": 1,
+          "top_p": 0.95
+        }
+      }
+    ]
+  },
+  {
+    "name": "GLM",
+    "vendor": "customendpoint",
+    "apiKey": "",
+    "apiType": "chat-completions",
+    "models": [
+      {
+        "id": "glm-5.1",
+        "name": "GLM 5.1 (flagship)",
+        "url": "https://api.z.ai/api/paas/v4/chat/completions",
+        "toolCalling": true,
+        "vision": false,
+        "streaming": true,
+        "maxInputTokens": 204800,
+        "maxOutputTokens": 131072,
+        "requestBody": {
+          "thinking": { "type": "enabled" },
+          "temperature": 1,
+          "top_p": 0.95
+        }
+      },
+      {
+        "id": "glm-4.7-flash",
+        "name": "GLM 4.7 Flash (free)",
+        "url": "https://api.z.ai/api/paas/v4/chat/completions",
+        "toolCalling": true,
+        "vision": false,
+        "streaming": true,
+        "maxInputTokens": 204800,
+        "maxOutputTokens": 131072,
+        "requestBody": {
+          "thinking": { "type": "enabled" },
+          "temperature": 1,
+          "top_p": 0.95
+        }
+      },
+      {
+        "id": "glm-5v-turbo",
+        "name": "GLM 5V Turbo (vision flagship)",
+        "url": "https://api.z.ai/api/paas/v4/chat/completions",
+        "toolCalling": true,
+        "vision": true,
+        "streaming": true,
+        "maxInputTokens": 204800,
+        "maxOutputTokens": 131072,
+        "requestBody": {
+          "thinking": { "type": "enabled" },
+          "temperature": 1,
+          "top_p": 0.95
+        }
+      }
+    ]
+  }
+]
+```
+## Per-model snippets
+If you only need one provider, jump straight to its setup guide:
+- [Kimi K2.6](kimi.md)
+- [Qwen 3.6 Plus / 3.7 Max](qwen.md)
+- [Xiaomi MiMo (V2.5 / V2.5 Pro / V2 Flash)](mimo.md)
+- [MiniMax M3](minimax.md)
+- [GLM (5.1 / 4.7 Flash / 5V Turbo)](glm.md)
+> **DeepSeek V4 Pro / V4 Flash** use the [DeepSeek V4 for Copilot Chat](https://marketplace.visualstudio.com/items?itemName=Vizards.deepseek-v4-for-copilot) extension — they don't appear in `chatLanguageModels.json`.
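Before restarting VS Code, it can help to sanity-check a hand-edited config. The sketch below is not an official schema check: the required keys are simply the ones every entry in the example above happens to have, and `check_config` is a hypothetical helper name.

```python
import json

# Keys present on every provider / model entry in the example config above.
# This is an assumption drawn from that example, not a documented schema.
REQUIRED_PROVIDER_KEYS = {"name", "vendor", "apiKey", "apiType", "models"}
REQUIRED_MODEL_KEYS = {"id", "name", "url"}
SECRET_PREFIX = "${input:chat.lm.secret."


def check_config(text: str) -> list:
    """Return a list of human-readable problems found in the config text."""
    providers = json.loads(text)
    if not isinstance(providers, list):
        return ["top level must be a JSON array"]
    problems = []
    for provider in providers:
        label = provider.get("name", "<unnamed>")
        missing = REQUIRED_PROVIDER_KEYS - provider.keys()
        if missing:
            problems.append(f"{label}: missing {sorted(missing)}")
        # The key should be empty or a secret reference, never a pasted key.
        key = provider.get("apiKey", "")
        if key and not key.startswith(SECRET_PREFIX):
            problems.append(f"{label}: apiKey looks like a literal secret")
        for model in provider.get("models", []):
            missing = REQUIRED_MODEL_KEYS - model.keys()
            if missing:
                model_id = model.get("id", "<no id>")
                problems.append(f"{label}/{model_id}: missing {sorted(missing)}")
    return problems
```

Reading the file and calling `check_config(text)` returns an empty list when nothing looks off; a JSON syntax error surfaces as the usual `json.JSONDecodeError`.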

package/docs/models/glm.md ADDED

@@ -0,0 +1,366 @@
+# GLM (Z.ai / Zhipu AI) — VS Code Custom Endpoint Setup Guide
+> **TL;DR:** GLM works directly with VS Code's custom-endpoint provider — **no proxy needed**. The API is OpenAI Chat Completions compatible at `https://api.z.ai/api/paas/v4/chat/completions`, and Z.ai's default `thinking.clear_thinking: true` quietly strips `reasoning_content` from prior turns, which makes multi-turn tool loops stable even when VS Code doesn't preserve reasoning blocks. The **GLM Coding Plan** endpoint is **not** usable here — it is locked to a curated list of coding tools (Claude Code, Cline, OpenCode, etc.).
+> ⚠️ **Free-tier rate-limit warning.** Z.ai's free models — `glm-4.7-flash` and `glm-4.6v-flash` — are aggressively throttled. In practice, expect to see HTTP `{"code":"1302","message":"Rate limit reached for requests"}` (surfaced in VS Code as `Reason: Rate limit exceeded` / `ChatRateLimited`) **on a significant fraction of requests**, especially:
+>
+> - During long chat sessions or after several tool turns (context > 8K is throttled to **1%** of the standard concurrency cap).
+> - When `thinking: { type: "enabled" }` is set — thinking tokens still hold the in-flight slot, so the model occupies the throttle window longer.
+> - During peak hours, when many users are sharing the same free pool.
+>
+> **This is the free tier behaving as designed, not a bug.** For uninterrupted work, use a paid model: `glm-4.6v-flashx` is the cheapest paid option ($0.04/$0.40 per 1M), `glm-4.7` is the best cost/quality balance ($0.60/$2.20 per 1M), and `glm-5.1` is the flagship ($1.40/$4.40 per 1M). See [Rate limits](#rate-limits) for the full breakdown.
+## At a Glance
+| Field                     | Value                                                              |
+| ------------------------- | ------------------------------------------------------------------ |
+| Mode                      | **Direct** (no proxy)                                              |
+| Vision                    | ✅ Yes (`glm-5v-turbo`; also the `glm-4.6v` family)                |
+| Tool calling              | ✅ Yes (native multimodal tool use on `glm-4.6v` / `glm-5v-turbo`) |
+| Context (flagship)        | 200K (`glm-5.1` / `glm-4.7` / `glm-4.7-flash` / `glm-5v-turbo`)    |
+| Max output (flagship)     | 128K                                                               |
+| Recommended `requestBody` | `thinking: { type: "enabled" }` (recommended)                      |
+| Endpoint (intl)           | `https://api.z.ai/api/paas/v4/chat/completions`                    |
+| Endpoint (China)          | `https://open.bigmodel.cn/api/paas/v4/chat/completions`            |
+| Auth                      | `Authorization: Bearer $ZAI_API_KEY`                               |
+### Models at a glance
+| Model             | Vision | Context | Max output | Thinking      | Cost (in / out per 1M) | Role                                                      |
+| ----------------- | ------ | ------- | ---------- | ------------- | ---------------------- | --------------------------------------------------------- |
+| `glm-5.1`         | ❌     | 200K    | 128K       | `enabled`     | $1.40 / $4.40          | Current flagship — long-horizon / 8h autonomous work      |
+| `glm-4.7`         | ❌     | 200K    | 128K       | `enabled`     | $0.60 / $2.20          | Flagship 4.x — strong coding/agent                        |
+| `glm-4.7-flash`   | ❌     | 200K    | 128K       | `enabled`     | Free ¹                 | **Free** — newest 4.x tier at no cost                     |
+| `glm-5v-turbo`    | ✅     | 200K    | 128K       | `enabled`     | $1.20 / $4.00          | Multimodal **coding** model — vision-based agentic coding |
+| `glm-4.6v`        | ✅     | 128K    | 32K        | hybrid (auto) | $0.30 / $0.90          | Vision + **native multimodal tool calls**                 |
+| `glm-4.6v-flashx` | ✅     | 128K    | 32K        | hybrid (auto) | $0.04 / $0.40          | Cheap vision                                              |
+| `glm-4.6v-flash`  | ✅     | 128K    | 32K        | hybrid (auto) | Free ¹                 | **Free** vision tier                                      |
+> ¹ **Free-tier caveat:** the two `*flash` free models are heavily rate-limited — see the [warning at the top of this document](#glm-zai--zhipu-ai--vs-code-custom-endpoint-setup-guide) and the [Rate limits](#rate-limits) section. Expect frequent HTTP `1302 / ChatRateLimited` errors, especially on context > 8K or with thinking enabled. For reliable use, prefer `glm-4.6v-flashx` (cheapest paid) or `glm-4.7` (best cost/quality).
+> Other GLM models — `glm-5`, `glm-5-turbo`, `glm-4.7`, `glm-4.6`, `glm-4.6v`, `glm-4.6v-flashx`, `glm-4.6v-flash`, `glm-4.5`, `glm-4.5-air`, `glm-4.5-flash`, `glm-4.5-x`, `glm-4.5-airx`, `glm-4-32b-0414-128k` — are callable on the same endpoint but are intentionally **not** added to the default `chatLanguageModels.json` block below. Add them in the same shape if you need them.
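To add one of the other models "in the same shape", a small generator can save some copy-paste. This is a sketch only: `glm_entry` is a hypothetical helper, and its defaults mirror the 200K / 128K entries in the default block. The `glm-4.6v` family would need its own limits (128K / 32K per the table) and probably a different `thinking` setting.

```python
import json

Z_AI_URL = "https://api.z.ai/api/paas/v4/chat/completions"


def glm_entry(model_id, display_name, *, vision=False,
              max_input=204800, max_output=131072):
    """Build one model entry shaped like those in the default GLM block."""
    return {
        "id": model_id,
        "name": display_name,
        "url": Z_AI_URL,
        "toolCalling": True,
        "vision": vision,
        "streaming": True,
        "maxInputTokens": max_input,
        "maxOutputTokens": max_output,
        "requestBody": {
            "thinking": {"type": "enabled"},
            "temperature": 1,
            "top_p": 0.95,
        },
    }


# Example: glm-4.7 (200K context, 128K output per the table above).
GLM_4_7_JSON = json.dumps(glm_entry("glm-4.7", "GLM 4.7"), indent=2)
```

Printing `GLM_4_7_JSON` gives a fragment that can be pasted into the `models` array of the GLM provider.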
+## Quick Start
+1. **Create a Z.ai account** at [z.ai/model-api](https://z.ai/model-api) and add credits on the [Billing page](https://z.ai/manage-apikey/billing).
+2. **Generate an API key** on the [API Keys page](https://z.ai/manage-apikey/apikey-list).
+3. **Edit `chatLanguageModels.json`** — add the GLM block from [Setup](#setup) below.
+4. **Set the API key** via Command Palette → **Chat: Manage Language Models** → right-click **GLM** → **Update API Key**.
+5. **Restart VS Code** and pick a GLM model from the picker.
+## Setup
+### 1. VS Code configuration
+Config file location:
+| OS      | Path                                                              |
+| ------- | ----------------------------------------------------------------- |
+| Windows | `%APPDATA%\Code\User\chatLanguageModels.json`                     |
+| macOS   | `~/Library/Application Support/Code/User/chatLanguageModels.json` |
+| Linux   | `~/.config/Code/User/chatLanguageModels.json`                     |
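If you script your setup, the table above translates into a small lookup. A minimal sketch, assuming the stable VS Code build (other builds may use a different folder name); `config_path` is a hypothetical helper, not something VS Code provides.

```python
from pathlib import PurePosixPath, PureWindowsPath

FILE_NAME = "chatLanguageModels.json"


def config_path(platform: str, env: dict, home: str) -> str:
    """Return the config path from the table above.

    `platform` uses sys.platform values ("win32", "darwin", "linux").
    `env` supplies APPDATA on Windows; `home` is the user's home directory.
    """
    if platform.startswith("win"):
        base = PureWindowsPath(env["APPDATA"])
        return str(base / "Code" / "User" / FILE_NAME)
    if platform == "darwin":
        base = PurePosixPath(home) / "Library" / "Application Support"
        return str(base / "Code" / "User" / FILE_NAME)
    return str(PurePosixPath(home) / ".config" / "Code" / "User" / FILE_NAME)
```

In a real script you would call it as `config_path(sys.platform, dict(os.environ), os.path.expanduser("~"))`.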
+<details>
+<summary><strong>GLM config — collapse for brevity</strong></summary>
+```json
+{
+  "name": "GLM",
+  "vendor": "customendpoint",
+  "apiKey": "",
+  "apiType": "chat-completions",
+  "models": [
+    {
+      "id": "glm-5.1",
+      "name": "GLM 5.1 (flagship)",
+      "url": "https://api.z.ai/api/paas/v4/chat/completions",
+      "toolCalling": true,
+      "vision": false,
+      "streaming": true,
+      "maxInputTokens": 204800,
+      "maxOutputTokens": 131072,
+      "requestBody": {
+        "thinking": { "type": "enabled" },
+        "temperature": 1,
+        "top_p": 0.95
+      }
+    },
+    {
+      "id": "glm-4.7-flash",
+      "name": "GLM 4.7 Flash (free)",
+      "url": "https://api.z.ai/api/paas/v4/chat/completions",
+      "toolCalling": true,
+      "vision": false,
+      "streaming": true,
+      "maxInputTokens": 204800,
+      "maxOutputTokens": 131072,
+      "requestBody": {
+        "thinking": { "type": "enabled" },
+        "temperature": 1,
+        "top_p": 0.95
+      }
+    },
+    {
+      "id": "glm-5v-turbo",
+      "name": "GLM 5V Turbo (vision flagship)",
+      "url": "https://api.z.ai/api/paas/v4/chat/completions",
+      "toolCalling": true,
+      "vision": true,
+      "streaming": true,
+      "maxInputTokens": 204800,
+      "maxOutputTokens": 131072,
+      "requestBody": {
+        "thinking": { "type": "enabled" },
+        "temperature": 1,
+        "top_p": 0.95
+      }
+    }
+  ]
+}
+```
+</details>
+> **Leave `apiKey` as `""`** — set it through the Language Models UI so VS Code stores it in the OS keychain (it will replace the empty string with a `${input:chat.lm.secret.<id>}` reference).
+### 2. API key
+1. Open the Command Palette (`Ctrl+Shift+P`, or `Cmd+Shift+P` on macOS).
+2. Run **Chat: Manage Language Models**.
+3. Find the **GLM** group → **Update API Key**.
+4. Paste your Z.ai API key.
+### 3. Regional endpoints
+| Region                            | Endpoint                                                | Notes                                                      |
+| --------------------------------- | ------------------------------------------------------- | ---------------------------------------------------------- |
+| **International** (default above) | `https://api.z.ai/api/paas/v4/chat/completions`         | [z.ai/model-api](https://z.ai/model-api) — USD billing     |
+| China                             | `https://open.bigmodel.cn/api/paas/v4/chat/completions` | [open.bigmodel.cn](https://open.bigmodel.cn) — CNY billing |
+> API keys are region-specific and **cannot** be used across regions. If you signed up on `bigmodel.cn` (China), swap the `url` values in the block above to the China endpoint — everything else is identical.
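Swapping the `url` values by hand is easy to get partly wrong when a provider has several models. The sketch below shows one way to do it programmatically; `use_china_endpoint` is a hypothetical helper, and it only rewrites entries that currently point at the international URL.

```python
import copy

INTL_URL = "https://api.z.ai/api/paas/v4/chat/completions"
CHINA_URL = "https://open.bigmodel.cn/api/paas/v4/chat/completions"


def use_china_endpoint(provider: dict) -> dict:
    """Return a copy of a provider block with Z.ai URLs pointed at bigmodel.cn."""
    out = copy.deepcopy(provider)  # leave the caller's config untouched
    for model in out.get("models", []):
        if model.get("url") == INTL_URL:
            model["url"] = CHINA_URL
    return out
```

Everything other than `url` is left as-is, matching the note above that the rest of the block is identical across regions.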
+## Configuration Reference
+### Sampling parameters
+| Parameter     | Range (hard cap) | Default                                       |
+| ------------- | ---------------- | --------------------------------------------- |
+| `temperature` | `[0.0, 1.0]`     | `1.0` for GLM-4.6 / 4.7 / 5.x · `0.6` for 4.5 |
+| `top_p`       | `[0.01, 1.0]`    | `0.95` for 4.5/4.6/4.7/5.x · `0.9` for 4-32B  |
+| `do_sample`   | bool             | `true` — set `false` to bypass sampling       |
+> **Important:** Z.ai's `temperature` is capped at `1.0` server-side. Sending a value like `1.2` will be rejected. VS Code's defaults (typically `0`–`1`) are within range, but the explicit `requestBody` values above are the recommended ones for coding/agent work.
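The ranges in the table can be checked before a value ever reaches the server. A minimal sketch, using only the bounds listed above; `check_sampling` is a hypothetical helper and the messages are illustrative, not Z.ai's actual error text.

```python
def check_sampling(request_body: dict) -> list:
    """Return problems with temperature / top_p in a requestBody dict."""
    problems = []
    temperature = request_body.get("temperature")
    if temperature is not None and not (0.0 <= temperature <= 1.0):
        problems.append(f"temperature {temperature} is outside [0.0, 1.0]")
    top_p = request_body.get("top_p")
    if top_p is not None and not (0.01 <= top_p <= 1.0):
        problems.append(f"top_p {top_p} is outside [0.01, 1.0]")
    return problems
```

The recommended `{"temperature": 1, "top_p": 0.95}` passes cleanly, while `{"temperature": 1.2}` is flagged.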
+### Thinking mode
+`thinking` is a GLM-specific object. It only applies to **GLM-4.5 and above**.
+| Field                     | Values                 | Default                                                       | Meaning                                                                                                   |
+| ------------------------- | ---------------------- | ------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
+| `thinking.type`           | `enabled` / `disabled` | `enabled` for 5.1/5/5-Turbo/4.7/4.5V; hybrid for 4.6/4.6V/4.5 | Force-enable or force-disable chain-of-thought. Hybrid lets the model decide.                             |
+| `thinking.clear_thinking` | bool                   | **`true`**                                                    | When `true`, Z.ai **strips historical `reasoning_content` from prior turns** before sending to the model. |
+> **Why the default is good for VS Code:** with `clear_thinking: true` (the server default), Z.ai doesn't require the client to forward `reasoning_content` between turns. VS Code's custom-endpoint provider doesn't preserve that field across tool turns — but for GLM it doesn't need to, because the server strips it. This avoids the same class of `reasoning_content` 400 errors that bite on MiMo.
+>
+> Set `clear_thinking: false` only if you are building a custom client that **does** forward `reasoning_content` verbatim. Do **not** set it from VS Code's `requestBody` — it would push GLM into a mode VS Code cannot satisfy.
+| Mode in VS Code     | Plain chat                       | Tool turns                                           |
+| ------------------- | -------------------------------- | ---------------------------------------------------- |
+| Recommended (above) | Thinking ON                      | Thinking ON, prior `reasoning_content` auto-stripped |
+| Faster/cheaper      | `thinking: { type: "disabled" }` | `thinking: { type: "disabled" }`                     |
+| Preserved Thinking  | not supported in VS Code         | do not enable `clear_thinking: false`                |
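Switching between the "Recommended" and "Faster/cheaper" rows amounts to changing one field. The sketch below derives a variant of a model entry; `with_thinking` is a hypothetical helper. It replaces the whole `thinking` object, so a stray `clear_thinking` key is dropped and the server default applies.

```python
import copy


def with_thinking(model: dict, enabled: bool) -> dict:
    """Return a copy of a model entry with thinking forced on or off."""
    out = copy.deepcopy(model)
    body = out.setdefault("requestBody", {})
    # Overwrite the object wholesale so clear_thinking is never carried over.
    body["thinking"] = {"type": "enabled" if enabled else "disabled"}
    return out
```

For example, `with_thinking(entry, False)` yields the `thinking: { type: "disabled" }` variant from the table while leaving `temperature` and `top_p` untouched.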
+### Capabilities
+- **OpenAI Chat Completions protocol** at `/api/paas/v4/chat/completions` — request and response shapes are standard OpenAI.
+- **Streaming** via SSE, terminated with `data: [DONE]`. (Same as OpenAI.)
+- **Tool calling** with the standard `tools` array. `tool_choice` accepts only `auto`.
+- **Max 128 functions** per request.
+- **Tool stream** (`tool_stream: true`) is supported on the `glm-4.6v` family and above for streaming tool-call deltas.
+- **Vision** on `glm-4.6v`, `glm-4.6v-flashx`, `glm-4.6v-flash`, and `glm-5v-turbo` using the OpenAI `image_url` content-part format. External URLs and base64 data URIs both work.
+- **Video input** on `glm-5v-turbo` — the model natively accepts video (Input Modality: **Video / Image / Text / File**). Use a public video URL in an `image_url` content part via direct API call; VS Code's chat UI does not currently forward video attachments to the model. For a turnkey VS Code integration that bridges the gap (extracts frames, routes them to GLM or a fallback provider, and answers natural-language questions about the video), see [**Video Context MCP**](https://www.videocontextmcp.com/) — an MCP server that gives Copilot/Cursor/Claude Code video understanding via the `glm-4.6v` provider and a multi-provider fallback chain (Gemini → GLM-4.6V → Qwen3.6 → Kimi K2.6 → MiMo-V2.5).
+- **Native multimodal tool calling** on `glm-4.6v` (and inherited by `glm-5v-turbo`) — images, screenshots, and document pages can be passed directly as tool parameters and tool results can be consumed visually.
+- **Built-in web search** is exposed as a tool type `web_search` (different from `function`).
+- **Context caching** is automatic — the API returns `usage.prompt_tokens_details.cached_tokens` on cache hits; cache writes are currently free of charge.
+- **Response shape extras:** `choices[].message.reasoning_content` (when thinking is on), `web_search[]` (when web search is used), `usage` with `cached_tokens`.
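To make the streaming bullet concrete, here is a sketch of consuming an SSE response line by line until `data: [DONE]`. It assumes streamed chunks follow the OpenAI `choices[].delta` shape and that thinking text arrives as `delta.reasoning_content`, mirroring the non-streaming `message.reasoning_content`. That second point is an assumption, not something confirmed above, and `collect_stream` is a hypothetical helper.

```python
import json


def collect_stream(lines):
    """Join streamed content (and reasoning, if present) from SSE lines."""
    content, reasoning = [], []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel, same as OpenAI
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                content.append(delta["content"])
            # Assumed field name for streamed thinking text.
            if delta.get("reasoning_content"):
                reasoning.append(delta["reasoning_content"])
    return "".join(content), "".join(reasoning)
```

The function takes any iterable of text lines, so it can be fed from an HTTP client's line iterator or from a saved transcript.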
+### Rate limits
+Z.ai throttles **by in-flight concurrency**, not classic RPM/TPM. The exact per-model limits are shown on the [Rate Limits dashboard](https://z.ai/manage-apikey/rate-limits) once you are signed in. **Pay-as-you-go models share a generous pool; free-tier models (`glm-4.7-flash`, `glm-4.6v-flash`) are on a separate, much tighter bucket.**
+#### What "rate limited" looks like in VS Code
+When a free-tier model is throttled, Z.ai returns:
+```json
+{ "code": "1302", "message": "Rate limit reached for requests" }
+```
+VS Code surfaces this as a chat-side error:
+```
+Sorry, your request failed. Please try again.
+Client Request Id: <uuid>
+Reason: Rate limit exceeded
+ChatRateLimited: Rate limit exceeded
+{"code":"1302","message":"Rate limit reached for requests"} ...
+```
+> **This is normal for free models.** It is not a configuration problem, a bad API key, or a VS Code bug — it is the free tier protecting itself from abuse. Wait ~30 seconds and retry, or switch to a paid model.
+#### Free-tier specifics
+| Constraint                                                                                    | Impact                                                                              |
+| --------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
+| Requests with **context > 8K tokens** are throttled to **1% of the standard concurrency cap** | Long chats and any session past a couple of tool turns fall into this bucket.       |
+| `thinking: { type: "enabled" }` keeps the in-flight slot held for the full reasoning duration | A thinking-on request counts as in-flight ~2–5× longer than a thinking-off request. |
+| Free quota is **per Z.ai account, shared across all free models**                             | Running `glm-4.7-flash` and `glm-4.6v-flash` concurrently drains the same bucket.   |
+| Peak-hour contention is significant                                                           | US/EU business hours see noticeably more `1302` errors than off-peak.               |
+#### Paid-tier specifics
+- Paid models (`glm-5.1`, `glm-4.7`, `glm-4.6v`, `glm-4.6v-flashx`, `glm-5v-turbo`) share a much larger concurrency pool sized to your prepaid balance.
+- `glm-4.6v-flashx` is the cheapest paid option ($0.04 input / $0.40 output per 1M) and is usually the right "I just want it to work" pick if you want to stay near-free in cost.
+- `glm-4.7` ($0.60 / $2.20 per 1M) is the recommended default for agent/coding work — strong quality at a low price.
+- `glm-5.1` ($1.40 / $4.40 per 1M) is the flagship and only worth it for long-horizon autonomous tasks.
+#### Reducing rate-limit pressure (still on the free tier)
+If you want to keep using `glm-4.7-flash` despite the limits:
+1. **Disable thinking** for tool-heavy sessions: set `thinking: { type: "disabled" }` in `requestBody`. Shorter responses free the slot faster.
+2. **Start new chats** instead of long-running ones — each new chat resets the per-conversation context length back under 8K.
+3. **Stagger agent runs** — don't run two Copilot agent sessions against the same Z.ai key in parallel; they share the same in-flight counter.
+4. **Retry with backoff**: a single `1302` is not a permanent block. It means capacity is full right now, so wait about 30 seconds and try again.
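For scripts that call Z.ai directly (VS Code itself does not expose a retry hook), the retry advice in the list above can be sketched as follows. The flat `{"code": "1302"}` body shape is taken from the error shown earlier; the doubling delay, the attempt count, and both function names are assumptions made for the example.

```python
import json
import time

RATE_LIMIT_CODE = "1302"


def is_rate_limited(body: str) -> bool:
    """True if the response body is Z.ai's 1302 rate-limit error."""
    try:
        data = json.loads(body)
    except ValueError:
        return False
    return isinstance(data, dict) and str(data.get("code")) == RATE_LIMIT_CODE


def call_with_backoff(send, *, attempts=4, first_delay=30.0, sleep=time.sleep):
    """Call send() until it stops returning a 1302 body.

    `send` is any zero-argument callable returning the response body as text.
    The delay starts at ~30s (per the advice above) and doubles each retry.
    """
    delay = first_delay
    for attempt in range(attempts):
        body = send()
        if not is_rate_limited(body):
            return body
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2
    raise RuntimeError(f"still rate limited after {attempts} attempts")
```

Passing `sleep` as a parameter keeps the function easy to test without actually waiting.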
+> The **GLM Coding Plan** has separate (much higher) concurrency limits but is **not available via custom endpoints** — see [Why the Coding Plan is not an option](#why-the-glm-coding-plan-is-not-an-option-for-vs-code) below.
+## Troubleshooting
+| Symptom                                                    | Likely cause                                                          | Fix                                                                                                        |
+| ---------------------------------------------------------- | --------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
+| Model not in picker                                        | Config not reloaded, or JSON syntax error                             | Restart VS Code; validate JSON                                                                             |
+| HTTP 400 on the first turn                                 | `requestBody` overrides `do_sample` or sets an invalid `temperature`  | Ensure `temperature ≤ 1.0` and `top_p ∈ [0.01, 1.0]`                                                       |
+| `invalid temperature: only values ≤ 1.0 are allowed`       | `temperature` set above `1.0`                                         | Lower it to `1.0` or below                                                                                 |
+| Tool call succeeds but follow-up turn degrades             | `clear_thinking: false` set in `requestBody`                          | Remove the `clear_thinking` key and let the server default to `true`                                       |
+| `tool_choice: required` rejected                           | GLM only supports `auto`                                              | Don't override `tool_choice` (VS Code's default is `auto`)                                                 |
+| `Failed to download multimodal content` on a vision call   | Z.ai's servers couldn't reach the image URL                           | Use a base64 `data:image/...` URI instead                                                                  |
+| 401 Unauthorized                                           | Region mismatch (international key used on China URL, or vice versa)  | Match your key to the regional endpoint                                                                    |
+| Upstream complains about `reasoning_content is missing`    | You set `clear_thinking: false` from a client that doesn't forward it | Drop `clear_thinking` from `requestBody`                                                                   |
+| 429 / "concurrency limit exceeded"                         | Too many in-flight requests                                           | Reduce concurrent agent sessions, or upgrade your Z.ai plan                                                |
+| `1302` / `ChatRateLimited` on a free-tier model (`*flash`) | Expected behavior — free tier is heavily throttled                    | Wait ~30s and retry, disable `thinking`, start a new chat, or switch to `glm-4.6v-flashx` / `glm-4.7`      |
+| Long Chinese responses when the prompt is English          | Missing `Accept-Language: en-US,en` (Z.ai default)                    | Optional — VS Code's custom-endpoint provider doesn't set custom headers; usually the prompt language wins |
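The `Failed to download multimodal content` row above suggests sending the image inline rather than by URL. A minimal sketch of building such a content part in Node.js, assuming the OpenAI-style `image_url` shape this guide describes. `toImagePart` is a hypothetical helper name, and the bytes used here are a stand-in rather than a real image:

```javascript
// Hypothetical helper: wraps raw image bytes in an OpenAI-style `image_url`
// content part that carries a base64 data URI instead of a remote URL.
function toImagePart(bytes, mimeType = "image/png") {
  const base64 = Buffer.from(bytes).toString("base64");
  return {
    type: "image_url",
    image_url: { url: `data:${mimeType};base64,${base64}` },
  };
}

// In real use the bytes would come from a file, e.g. fs.readFileSync("screenshot.png").
const placeholderBytes = Buffer.from("hi"); // stand-in, not a valid PNG
const imagePart = toImagePart(placeholderBytes);

// A user message that mixes a text part with the inline image part.
const message = {
  role: "user",
  content: [{ type: "text", text: "Describe this image." }, imagePart],
};
console.log(JSON.stringify(message, null, 2));
```

Because the image travels inside the request body, Z.ai's servers never need to fetch anything, which sidesteps the reachability problem at the cost of a larger payload.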
+## Pricing
+All prices are **USD per 1M tokens** (cache miss) on the Z.ai international platform. Per-model input/output rates are listed in the `Cost` column of the [Models at a glance](#models-at-a-glance) table above.
+> **Cache writes** are currently **Limited-time Free** for all models. Cached-input pricing is roughly 18% of the input price (e.g. `$0.60` input → `$0.11` cached for `glm-4.7`). China platform (`bigmodel.cn`) prices in CNY; see the [China pricing page](https://bigmodel.cn/pricing). For the cross-provider comparison, see [docs/pricing.md](../pricing.md).
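To make the per-1M-token arithmetic concrete, here is a rough cost estimator. The input and cached-input prices are the `glm-4.7` figures quoted above; the output price is a placeholder (an assumption for illustration, not a real Z.ai rate), and `estimateCostUsd` and the usage field names are hypothetical:

```javascript
// Prices are USD per 1M tokens. inputPerM and cachedPerM are the glm-4.7
// example quoted above; outputPerM is a PLACEHOLDER, not a published price.
const examplePrices = { inputPerM: 0.6, cachedPerM: 0.11, outputPerM: 2.0 };

// Hypothetical helper: sums the three token buckets at their own rates.
function estimateCostUsd(usage, prices) {
  const perToken = (perMillion) => perMillion / 1_000_000;
  return (
    usage.uncachedInputTokens * perToken(prices.inputPerM) +
    usage.cachedInputTokens * perToken(prices.cachedPerM) +
    usage.outputTokens * perToken(prices.outputPerM)
  );
}

const sessionUsage = {
  uncachedInputTokens: 1_000_000,
  cachedInputTokens: 500_000,
  outputTokens: 100_000,
};
const sessionCost = estimateCostUsd(sessionUsage, examplePrices);
const cachedRatio = examplePrices.cachedPerM / examplePrices.inputPerM;

console.log(`Estimated session cost: $${sessionCost.toFixed(4)}`);
console.log(`Cached input as a share of input price: ${(cachedRatio * 100).toFixed(1)}%`);
```

Swap in the real rates from the `Cost` column before relying on the result; agent sessions tend to be input-heavy, so the cached-input rate usually dominates the estimate.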
+---
+## Background & Findings
+> This appendix preserves the validation narrative for future reference. It is not required to use the model.
+### Why GLM was a reasonable candidate
+Z.ai publishes an OpenAI-Chat-Completions-compatible API at `https://api.z.ai/api/paas/v4/chat/completions` with:
+- Standard Bearer auth (`Authorization: Bearer <key>`).
+- Standard request/response shapes (`messages`, `tools`, `tool_calls`, `stream`).
+- Standard `model`, `temperature`, `top_p`, `max_tokens`, `response_format`.
+- A documented OpenAI SDK `base_url` of `https://api.z.ai/api/paas/v4/`.
+That makes VS Code's `chat-completions` provider the obvious starting point — same shape as DashScope, Moonshot, and MiMo, all of which already work in this repo.
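As an illustration of how standard the request shape is, the sketch below builds (but does not send) a Chat Completions request against the general endpoint. `buildChatRequest` is a hypothetical helper, the API key is a placeholder, and the choice of `glm-4.7` with `temperature: 1.0` is an assumption taken from the settings discussed in this guide:

```javascript
// Base URL of the general (pay-as-you-go) endpoint documented above.
const ZAI_BASE_URL = "https://api.z.ai/api/paas/v4";

// Hypothetical helper: returns the URL and fetch() options for one chat turn.
function buildChatRequest(apiKey, model, userText) {
  return {
    url: `${ZAI_BASE_URL}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // standard Bearer auth
      },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: userText }],
        temperature: 1.0, // must stay within [0.0, 1.0]
        stream: false,
      }),
    },
  };
}

const request = buildChatRequest("YOUR_ZAI_API_KEY", "glm-4.7", "Say hello in one word.");
console.log(request.url);
console.log(request.init.body);

// To actually send it on Node 18+ (requires a real key and network access):
// fetch(request.url, request.init).then((res) => res.json()).then(console.log);
```

Keeping construction separate from sending makes it easy to inspect exactly what would go over the wire before spending any balance.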
+### What differs from other providers in this repo
+| Concern                           | Z.ai / GLM behaviour                                                                                                                                                                                                             | Why it matters for VS Code                                                                                                                  |
+| --------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
+| Thinking default                  | Hybrid for `glm-4.6v` / `glm-4.6v-flashx` / `glm-4.6v-flash`; always-on for `glm-5.1` / `glm-4.7` / `glm-4.7-flash` / `glm-5v-turbo`.                                                                                            | VS Code can simply set `thinking: { type: "enabled" }` in `requestBody` to make thinking deterministic on every model.                      |
+| `reasoning_content` on tool turns | Z.ai defaults to `clear_thinking: true`, **silently stripping historical `reasoning_content`**.                                                                                                                                  | This is a near-perfect match for VS Code, which does **not** preserve `reasoning_content` between turns. Loops work without extra plumbing. |
+| `tool_choice`                     | Only `auto` is accepted.                                                                                                                                                                                                         | VS Code's default behaviour is `auto`, so no override needed.                                                                               |
+| `temperature` hard cap            | `[0.0, 1.0]` — strictly enforced server-side.                                                                                                                                                                                    | Use `1.0` for coding/agent work; never go above.                                                                                            |
+| `do_sample`                       | Default `true`. When `false`, `temperature` and `top_p` are ignored.                                                                                                                                                             | Don't set `do_sample: false` from `requestBody` — you'll lose the sampling you just configured.                                             |
+| Coding Plan endpoint              | A separate endpoint at `https://api.z.ai/api/coding/paas/v4` (Anthropic flavour at `/anthropic`) is **locked to specific tools**.                                                                                                | Cannot be used for VS Code custom endpoints — see below.                                                                                    |
+| Vision (image input + tool use)   | OpenAI `image_url` content-part format (external URLs and base64 data URIs both work). `glm-4.6v` introduced **native multimodal tool use** (images as tool args, tool results consumed visually); `glm-5v-turbo` inherits this. | Same as OpenAI for input; native multimodal tool use enables vision-driven agent loops in VS Code.                                          |
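The constraints in this table could be enforced mechanically before a request leaves the client. The following is a sketch of that idea only, not code from this repo: `sanitizeGlmRequestBody` is a hypothetical helper, and the `top_p` range of `[0.01, 1.0]` is taken from the troubleshooting table earlier in this guide.

```javascript
// Clamp a number into the inclusive range [min, max].
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

// Hypothetical helper: returns a copy of a request body adjusted to the
// GLM constraints described above. The input object is not mutated.
function sanitizeGlmRequestBody(body) {
  const out = { ...body };
  if (typeof out.temperature === "number") {
    out.temperature = clamp(out.temperature, 0, 1); // hard cap at 1.0
  }
  if (typeof out.top_p === "number") {
    out.top_p = clamp(out.top_p, 0.01, 1);
  }
  delete out.do_sample; // let the server default (true) apply
  delete out.clear_thinking; // let the server default (true) apply
  if ("tool_choice" in out && out.tool_choice !== "auto") {
    out.tool_choice = "auto"; // only `auto` is accepted
  }
  return out;
}

const risky = {
  model: "glm-4.7",
  temperature: 1.3,
  top_p: 0,
  do_sample: false,
  clear_thinking: false,
  tool_choice: "required",
};
const safe = sanitizeGlmRequestBody(risky);
console.log(safe);
```

A function like this would sit naturally inside a proxy if one were ever needed; with the direct path it mainly serves as a checklist of what not to put in `requestBody`.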
+### Why the GLM Coding Plan is **not** an option for VS Code
+The Coding Plan ($18–$160/mo) is a subscription that gates access to a small set of officially supported tools: **Claude Code, Cline, OpenCode, Kilo Code, Crush, Factory, and a handful of others**. The relevant callouts from Z.ai's own docs:
+- _"The GLM Coding Plan is strictly limited to use within officially supported tools and product environments; users may not use their subscription benefits for tools or scenarios outside of this scope."_
+- _"If the system detects usage through unauthorized or unsupported tools (such as SDK-based access or other third-party integrations), some subscription benefits may be restricted."_
+- The Coding endpoint base URL (`/api/coding/paas/v4`) is the only one that consumes the subscription quota — the general `/api/paas/v4` endpoint **always** charges against pay-as-you-go balance, even if you have an active Coding Plan.
+VS Code's custom-endpoint provider is not on the supported list, so the Coding Plan endpoint is the wrong target. The general `/api/paas/v4/chat/completions` endpoint is the correct one, and it bills against your prepaid Z.ai balance (or, for `glm-4.7-flash` / `glm-4.6v-flash`, is free).
+### Plan for this repo
+This file is the **research record and the user-facing setup guide**. The implementation work, when carried out, will be:
+1. **Add the `chatLanguageModels.json` block** shown above into [docs/example-config.md](../example-config.md) so users can copy all-providers-at-once.
+2. **Add a row to the model table** in [README.md](../../README.md) (and the per-model summary in [docs/competitors.md](../competitors.md) if relevant) pointing to `docs/models/glm.md`.
+3. **No new proxy.** The direct path is sufficient. If a future provider quirk surfaces (e.g. VS Code sending a `temperature` above the `1.0` cap, or a change in `clear_thinking` semantics), the existing `lib/create-proxy.mjs` factory makes it cheap to add a `proxy/glm-proxy.mjs` mirroring `proxy/qwen-proxy.mjs`.
+4. **No CLI / npm-script changes.** `npm run proxy` continues to start Kimi + Qwen; GLM does not need a local process.
+5. **No test changes.** Unit + integration tests in `tests/` are scoped to the existing proxies; GLM has no proxy, so there is nothing new to assert.
+6. ~~**Live validation pending.**~~ ✅ **Live validation complete for `glm-5v-turbo` and `glm-5.1` (text-only).** See [Validation results](#validation-results) for the full pass/fail table. The remaining `curl`-based checks (`glm-4.7-flash`, `glm-4.7`, `glm-4.6v`, `glm-5.1`) and the `glm-4.6v` vision test in VS Code are still pending.
+### Validation results
+#### VS Code live validation — `glm-5v-turbo` & `glm-5.1` (2026-06-04)
+##### `glm-5v-turbo` — full pass
+| #   | Check                                                                      | Expected                                            | Result             |
+| --- | -------------------------------------------------------------------------- | --------------------------------------------------- | ------------------ |
+| 1   | VS Code: `glm-5v-turbo` appears in picker                                  | Model selectable in Language Models picker          | ✅                 |
+| 2   | VS Code: plain chat, streaming on `glm-5v-turbo`                           | Streaming output visible                            | ✅                 |
+| 3   | VS Code: agent mode — tool calling (`open_browser_page`) on `glm-5v-turbo` | Multi-turn tool loop succeeds (Google opened)       | ✅                 |
+| 4   | VS Code: vision (image attached / screenshot) on `glm-5v-turbo`            | Image content described accurately (daily.dev page) | ✅                 |
+| 5   | Video input (local `.mp4` file)                                            | Rejected by VS Code `view_image` tool (images only) | ⚠️ Blocked by tool |
+##### `glm-5.1` — partial pass (text-only model)
+| #   | Check                                                                     | Expected                                             | Result |
+| --- | ------------------------------------------------------------------------- | ---------------------------------------------------- | ------ |
+| 6   | VS Code: `glm-5.1` appears in picker                                      | "Agent \| GLM 5.1 (flagship)"                        | ✅     |
+| 7   | VS Code: plain chat, streaming on `glm-5.1`                               | Streaming output visible                             | ✅     |
+| 8   | VS Code: agent mode — tool calling (browser open) on `glm-5.1`            | Multi-turn tool loop succeeds                        | ✅     |
+##### Pending checks (`curl` and `glm-4.6v`)
+| #   | Check                                                                     | Expected                                             | Result |
+| --- | ------------------------------------------------------------------------- | ---------------------------------------------------- | ------ |
+| 9   | `curl` non-streaming `glm-4.7-flash` against Z.ai                         | HTTP 200, assistant message in `content`             | ⏳     |
+| 10  | `curl` streaming `glm-4.7-flash`                                          | HTTP 200, SSE chunks with `data: [DONE]` terminator  | ⏳     |
+| 11  | `curl` tool call `glm-4.7` (function-call tool)                           | HTTP 200, `finish_reason: "tool_calls"`              | ⏳     |
+| 12  | `curl` vision `glm-4.6v` with base64 image                                | HTTP 200, image content described                    | ⏳     |
+| 13  | `curl` tool call when `thinking: { type: "enabled" }` is set on `glm-5.1` | HTTP 200, `reasoning_content` + `tool_calls`         | ⏳     |
+| 14  | `curl` tool-call follow-up turn (proves `clear_thinking`)                 | HTTP 200, prior `reasoning_content` is auto-stripped | ⏳     |
+| 15  | VS Code: vision (image attached) on `glm-4.6v`                            | Image content described in response                  | ⏳     |
+> **`glm-5v-turbo` fully validated** ✅ for VS Code custom-endpoint use: plain chat ✅, streaming ✅, tool calling ✅ (tested with `open_browser_page` opening Google), vision ✅ (accurately described a daily.dev screenshot including post titles, tags, sidebar navigation, browser tabs, and ad content).
+>
+> **Video input: GLM-5V-Turbo supports it natively, but VS Code's tool pipeline blocks it.** Z.ai's official docs state GLM-5V-Turbo's **Input Modality is "Video / Image / Text / File"**, and the Chat Completion API accepts **video** alongside images, audio, and files. There is even an official **"Video Object Tracking"** skill/example for `glm-5v-turbo`. However, VS Code's `view_image` tool only accepts static image formats (`png`, `jpg`, `jpeg`, `gif`, `webp`) and **rejects video files at the tool layer before they reach the model**. To test video input with GLM-5V-Turbo, use a direct API call (e.g., `curl`) with a public video URL in a `video_url` content part, or extract frames as images first (e.g., `ffmpeg -i video.mp4 -vframes 1 frame.png`). For a turnkey bridge that does this automatically inside VS Code, see [**Video Context MCP**](https://www.videocontextmcp.com/), an MCP server that extracts frames from a video and routes them to GLM-4.6V (or one of four other providers in a fallback chain) so you can ask natural-language questions about any video. See [GLM-5V-Turbo docs](https://docs.z.ai/guides/vlm/glm-5v-turbo) for the official video input examples.
+>
+> **`glm-5.1` partially validated** ✅ for text-only use: plain chat ✅, streaming ✅, tool calling ✅. The remaining `curl`-based checks and `glm-4.6v` vision tests are pending.
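The pending streaming check expects SSE chunks ending in a `data: [DONE]` terminator. The sketch below shows one way to verify that shape offline, assuming the OpenAI-style chunk layout (`choices[0].delta.content`) that this guide treats Z.ai as compatible with. `collectStreamedContent` is a hypothetical helper and the sample stream is hand-written, not captured from Z.ai:

```javascript
// Hypothetical helper: walks an SSE transcript, concatenates streamed text,
// and reports whether the `[DONE]` terminator was seen.
function collectStreamedContent(sseText) {
  let content = "";
  let done = false;
  for (const rawLine of sseText.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (!line.startsWith("data:")) continue; // skip blank lines and comments
    const payload = line.slice("data:".length).trim();
    if (payload === "[DONE]") {
      done = true;
      break;
    }
    const chunk = JSON.parse(payload);
    const delta = chunk.choices?.[0]?.delta ?? {};
    if (typeof delta.content === "string") content += delta.content;
  }
  return { content, done };
}

// Hand-written sample in the assumed chunk format (not real Z.ai output).
const sampleStream = [
  'data: {"choices":[{"delta":{"content":"Hel"}}]}',
  "",
  'data: {"choices":[{"delta":{"content":"lo"}}]}',
  "",
  "data: [DONE]",
  "",
].join("\n");

const streamResult = collectStreamedContent(sampleStream);
console.log(streamResult);
```

Piping a saved `curl` transcript through a function like this turns "SSE chunks with `data: [DONE]` terminator" into a yes/no result that can be recorded in the table.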
+## Companion tools
+- [**Video Context MCP**](https://www.videocontextmcp.com/) — an MCP server that gives AI coding assistants (GitHub Copilot, Cursor, Claude Code) the ability to **understand video content** via natural language. Extracts frames from local or remote videos, routes them through a multi-provider fallback chain (**Gemini → GLM-4.6V → Qwen3.6 → Kimi K2.6 → MiMo-V2.5**), and returns answers grounded in actual video frames. Also handles summarization, timestamp search, audio transcription with speaker diarization, and video metadata. Works around the limitation that VS Code's built-in `view_image` tool only accepts static images — so it lets `glm-5v-turbo`'s native video support actually be exercised end-to-end from inside VS Code.
+## References
+- Z.ai (international) model API: `https://z.ai/model-api`
+- Z.ai quick start: `https://docs.z.ai/guides/overview/quick-start`
+- Z.ai pricing: `https://docs.z.ai/guides/overview/pricing`
+- Z.ai chat-completion reference: `https://docs.z.ai/api-reference/llm/chat-completion`
+- Z.ai thinking mode: `https://docs.z.ai/guides/capabilities/thinking-mode`
+- GLM-4.7: `https://docs.z.ai/guides/llm/glm-4.7`
+- GLM-4.6: `https://docs.z.ai/guides/llm/glm-4.6`
+- GLM-4.5: `https://docs.z.ai/guides/llm/glm-4.5`
+- GLM-4.6V (vision): `https://docs.z.ai/guides/vlm/glm-4.6v`
+- GLM-5V-Turbo (vision): `https://docs.z.ai/guides/vlm/glm-5v-turbo`
+- Z.ai Coding Plan overview: `https://docs.z.ai/devpack/overview`
+- Z.ai Coding Plan tool integration: `https://docs.z.ai/devpack/tool/others`
+- Z.ai rate limits dashboard (signed-in): `https://z.ai/manage-apikey/rate-limits`
+- China (BigModel) quick start: `https://docs.bigmodel.cn/cn/guide/start/quick-start`
+- China (BigModel) pricing: `https://bigmodel.cn/pricing`
+- VS Code custom-endpoint docs: `https://code.visualstudio.com/docs/copilot/customization/language-models#_add-a-custom-endpoint-model`