copilot-custom-endpoint 1.3.4 → 1.3.6

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -29,7 +29,6 @@ That's it. No code, no servers to manage (unless the model specifically needs th
29
29
  | **Qwen 3.7 Max** | DashScope | Optional | ❌ | [Setup](docs/models/qwen.md) |
30
30
  | **MiniMax M3** | MiniMax | No | ✅ | [Setup](docs/models/minimax.md) |
31
31
  | **GLM 5.1** | Z.ai | No | ❌ | [Setup](docs/models/glm.md) |
32
- | **GLM 4.7 Flash (free)** | Z.ai | No | ❌ | [Setup](docs/models/glm.md) |
33
32
  | **GLM 5V Turbo** | Z.ai | No | ✅ | [Setup](docs/models/glm.md) |
34
33
  | **DeepSeek V4 Pro / Flash** | DeepSeek | No (uses an extension) | ✅ via proxy | [Marketplace](https://marketplace.visualstudio.com/items?itemName=Vizards.deepseek-v4-for-copilot) |
35
34
 
@@ -117,7 +116,7 @@ VS Code's built-in `view_image` tool only accepts **static images** (PNG, JPG, G
117
116
  **Video Context MCP** is a small MCP server that bridges that gap. It works with **GitHub Copilot, Cursor, and Claude Code** out of the box, and:
118
117
 
119
118
  - **Extracts frames** from local files or remote URLs (no `ffmpeg` gymnastics required).
120
- - **Routes them through a multi-provider fallback chain** — `Gemini → GLM-4.6V-flash → Qwen3.7-plus → Kimi K2.6 → MiMo-V2.5` — so a single `GLM 5V Turbo` rate-limit hiccup doesn't kill your session.
119
+ - **Routes them through a multi-provider fallback chain** — `Gemini → GLM 5V Turbo → Qwen3.7-plus → Kimi K2.6 → MiMo-V2.5` — so a single `GLM 5V Turbo` rate-limit hiccup doesn't kill your session.
121
120
  - **Answers natural-language questions** about the video grounded in actual frames: "what does the speaker click in the last 30 seconds?", "summarize the demo", "find the frame where the error appears".
122
121
  - **Extras:** timestamp search, audio transcription with speaker diarization, and video metadata (resolution, duration, codec).
123
122
 
@@ -155,21 +155,7 @@ Here's a complete, real-world `chatLanguageModels.json` that combines **all the
155
155
  "top_p": 0.95
156
156
  }
157
157
  },
158
- {
159
- "id": "glm-4.7-flash",
160
- "name": "GLM 4.7 Flash (free)",
161
- "url": "https://api.z.ai/api/paas/v4/chat/completions",
162
- "toolCalling": true,
163
- "vision": false,
164
- "streaming": true,
165
- "maxInputTokens": 204800,
166
- "maxOutputTokens": 131072,
167
- "requestBody": {
168
- "thinking": { "type": "enabled" },
169
- "temperature": 1,
170
- "top_p": 0.95
171
- }
172
- },
158
+
173
159
  {
174
160
  "id": "glm-5v-turbo",
175
161
  "name": "GLM 5V Turbo (vision flagship)",
@@ -198,6 +184,6 @@ If you only need one provider, jump straight to its setup guide:
198
184
  - [Qwen 3.7 Plus / 3.7 Max](qwen.md)
199
185
  - [Xiaomi MiMo (V2.5 / V2.5 Pro / V2 Flash)](mimo.md)
200
186
  - [MiniMax M3](minimax.md)
201
- - [GLM (5.1 / 4.7 Flash / 5V Turbo)](glm.md)
187
+ - [GLM (5.1 / 5V Turbo)](glm.md)
202
188
 
203
189
  > **DeepSeek V4 Pro / V4 Flash** use the [DeepSeek V4 for Copilot Chat](https://marketplace.visualstudio.com/items?itemName=Vizards.deepseek-v4-for-copilot) extension — they don't appear in `chatLanguageModels.json`.
@@ -2,42 +2,28 @@
2
2
 
3
3
  > **TL;DR:** GLM works directly with VS Code's custom-endpoint provider — **no proxy needed**. The API is OpenAI Chat Completions compatible at `https://api.z.ai/api/paas/v4/chat/completions`, and Z.ai's default `thinking.clear_thinking: true` quietly strips `reasoning_content` from prior turns, which makes multi-turn tool loops stable even when VS Code doesn't preserve reasoning blocks. The **GLM Coding Plan** endpoint is **not** usable here — it is locked to a curated list of coding tools (Claude Code, Cline, OpenCode, etc.).
4
4
 
5
- > ⚠️ **Free-tier rate-limit warning.** Z.ai's free models — `glm-4.7-flash` and `glm-4.6v-flash` — are aggressively throttled. In practice, expect to see HTTP `{"code":"1302","message":"Rate limit reached for requests"}` (surfaced in VS Code as `Reason: Rate limit exceeded` / `ChatRateLimited`) **on a significant fraction of requests**, especially:
6
- >
7
- > - During long chat sessions or after several tool turns (context > 8K is throttled to **1%** of the standard concurrency cap).
8
- > - When `thinking: { type: "enabled" }` is set — thinking tokens still hold the in-flight slot, so the model occupies the throttle window longer.
9
- > - During peak hours, when many users are sharing the same free pool.
10
- >
11
- > **This is the free tier behaving as designed, not a bug.** For uninterrupted work, use a paid model: `glm-4.6v` is the cheapest paid vision option ($0.30/$0.90 per 1M), `glm-4.7` is the best cost/quality balance for text-only ($0.60/$2.20 per 1M), and `glm-5.1` is the flagship ($1.40/$4.40 per 1M). See [Rate limits](#rate-limits) for the full breakdown.
12
-
13
5
  ## At a Glance
14
6
 
15
- | Field | Value |
16
- | ---------------------- | ------------------------------------------------------------------ |
17
- | Mode | **Direct** (no proxy) |
18
- | Vision | ✅ Yes (`glm-5v-turbo` only) |
19
- | Tool calling | ✅ Yes (native multimodal tool use on `glm-4.6v` / `glm-5v-turbo`) |
20
- | Context (flagship) | 200K (`glm-5.1` / `glm-4.7` / `glm-4.7-flash` / `glm-5v-turbo`) |
21
- | Max output (flagship) | 128K |
22
- | Required `requestBody` | `thinking: { type: "enabled" }` (recommended) |
23
- | Endpoint (intl) | `https://api.z.ai/api/paas/v4/chat/completions` |
24
- | Endpoint (China) | `https://open.bigmodel.cn/api/paas/v4/chat/completions` |
25
- | Auth | `Authorization: Bearer $ZAI_API_KEY` |
7
+ | Field | Value |
8
+ | ---------------------- | ------------------------------------------------------- |
9
+ | Mode | **Direct** (no proxy) |
10
+ | Vision | ✅ Yes (`glm-5v-turbo` only) |
11
+ | Tool calling | ✅ Yes (native multimodal tool use on `glm-5v-turbo`) |
12
+ | Context (flagship) | 200K (`glm-5.1` / `glm-5v-turbo`) |
13
+ | Max output (flagship) | 128K |
14
+ | Required `requestBody` | `thinking: { type: "enabled" }` (recommended) |
15
+ | Endpoint (intl) | `https://api.z.ai/api/paas/v4/chat/completions` |
16
+ | Endpoint (China) | `https://open.bigmodel.cn/api/paas/v4/chat/completions` |
17
+ | Auth | `Authorization: Bearer $ZAI_API_KEY` |
26
18
 
27
19
  ### Models at a glance
28
20
 
29
- | Model | Vision | Context | Max output | Thinking | Cost (in / out per 1M) | Role |
30
- | ---------------- | ------ | ------- | ---------- | ------------- | ---------------------- | --------------------------------------------------------- |
31
- | `glm-5.1` | ❌ | 200K | 128K | `enabled` | $1.40 / $4.40 | Current flagship — long-horizon / 8h autonomous work |
32
- | `glm-4.7` | | 200K | 128K | `enabled` | $0.60 / $2.20 | Flagship 4.xstrong coding/agent |
33
- | `glm-4.7-flash` | ❌ | 200K | 128K | `enabled` | Free ¹ | **Free** — newest 4.x tier at no cost |
34
- | `glm-5v-turbo` | ✅ | 200K | 128K | `enabled` | $1.20 / $4.00 | Multimodal **coding** model — vision-based agentic coding |
35
- | `glm-4.6v` | ✅ | 128K | 32K | hybrid (auto) | $0.30 / $0.90 | Vision + **native multimodal tool calls** |
36
- | `glm-4.6v-flash` | ✅ | 128K | 32K | hybrid (auto) | Free ¹ | **Free** vision tier |
21
+ | Model | Vision | Context | Max output | Thinking | Cost (in / out per 1M) | Role |
22
+ | -------------- | ------ | ------- | ---------- | --------- | ---------------------- | --------------------------------------------------------- |
23
+ | `glm-5.1` | ❌ | 200K | 128K | `enabled` | $1.40 / $4.40 | Current flagship — long-horizon / 8h autonomous work |
24
+ | `glm-5v-turbo` | | 200K | 128K | `enabled` | $1.20 / $4.00 | Multimodal **coding** model vision-based agentic coding |
37
25
 
38
- > ¹ **Free-tier caveat:** the two `*flash` free models are heavily rate-limitedsee the [warning at the top of this document](#glm-zai--zhipu-ai--vs-code-custom-endpoint-setup-guide) and the [Rate limits](#rate-limits) section. Expect frequent HTTP `1302 / ChatRateLimited` errors, especially on context > 8K or with thinking enabled. For reliable use, prefer `glm-4.6v` (cheapest paid vision) or `glm-4.7` (best cost/quality, text-only).
39
-
40
- > Other GLM models — `glm-5`, `glm-5-turbo`, `glm-4.7`, `glm-4.6`, `glm-4.6v`, `glm-4.6v-flashx`, `glm-4.6v-flash`, `glm-4.5`, `glm-4.5-air`, `glm-4.5-flash`, `glm-4.5-x`, `glm-4.5-airx`, `glm-4-32b-0414-128k` — are callable on the same endpoint but are intentionally **not** added to the default `chatLanguageModels.json` block below. Add them in the same shape if you need them. Note: `glm-4.6v-flashx` was previously in the default block but has been **removed** because live testing showed it is not reliable for tool calling.
26
+ > Other GLM models `glm-5`, `glm-5-turbo`, `glm-4.6v-flashx`, `glm-4.5`, `glm-4.5-air`, `glm-4.5-flash`, `glm-4.5-x`, `glm-4.5-airx`, `glm-4-32b-0414-128k` are callable on the same endpoint but are intentionally **not** added to the default `chatLanguageModels.json` block below. Add them in the same shape if you need them. Note: `glm-4.6v-flashx` was previously in the default block but has been **removed** because live testing showed it is not reliable for tool calling.
41
27
 
42
28
  ## Quick Start
43
29
 
@@ -84,21 +70,6 @@ Config file location:
84
70
  "top_p": 0.95
85
71
  }
86
72
  },
87
- {
88
- "id": "glm-4.7-flash",
89
- "name": "GLM 4.7 Flash (free)",
90
- "url": "https://api.z.ai/api/paas/v4/chat/completions",
91
- "toolCalling": true,
92
- "vision": false,
93
- "streaming": true,
94
- "maxInputTokens": 204800,
95
- "maxOutputTokens": 131072,
96
- "requestBody": {
97
- "thinking": { "type": "enabled" },
98
- "temperature": 1,
99
- "top_p": 0.95
100
- }
101
- },
102
73
  {
103
74
  "id": "glm-5v-turbo",
104
75
  "name": "GLM 5V Turbo (vision flagship)",
@@ -142,11 +113,11 @@ Config file location:
142
113
 
143
114
  ### Sampling parameters
144
115
 
145
- | Parameter | Range (hard cap) | Default |
146
- | ------------- | ---------------- | --------------------------------------------- |
147
- | `temperature` | `[0.0, 1.0]` | `1.0` for GLM-4.6 / 4.7 / 5.x · `0.6` for 4.5 |
148
- | `top_p` | `[0.01, 1.0]` | `0.95` for 4.5/4.6/4.7/5.x · `0.9` for 4-32B |
149
- | `do_sample` | bool | `true` — set `false` to bypass sampling |
116
+ | Parameter | Range (hard cap) | Default |
117
+ | ------------- | ---------------- | ---------------------------------------- |
118
+ | `temperature` | `[0.0, 1.0]` | `1.0` for GLM-4.6 / 5.x · `0.6` for 4.5 |
119
+ | `top_p` | `[0.01, 1.0]` | `0.95` for 4.5/4.6/5.x · `0.9` for 4-32B |
120
+ | `do_sample` | bool | `true` — set `false` to bypass sampling |
150
121
 
151
122
  > **Important:** Z.ai's `temperature` is capped at `1.0` server-side. Sending a value like `1.2` will be rejected. VS Code's defaults (typically `0`–`1`) are within range, but the explicit `requestBody` values above are the recommended ones for coding/agent work.
152
123
 
@@ -154,10 +125,10 @@ Config file location:
154
125
 
155
126
  `thinking` is a GLM-specific object. It only applies to **GLM-4.5 and above**.
156
127
 
157
- | Field | Values | Default | Meaning |
158
- | ------------------------- | ---------------------- | ------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
159
- | `thinking.type` | `enabled` / `disabled` | `enabled` for 5.1/5/5-Turbo/4.7/4.5V; hybrid for 4.6/4.6V/4.5 | Force-enable or force-disable chain-of-thought. Hybrid lets the model decide. |
160
- | `thinking.clear_thinking` | bool | **`true`** | When `true`, Z.ai **strips historical `reasoning_content` from prior turns** before sending to the model. |
128
+ | Field | Values | Default | Meaning |
129
+ | ------------------------- | ---------------------- | ---------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
130
+ | `thinking.type` | `enabled` / `disabled` | `enabled` for 5.1/5/5-Turbo/4.5V; hybrid for 4.6/4.5 | Force-enable or force-disable chain-of-thought. Hybrid lets the model decide. |
131
+ | `thinking.clear_thinking` | bool | **`true`** | When `true`, Z.ai **strips historical `reasoning_content` from prior turns** before sending to the model. |
161
132
 
162
133
  > **Why the default is good for VS Code:** with `clear_thinking: true` (the server default), Z.ai doesn't require the client to forward `reasoning_content` between turns. VS Code's custom-endpoint provider doesn't preserve that field across tool turns — but for GLM it doesn't need to, because the server strips it. This avoids the same class of `reasoning_content` 400 errors that bite on MiMo.
163
134
  >
@@ -175,86 +146,43 @@ Config file location:
175
146
  - **Streaming** via SSE, terminated with `data: [DONE]`. (Same as OpenAI.)
176
147
  - **Tool calling** with the standard `tools` array. `tool_choice` accepts only `auto`.
177
148
  - **Max 128 functions** per request.
178
- - **Tool stream** (`tool_stream: true`) is supported on the `glm-4.6v` family and above for streaming tool-call deltas.
179
- - **Vision** on `glm-4.6v`, `glm-4.6v-flash`, and `glm-5v-turbo` using the OpenAI `image_url` content-part format. External URLs and base64 data URIs both work.
180
- - **Video input** on `glm-5v-turbo` — the model natively accepts video (Input Modality: **Video / Image / Text / File**). Use a public video URL in an `image_url` content part via direct API call; VS Code's chat UI does not currently forward video attachments to the model. For a turnkey VS Code integration that bridges the gap (extracts frames, routes them to GLM or a fallback provider, and answers natural-language questions about the video), see [**Video Context MCP**](https://www.videocontextmcp.com/) — an MCP server that gives Copilot/Cursor/Claude Code video understanding via the `glm-4.6v` provider and a multi-provider fallback chain (Gemini → GLM-4.6V → Qwen 3.7 Plus → Kimi K2.6 → MiMo-V2.5).
181
- - **Native multimodal tool calling** on `glm-4.6v` (and inherited by `glm-5v-turbo`) — images, screenshots, and document pages can be passed directly as tool parameters and tool results can be consumed visually.
149
+ - **Tool stream** (`tool_stream: true`) is supported on the `glm-5v-turbo` family and above for streaming tool-call deltas.
150
+ - **Vision** on `glm-5v-turbo` using the OpenAI `image_url` content-part format. External URLs and base64 data URIs both work.
151
+ - **Video input** on `glm-5v-turbo` — the model natively accepts video (Input Modality: **Video / Image / Text / File**). Use a public video URL in an `image_url` content part via direct API call; VS Code's chat UI does not currently forward video attachments to the model. For a turnkey VS Code integration that bridges the gap (extracts frames, routes them to GLM or a fallback provider, and answers natural-language questions about the video), see [**Video Context MCP**](https://www.videocontextmcp.com/) — an MCP server that gives Copilot/Cursor/Claude Code video understanding via the `glm-5v-turbo` provider and a multi-provider fallback chain (Gemini → GLM 5V Turbo → Qwen 3.7 Plus → Kimi K2.6 → MiMo-V2.5).
152
+ - **Native multimodal tool calling** on `glm-5v-turbo` — images, screenshots, and document pages can be passed directly as tool parameters and tool results can be consumed visually.
182
153
  - **Built-in web search** is exposed as a tool type `web_search` (different from `function`).
183
154
  - **Context caching** is automatic — the API returns `usage.prompt_tokens_details.cached_tokens` on cache hits; cache writes are currently free of charge.
184
155
  - **Response shape extras:** `choices[].message.reasoning_content` (when thinking is on), `web_search[]` (when web search is used), `usage` with `cached_tokens`.
185
156
 
186
157
  ### Rate limits
187
158
 
188
- Z.ai throttles **by in-flight concurrency**, not classic RPM/TPM. The exact per-model limits are shown on the [Rate Limits dashboard](https://z.ai/manage-apikey/rate-limits) once you are signed in. **Pay-as-you-go models share a generous pool; free-tier models (`glm-4.7-flash`, `glm-4.6v-flash`) are on a separate, much tighter bucket.**
159
+ Z.ai throttles **by in-flight concurrency**, not classic RPM/TPM. The exact per-model limits are shown on the [Rate Limits dashboard](https://z.ai/manage-apikey/rate-limits) once you are signed in. **All paid models share a generous pool sized to your prepaid balance.**
189
160
 
190
- #### What "rate limited" looks like in VS Code
191
-
192
- When a free-tier model is throttled, Z.ai returns:
193
-
194
- ```json
195
- { "code": "1302", "message": "Rate limit reached for requests" }
196
- ```
197
-
198
- VS Code surfaces this as a chat-side error:
199
-
200
- ```
201
- Sorry, your request failed. Please try again.
202
- Client Request Id: <uuid>
203
- Reason: Rate limit exceeded
204
- ChatRateLimited: Rate limit exceeded
205
- {"code":"1302","message":"Rate limit reached for requests"} ...
206
- ```
207
-
208
- > **This is normal for free models.** It is not a configuration problem, a bad API key, or a VS Code bug — it is the free tier protecting itself from abuse. Wait ~30 seconds and retry, or switch to a paid model.
209
-
210
- #### Free-tier specifics
211
-
212
- | Constraint | Impact |
213
- | --------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------- |
214
- | Requests with **context > 8K tokens** are throttled to **1% of the standard concurrency cap** | Long chats and any session past a couple of tool turns fall into this bucket. |
215
- | `thinking: { type: "enabled" }` keeps the in-flight slot held for the full reasoning duration | A thinking-on request counts as in-flight ~2–5× longer than a thinking-off request. |
216
- | Free quota is **per Z.ai account, shared across all free models** | Running `glm-4.7-flash` and `glm-4.6v-flash` concurrently drains the same bucket. |
217
- | Peak-hour contention is significant | US/EU business hours see noticeably more `1302` errors than off-peak. |
218
-
219
- #### Paid-tier specifics
220
-
221
- - Paid models (`glm-5.1`, `glm-4.7`, `glm-4.6v`, `glm-5v-turbo`) share a much larger concurrency pool sized to your prepaid balance. (Note: `glm-4.6v-flashx` was previously listed here as the cheapest paid option, but it has been removed from the recommended set because live testing showed it is not reliable for tool calling.)
222
- - For the cheapest reliable paid option, use `glm-4.6v` ($0.30 / $0.90 per 1M) if you need vision, or `glm-4.7` ($0.60 / $2.20 per 1M) for text-only work.
223
- - `glm-4.7` ($0.60 / $2.20 per 1M) is the recommended default for agent/coding work — strong quality at a low price.
224
- - `glm-5.1` ($1.40 / $4.40 per 1M) is the flagship and only worth it for long-horizon autonomous tasks.
225
-
226
- #### Reducing rate-limit pressure (still on the free tier)
227
-
228
- If you want to keep using `glm-4.7-flash` despite the limits:
229
-
230
- 1. **Disable thinking** for tool-heavy sessions: set `thinking: { type: "disabled" }` in `requestBody`. Shorter responses free the slot faster.
231
- 2. **Start new chats** instead of long-running ones — each new chat resets the per-conversation context length back under 8K.
232
- 3. **Stagger agent runs** — don't run two Copilot agent sessions against the same Z.ai key in parallel; they share the same in-flight counter.
233
- 4. **Retry with backoff** in your headspace: one `1302` is not a permanent block; it's a "right now is full, try again in 30s".
161
+ For vision-capable work, use the `glm-5v-turbo` model ($1.20 / $4.00 per 1M). For text-only use, `glm-5.1` ($1.40 / $4.40 per 1M) is the flagship and recommended default for agent/coding work — strong quality for long-horizon autonomous tasks.
234
162
 
235
163
  > The **GLM Coding Plan** has separate (much higher) concurrency limits but is **not available via custom endpoints** — see [Why the Coding Plan is not an option](#why-the-glm-coding-plan-is-not-an-option-for-vs-code) below.
236
164
 
237
165
  ## Troubleshooting
238
166
 
239
- | Symptom | Likely cause | Fix |
240
- | ---------------------------------------------------------- | --------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- |
241
- | Model not in picker | Config not reloaded, or JSON syntax error | Restart VS Code; validate JSON |
242
- | HTTP 400 on the first turn | `requestBody` removed `do_sample` semantics or invalid `temperature` | Ensure `temperature ≤ 1.0` and `top_p ∈ [0.01, 1.0]` |
243
- | `invalid temperature: only values ≤ 1.0 are allowed` | Set `temperature > 1.0` | Lower it to `1.0` or below |
244
- | Tool call succeeds but follow-up turn degrades | `clear_thinking: false` set in `requestBody` | Remove the `clear_thinking` key and let the server default to `true` |
245
- | `tool_choice: required` rejected | GLM only supports `auto` | Don't override `tool_choice` (VS Code's default is `auto`) |
246
- | `Failed to download multimodal content` on a vision call | Z.ai's servers couldn't reach the image URL | Use a base64 `data:image/...` URI instead |
247
- | 401 Unauthorized | Region mismatch (international key used on China URL, or vice versa) | Match your key to the regional endpoint |
248
- | Upstream complains about `reasoning_content is missing` | You set `clear_thinking: false` from a client that doesn't forward it | Drop `clear_thinking` from `requestBody` |
249
- | 429 / "concurrency limit exceeded" | Too many in-flight requests | Reduce concurrent agent sessions, or upgrade your Z.ai plan |
250
- | `1302` / `ChatRateLimited` on a free-tier model (`*flash`) | Expected behavior — free tier is heavily throttled | Wait ~30s and retry, disable `thinking`, start a new chat, or switch to `glm-4.7` |
251
- | Long Chinese responses when the prompt is English | Missing `Accept-Language: en-US,en` (Z.ai default) | Optional — VS Code's custom-endpoint provider doesn't set custom headers; usually the prompt language wins |
167
+ | Symptom | Likely cause | Fix |
168
+ | -------------------------------------------------------- | --------------------------------------------------------------------- | -------------------------------------------------------------------- |
169
+ | Model not in picker | Config not reloaded, or JSON syntax error | Restart VS Code; validate JSON |
170
+ | HTTP 400 on the first turn | `requestBody` removed `do_sample` semantics or invalid `temperature` | Ensure `temperature ≤ 1.0` and `top_p ∈ [0.01, 1.0]` |
171
+ | `invalid temperature: only values ≤ 1.0 are allowed` | Set `temperature > 1.0` | Lower it to `1.0` or below |
172
+ | Tool call succeeds but follow-up turn degrades | `clear_thinking: false` set in `requestBody` | Remove the `clear_thinking` key and let the server default to `true` |
173
+ | `tool_choice: required` rejected | GLM only supports `auto` | Don't override `tool_choice` (VS Code's default is `auto`) |
174
+ | `Failed to download multimodal content` on a vision call | Z.ai's servers couldn't reach the image URL | Use a base64 `data:image/...` URI instead |
175
+ | 401 Unauthorized | Region mismatch (international key used on China URL, or vice versa) | Match your key to the regional endpoint |
176
+ | Upstream complains about `reasoning_content is missing` | You set `clear_thinking: false` from a client that doesn't forward it | Drop `clear_thinking` from `requestBody` |
177
+ | 429 / "concurrency limit exceeded" | Too many in-flight requests | Reduce concurrent agent sessions, or upgrade your Z.ai plan |
178
+
179
+ | Long Chinese responses when the prompt is English | Missing `Accept-Language: en-US,en` (Z.ai default) | Optional — VS Code's custom-endpoint provider doesn't set custom headers; usually the prompt language wins |
252
180
 
253
181
  ## Pricing
254
182
 
255
183
  All prices are **USD per 1M tokens** (cache miss) on the Z.ai international platform. Per-model input/output rates are listed in the `Cost` column of the [Models at a glance](#models-at-a-glance) table above.
256
184
 
257
- > **Cache writes** are currently **Limited-time Free** for all models. Cached-input pricing is roughly 18% of the input price (e.g. `$0.60` input → `$0.11` cached for `glm-4.7`). China platform (`bigmodel.cn`) prices in CNY; see the [China pricing page](https://bigmodel.cn/pricing). For the cross-provider comparison, see [docs/pricing.md](../pricing.md).
185
+ > **Cache writes** are currently **Limited-time Free** for all models. Cached-input pricing is roughly 18% of the input price (e.g. `$1.40` input → `$0.25` cached for `glm-5.1`). China platform (`bigmodel.cn`) prices in CNY; see the [China pricing page](https://bigmodel.cn/pricing). For the cross-provider comparison, see [docs/pricing.md](../pricing.md).
258
186
 
259
187
  ---
260
188
 
@@ -275,15 +203,15 @@ That makes VS Code's `chat-completions` provider the obvious starting point —
275
203
 
276
204
  ### What differs from other providers in this repo
277
205
 
278
- | Concern | Z.ai / GLM behaviour | Why it matters for VS Code |
279
- | --------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
280
- | Thinking default | Hybrid for `glm-4.6v` / `glm-4.6v-flash`; always-on for `glm-5.1` / `glm-4.7` / `glm-4.7-flash` / `glm-5v-turbo`. | VS Code can simply set `thinking: { type: "enabled" }` in `requestBody` to make thinking deterministic on every model. |
281
- | `reasoning_content` on tool turns | Z.ai defaults to `clear_thinking: true`, **silently stripping historical `reasoning_content`**. | This is a near-perfect match for VS Code, which does **not** preserve `reasoning_content` between turns. Loops work without extra plumbing. |
282
- | `tool_choice` | Only `auto` is accepted. | VS Code's default behaviour is `auto`, so no override needed. |
283
- | `temperature` hard cap | `[0.0, 1.0]` — strictly enforced server-side. | Use `1.0` for coding/agent work; never go above. |
284
- | `do_sample` | Default `true`. When `false`, `temperature` and `top_p` are ignored. | Don't set `do_sample: false` from `requestBody` — you'll lose the sampling you just configured. |
285
- | Coding Plan endpoint | A separate endpoint at `https://api.z.ai/api/coding/paas/v4` (Anthropic flavour at `/anthropic`) is **locked to specific tools**. | Cannot be used for VS Code custom endpoints — see below. |
286
- | Vision (image input + tool use) | OpenAI `image_url` content-part format (external URLs and base64 data URIs both work). `glm-4.6v` introduced **native multimodal tool use** (images as tool args, tool results consumed visually); `glm-5v-turbo` inherits this. | Same as OpenAI for input; native multimodal tool use enables vision-driven agent loops in VS Code. |
206
+ | Concern | Z.ai / GLM behaviour | Why it matters for VS Code |
207
+ | --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
208
+ | Thinking default | Always-on for `glm-5.1` / `glm-5v-turbo`. | VS Code can simply set `thinking: { type: "enabled" }` in `requestBody` to make thinking deterministic on every model. |
209
+ | `reasoning_content` on tool turns | Z.ai defaults to `clear_thinking: true`, **silently stripping historical `reasoning_content`**. | This is a near-perfect match for VS Code, which does **not** preserve `reasoning_content` between turns. Loops work without extra plumbing. |
210
+ | `tool_choice` | Only `auto` is accepted. | VS Code's default behaviour is `auto`, so no override needed. |
211
+ | `temperature` hard cap | `[0.0, 1.0]` — strictly enforced server-side. | Use `1.0` for coding/agent work; never go above. |
212
+ | `do_sample` | Default `true`. When `false`, `temperature` and `top_p` are ignored. | Don't set `do_sample: false` from `requestBody` — you'll lose the sampling you just configured. |
213
+ | Coding Plan endpoint | A separate endpoint at `https://api.z.ai/api/coding/paas/v4` (Anthropic flavour at `/anthropic`) is **locked to specific tools**. | Cannot be used for VS Code custom endpoints — see below. |
214
+ | Vision (image input + tool use) | OpenAI `image_url` content-part format (external URLs and base64 data URIs both work). `glm-5v-turbo` supports **native multimodal tool use** (images as tool args, tool results consumed visually). | Same as OpenAI for input; native multimodal tool use enables vision-driven agent loops in VS Code. |
287
215
 
288
216
  ### Why the GLM Coding Plan is **not** an option for VS Code
289
217
 
@@ -293,7 +221,7 @@ The Coding Plan ($18–$160/mo) is a subscription that gates access to a small s
293
221
  - _"If the system detects usage through unauthorized or unsupported tools (such as SDK-based access or other third-party integrations), some subscription benefits may be restricted."_
294
222
  - The Coding endpoint base URL (`/api/coding/paas/v4`) is the only one that consumes the subscription quota — the general `/api/paas/v4` endpoint **always** charges against pay-as-you-go balance, even if you have an active Coding Plan.
295
223
 
296
- VS Code's custom-endpoint provider is not on the supported list, so the Coding Plan endpoint is the wrong target. The general `/api/paas/v4/chat/completions` endpoint is the correct one, and it bills against your prepaid Z.ai balance (or, for `glm-4.7-flash` / `glm-4.6v-flash`, is free).
224
+ VS Code's custom-endpoint provider is not on the supported list, so the Coding Plan endpoint is the wrong target. The general `/api/paas/v4/chat/completions` endpoint is the correct one, and it bills against your prepaid Z.ai balance.
297
225
 
298
226
  ### Plan for this repo
299
227
 
@@ -304,7 +232,7 @@ This file is the **research record and the user-facing setup guide**. The implem
304
232
  3. **No new proxy.** The direct path is sufficient. If a future provider quirk surfaces (e.g. a sampling cap that VS Code sends above `1.0`, or a `clear_thinking` semantics change), the existing `lib/create-proxy.mjs` factory makes it cheap to add a `proxy/glm-proxy.mjs` mirroring `proxy/qwen-proxy.mjs`.
305
233
  4. **No CLI / npm-script changes.** `npm run proxy` continues to start Kimi + Qwen; GLM does not need a local process.
306
234
  5. **No test changes.** Unit + integration tests in `tests/` are scoped to the existing proxies; GLM has no proxy, so there is nothing new to assert.
307
- 6. ~~**Live validation pending.**~~ ✅ **Live validation complete for `glm-5v-turbo` and `glm-5.1` (text-only).** See [Validation results](#validation-results) for the full pass/fail table. Remaining `curl`-based checks (`glm-4.7-flash`, `glm-4.7`, `glm-4.6v`) and `glm-4.6v` vision test are still pending.
235
+ 6. ~~**Live validation pending.**~~ ✅ **Live validation complete for `glm-5v-turbo` and `glm-5.1` (text-only).** See [Validation results](#validation-results) for the full pass/fail table.
308
236
 
309
237
  ### Validation results
310
238
 
@@ -327,23 +255,18 @@ This file is the **research record and the user-facing setup guide**. The implem
327
255
  | 6 | VS Code: `glm-5.1` appears in picker | "Agent \| GLM 5.1 (flagship)" | ✅ |
328
256
  | 7 | VS Code: plain chat, streaming chat on `glm-5.1` | Streaming output visible | ✅ |
329
257
  | 8 | VS Code: agent mode — tool calling (browser open) on `glm-5.1` | Multi-turn tool loop succeeds | ✅ |
330
- | 9 | `curl` non-streaming `glm-4.7-flash` against Z.ai | HTTP 200, assistant message in `content` | ⏳ |
331
- | 10 | `curl` streaming `glm-4.7-flash` | HTTP 200, SSE chunks with `data: [DONE]` terminator | ⏳ |
332
- | 11 | `curl` tool call `glm-4.7` (function-call tool) | HTTP 200, `finish_reason: "tool_calls"` | ⏳ |
333
- | 12 | `curl` vision `glm-4.6v` with base64 image | HTTP 200, image content described | ⏳ |
334
- | 13 | `curl` tool call when `thinking: { type: "enabled" }` is set on `glm-5.1` | HTTP 200, `reasoning_content` + `tool_calls` | ⏳ |
335
- | 14 | `curl` tool-call follow-up turn (proves `clear_thinking`) | HTTP 200, prior `reasoning_content` is auto-stripped | ⏳ |
336
- | 15 | VS Code: vision (image attached) on `glm-4.6v` | Image content described in response | ⏳ |
258
+ | 9 | `curl` tool call when `thinking: { type: "enabled" }` is set on `glm-5.1` | HTTP 200, `reasoning_content` + `tool_calls` | ⏳ |
259
+ | 10 | `curl` tool-call follow-up turn (proves `clear_thinking`) | HTTP 200, prior `reasoning_content` is auto-stripped | ⏳ |
337
260
 
338
261
  > **`glm-5v-turbo` fully validated** ✅ for VS Code custom-endpoint use: plain chat ✅, streaming ✅, tool calling ✅ (tested with `open_browser_page` opening Google), vision ✅ (accurately described a daily.dev screenshot including post titles, tags, sidebar navigation, browser tabs, and ad content).
339
262
  >
340
- > **Video input: GLM-5V-Turbo supports it natively, but VS Code's tool pipeline blocks it.** Z.ai's official docs state GLM-5V-Turbo's **Input Modality is "Video / Image / Text / File"**, and the Chat Completion API accepts **video** alongside images, audio, and files. There is even an official **"Video Object Tracking"** skill/example for `glm-5v-turbo`. However, VS Code's `view_image` tool only accepts static image formats (`png`, `jpg`, `jpeg`, `gif`, `webp`) and **rejects video files at the tool layer before they reach the model**. To test video input with GLM-5V-Turbo, use a direct API call (e.g., `curl`) with a public video URL in an `image_url` content part, or extract frames as images first (e.g., `ffmpeg -i video.mp4 -vframes 1 frame.png`). For a turnkey bridge that does this automatically inside VS Code, see [**Video Context MCP**](https://www.videocontextmcp.com/) — an MCP server that extracts frames from a video and routes them to GLM-4.6V (or one of four other providers in a fallback chain) so you can ask natural-language questions about any video. See [GLM-5V-Turbo docs](https://docs.z.ai/guides/vlm/glm-5v-turbo) for the official video input examples.
263
+ > **Video input: GLM-5V-Turbo supports it natively, but VS Code's tool pipeline blocks it.** Z.ai's official docs state GLM-5V-Turbo's **Input Modality is "Video / Image / Text / File"**, and the Chat Completion API accepts **video** alongside images, audio, and files. There is even an official **"Video Object Tracking"** skill/example for `glm-5v-turbo`. However, VS Code's `view_image` tool only accepts static image formats (`png`, `jpg`, `jpeg`, `gif`, `webp`) and **rejects video files at the tool layer before they reach the model**. To test video input with GLM-5V-Turbo, use a direct API call (e.g., `curl`) with a public video URL in an `image_url` content part, or extract frames as images first (e.g., `ffmpeg -i video.mp4 -vframes 1 frame.png`). For a turnkey bridge that does this automatically inside VS Code, see [**Video Context MCP**](https://www.videocontextmcp.com/) — an MCP server that extracts frames from a video and routes them to a model provider (e.g. GLM 5V Turbo) in a fallback chain so you can ask natural-language questions about any video. See [GLM-5V-Turbo docs](https://docs.z.ai/guides/vlm/glm-5v-turbo) for the official video input examples.
341
264
  >
342
- > **`glm-5.1` partially validated** ✅ for text-only use: plain chat ✅, streaming ✅, tool calling ✅. The remaining `curl`-based checks and `glm-4.6v` vision tests are pending.
265
+ > **`glm-5.1` partially validated** ✅ for text-only use: plain chat ✅, streaming ✅, tool calling ✅. The remaining `curl`-based checks are pending.
343
266
 
344
267
  ## Companion tools
345
268
 
346
- - [**Video Context MCP**](https://www.videocontextmcp.com/) — an MCP server that gives AI coding assistants (GitHub Copilot, Cursor, Claude Code) the ability to **understand video content** via natural language. Extracts frames from local or remote videos, routes them through a multi-provider fallback chain (**Gemini → GLM-4.6V-flash → Qwen 3.7 Plus → Kimi K2.6 → MiMo-V2.5**), and returns answers grounded in actual video frames. Also handles summarization, timestamp search, audio transcription with speaker diarization, and video metadata. Works around the limitation that VS Code's built-in `view_image` tool only accepts static images — so it lets `glm-5v-turbo`'s native video support actually be exercised end-to-end from inside VS Code.
269
+ - [**Video Context MCP**](https://www.videocontextmcp.com/) — an MCP server that gives AI coding assistants (GitHub Copilot, Cursor, Claude Code) the ability to **understand video content** via natural language. Extracts frames from local or remote videos, routes them through a multi-provider fallback chain (**Gemini → GLM 5V Turbo → Qwen 3.7 Plus → Kimi K2.6 → MiMo-V2.5**), and returns answers grounded in actual video frames. Also handles summarization, timestamp search, audio transcription with speaker diarization, and video metadata. Works around the limitation that VS Code's built-in `view_image` tool only accepts static images — so it lets `glm-5v-turbo`'s native video support actually be exercised end-to-end from inside VS Code.
347
270
 
348
271
  ## References
349
272
 
@@ -352,10 +275,7 @@ This file is the **research record and the user-facing setup guide**. The implem
352
275
  - Z.ai pricing: `https://docs.z.ai/guides/overview/pricing`
353
276
  - Z.ai chat-completion reference: `https://docs.z.ai/api-reference/llm/chat-completion`
354
277
  - Z.ai thinking mode: `https://docs.z.ai/guides/capabilities/thinking-mode`
355
- - GLM-4.7: `https://docs.z.ai/guides/llm/glm-4.7`
356
- - GLM-4.6: `https://docs.z.ai/guides/llm/glm-4.6`
357
278
  - GLM-4.5: `https://docs.z.ai/guides/llm/glm-4.5`
358
- - GLM-4.6V (vision): `https://docs.z.ai/guides/vlm/glm-4.6v`
359
279
  - GLM-5V-Turbo (vision): `https://docs.z.ai/guides/vlm/glm-5v-turbo`
360
280
  - Z.ai Coding Plan overview: `https://docs.z.ai/devpack/overview`
361
281
  - Z.ai Coding Plan tool integration: `https://docs.z.ai/devpack/tool/others`
package/docs/pricing.md CHANGED
@@ -22,44 +22,43 @@ All prices below are in **USD per 1M tokens** (non-cached). To convert to AI cre
22
22
 
23
23
  These are the models available through GitHub Copilot's model roster as of June 1, 2026.
24
24
 
25
- | Model | Provider | Tier | Input (per 1M) | Cached input | Output (per 1M) | Context |
26
- | --------------------- | --------- | ----------- | -------------- | ------------ | --------------- | ------- |
27
- | **GPT-5.5** | OpenAI | Powerful | $5.00 | $0.50 | $30.00 | |
28
- | **Claude Opus 4.8** | Anthropic | Powerful | $5.00 | $0.50 | $25.00 | 1M |
29
- | **Claude Opus 4.7** | Anthropic | Powerful | $5.00 | $0.50 | $25.00 | 1M |
30
- | **GPT-5.4** | OpenAI | Versatile | $2.50 | $0.25 | $15.00 | |
31
- | **GPT-5.3-Codex** | OpenAI | Powerful | $1.75 | $0.175 | $14.00 | |
32
- | **Claude Sonnet 4.6** | Anthropic | Versatile | $3.00 | $0.30 | $15.00 | 1M |
33
- | **Gemini 3.1 Pro** | Google | Powerful | $2.00¹ | $0.20 | $12.00¹ | 1M |
34
- | **Claude Haiku 4.5** | Anthropic | Versatile | $1.00 | $0.10 | $5.00 | 1M |
35
- | **Gemini 3.5 Flash** | Google | Lightweight | $1.50 | $0.15 | $9.00 | 1M |
36
- | **Gemini 2.5 Pro** | Google | Powerful | $1.25¹ | $0.125 | $10.00¹ | 1M |
37
- | **GPT-5.4 mini** | OpenAI | Lightweight | $0.75 | $0.075 | $4.50 | |
38
- | **Gemini 3 Flash** | Google | Lightweight | $0.50 | $0.05 | $3.00 | 1M |
39
- | **Raptor mini** | GitHub | Versatile | $0.25 | $0.025 | $2.00 | |
25
+ | Model | Provider | Tier | Input (per 1M) | Cached input | Output (per 1M) | Context window |
26
+ | --------------------- | --------- | ----------- | -------------- | ------------ | --------------- | -------------- |
27
+ | **Raptor mini** | GitHub | Versatile | $0.25 | $0.025 | $2.00 | 264K |
28
+ | **Gemini 3 Flash** | Google | Lightweight | $0.50 | $0.05 | $3.00 | 173K |
29
+ | **GPT-5.4 mini** | OpenAI | Lightweight | $0.75 | $0.075 | $4.50 | 400K |
30
+ | **Claude Haiku 4.5** | Anthropic | Versatile | $1.00 | $0.10 | $5.00 | 160K |
31
+ | **Gemini 2.5 Pro** | Google | Powerful | $1.25¹ | $0.125 | $10.00¹ | 173K |
32
+ | **Gemini 3.5 Flash** | Google | Lightweight | $1.50 | $0.15 | $9.00 | 1M |
33
+ | **GPT-5.3-Codex** | OpenAI | Powerful | $1.75 | $0.175 | $14.00 | 400K |
34
+ | **Gemini 3.1 Pro** | Google | Powerful | $2.00¹ | $0.20 | $12.00¹ | 1M |
35
+ | **GPT-5.4** | OpenAI | Versatile | $2.50 | $0.25 | $15.00 | 1M |
36
+ | **Claude Sonnet 4.6** | Anthropic | Versatile | $3.00 | $0.30 | $15.00 | 1M |
37
+ | **Claude Opus 4.8** | Anthropic | Powerful | $5.00 | $0.50 | $25.00 | 1M |
38
+ | **Claude Opus 4.7** | Anthropic | Powerful | $5.00 | $0.50 | $25.00 | 1M |
39
+ | **GPT-5.5** | OpenAI | Powerful | $5.00 | $0.50 | $30.00 | 1M |
40
40
 
41
41
  ¹ Gemini 3.1 Pro and 2.5 Pro pricing applies to prompts ≤200K tokens.
42
42
 
43
43
  ## Custom-endpoint alternatives
44
44
 
45
- | Model | Provider | Input (per 1M) | Output (per 1M) | Context window |
46
- | --------------------- | --------- | ----------------------------- | --------------------------------------- | -------------- |
47
- | **DeepSeek V4 Flash** | DeepSeek | $0.14 | $0.28 | 1M |
48
- | **MiMo V2 Flash** 🏆 | Xiaomi | $0.10 | $0.30 | 256K |
49
- | **Kimi K2.6** | Moonshot | $0.16 | $0.95 (non-thinking) / $4.00 (thinking) | 256K |
50
- | **DeepSeek V4 Pro** | DeepSeek | $1.74 | $3.48 | 1M |
51
- | **MiMo V2.5** | Xiaomi | $0.40 | $2.00 | 1M |
52
- | **MiMo V2.5 Pro** | Xiaomi | $1.00 | $3.00 | 1M |
53
- | **Qwen 3.7 Plus** | DashScope | $0.40 (≤256K) / $1.20 (>256K) | $1.60 (≤256K) / $4.80 (>256K) | 1M |
54
- | **Qwen 3.7 Max** | DashScope | $2.50 (≤1M) | $7.50 (≤1M) | 1M |
55
- | **MiniMax M3** | MiniMax | $0.60 (≤512K) / $1.20 (>512K) | $2.40 (≤512K) / $4.80 (>512K) | 1M |
56
- | **GLM 4.7 Flash** | Z.ai | Free (rate-limited ¹) | Free (rate-limited ¹) | 200K |
57
- | **GLM 5V Turbo** | Z.ai | $1.20 | $4.00 | 200K |
58
- | **GLM 5.1** | Z.ai | $1.40 | $4.40 | 200K |
45
+ | Model | Provider | Input (per 1M) | Cached input | Output (per 1M) | Context window |
46
+ | --------------------- | --------- | ----------------------------- | ----------------------------- | --------------------------------------- | -------------- |
47
+ | **MiMo V2 Flash** | Xiaomi | $0.10 | $0.01 | $0.30 | 256K |
48
+ | **DeepSeek V4 Flash** | DeepSeek | $0.14 | $0.0028 | $0.28 | 1M |
49
+ | **Kimi K2.6** | Moonshot | $0.16 | — | $0.95 (non-thinking) / $4.00 (thinking) | 256K |
50
+ | **Qwen 3.7 Plus** | DashScope | $0.40 (≤256K) / $1.20 (>256K) | — | $1.60 (≤256K) / $4.80 (>256K) | 1M |
51
+ | **MiMo V2.5** | Xiaomi | $0.40 | $0.08 | $2.00 | 1M |
52
+ | **DeepSeek V4 Pro** | DeepSeek | $0.435 | $0.003625 | $0.87 | 1M |
53
+ | **MiniMax M3** | MiniMax | $0.60 (≤512K) / $1.20 (>512K) | $0.12 (≤512K) / $0.24 (>512K) | $2.40 (≤512K) / $4.80 (>512K) | 1M |
54
+ | **MiMo V2.5 Pro** | Xiaomi | $1.00 | $0.20 | $3.00 | 1M |
55
+ | **GLM 5V Turbo** | Z.ai | $1.20 | $0.24 | $4.00 | 200K |
56
+ | **GLM 5.1** | Z.ai | $1.40 | $0.26 | $4.40 | 200K |
57
+ | **Qwen 3.7 Max** | DashScope | $2.50 (≤1M) | — | $7.50 (≤1M) | 1M |
59
58
 
60
59
  > **Notes:**
61
60
  >
62
- > - **DeepSeek V4** input pricing shown is the **cache miss** price. Cache hits are significantly cheaper ($0.0028/M for Flash, $0.0145/M for Pro).
61
+ > - **DeepSeek V4** input pricing shown is the **cache miss** price. Cache hits are significantly cheaper ($0.0028/M for Flash, $0.003625/M for Pro).
63
62
  > - **MiMo** input pricing shown is the **cache miss** price. Cache hits are 5× cheaper for V2.5 Pro ($0.20/M) and V2.5 ($0.08/M), and 10× cheaper for V2 Flash ($0.01/M).
64
63
  > - **Gemini 3 Flash** is priced at $0.50/MTok input (text/image/video) and $1.00/MTok input for audio.
65
64
  > - **Anthropic (Claude)** models also have a cache write cost ($6.25/MTok for Opus, $3.75/MTok for Sonnet, $1.25/MTok for Haiku). Opus 4.7+ use a new tokenizer that may use up to 35% more tokens for the same text.
@@ -67,8 +66,8 @@ These are the models available through GitHub Copilot's model roster as of June
67
66
  > - **Qwen** models use **tiered pricing** — determined by total input tokens per request. Prices above are for non-thinking mode.
68
67
  > - **Kimi K2.6** pricing is from the **Moonshot platform** (direct). Via DashScope: $0.89 input / $3.71 output.
69
68
  > - **DashScope** offers a **free quota** of 1M input + 1M output tokens per model, valid for 90 days.
70
- > - **MiniMax M3** uses **tiered pricing** — input price doubles above 512K input tokens. A 7-day 50% off promotion is available for new accounts.
71
- > - **GLM** free-tier models (`glm-4.7-flash`) are aggressively rate-limited (HTTP `1302 / ChatRateLimited`), especially on context > 8K or with thinking enabled. Paid GLM models share a much larger concurrency pool.
69
+ > - **MiniMax M3** uses **tiered pricing** — input price doubles above 512K input tokens. Cache hits are priced at 20% of the input rate ($0.12/M ≤512K, $0.24/M >512K). A 7-day 50% off promotion is available for new accounts.
70
+ > - **GLM** models support prompt caching cache hits are priced at $0.24/M for 5V Turbo and $0.26/M for 5.1.
72
71
  > - **MiMo** offers a **Token Plan** subscription model with discounted rates and a free cache-writing promotion.
73
72
  > - For typical Copilot chat usage (short-to-medium prompts), you'll almost always fall in the lowest pricing tier.
74
73
 
@@ -76,32 +75,32 @@ These are the models available through GitHub Copilot's model roster as of June
76
75
 
77
76
  For a typical coding session (~10K input + ~2K output tokens per turn, 50 turns):
78
77
 
79
- | Model | Estimated session cost | Copilot Pro+ credits |
80
- | ------------------------ | ---------------------- | -------------------- |
81
- | MiMo V2 Flash 🏆 | ~$0.08 | — |
82
- | DeepSeek V4 Flash 🏆 | ~$0.10 | — |
83
- | Kimi K2.6 (non-thinking) | ~$0.18 | — |
84
- | MiMo V2.5 | ~$0.40 | — |
85
- | Kimi K2.6 (thinking) | ~$0.48 | — |
86
- | Qwen 3.7 Plus | ~$0.36 | — |
87
- | Gemini 3 Flash | ~$0.55 | ~55 |
88
- | MiniMax M3 | ~$0.54 | — |
89
- | MiMo V2.5 Pro | ~$0.80 | — |
90
- | GLM 4.7 Flash (free) | ~$0.00 ¹ | — |
91
- | GPT-5.4 mini | ~$0.83 | ~83 |
92
- | Claude Haiku 4.5 | ~$1.00 | ~100 |
93
- | DeepSeek V4 Pro | ~$1.22 | — |
94
- | Qwen 3.7 Max | ~$1.33 | — |
95
- | Gemini 2.5 Pro | ~$1.63 | ~163 |
96
- | Gemini 3.5 Flash | ~$1.65 | ~165 |
97
- | Gemini 3.1 Pro | ~$2.20 | ~220 |
98
- | GPT-5.3-Codex | ~$2.28 | ~228 |
99
- | GPT-5.4 | ~$2.75 | ~275 |
100
- | Claude Sonnet 4.6 | ~$3.00 | ~300 |
101
- | Claude Opus 4.8 / 4.7 | ~$5.00 | ~500 |
102
- | GPT-5.5 | ~$5.50 | ~550 |
78
+ | Model | Estimated session cost |
79
+ | ------------------------ | ---------------------- |
80
+ | MiMo V2 Flash | ~$0.08 |
81
+ | DeepSeek V4 Flash | ~$0.10 |
82
+ | Kimi K2.6 (non-thinking) | ~$0.18 |
83
+ | DeepSeek V4 Pro | ~$0.30 |
84
+ | Raptor mini | ~$0.33 |
85
+ | Qwen 3.7 Plus | ~$0.36 |
86
+ | MiMo V2.5 | ~$0.40 |
87
+ | Kimi K2.6 (thinking) | ~$0.48 |
88
+ | MiniMax M3 | ~$0.54 |
89
+ | Gemini 3 Flash | ~$0.55 |
90
+ | MiMo V2.5 Pro | ~$0.80 |
91
+ | GPT-5.4 mini | ~$0.83 |
92
+ | Claude Haiku 4.5 | ~$1.00 |
93
+ | Qwen 3.7 Max | ~$1.33 |
94
+ | Gemini 2.5 Pro | ~$1.63 |
95
+ | Gemini 3.5 Flash | ~$1.65 |
96
+ | Gemini 3.1 Pro | ~$2.20 |
97
+ | GPT-5.3-Codex | ~$2.28 |
98
+ | GPT-5.4 | ~$2.75 |
99
+ | Claude Sonnet 4.6 | ~$3.00 |
100
+ | Claude Opus 4.8 / 4.7 | ~$5.00 |
101
+ | GPT-5.5 | ~$5.50 |
103
102
 
104
- > **How long does 7,000 credits last?** A Pro+ subscriber running 50-turn sessions could afford roughly **13 GPT-5.5 sessions**, **23 Opus sessions**, or **212 Raptor mini sessions** per month — or mix and match.
103
+ > **How long does 7,000 credits last?** A Pro+ subscriber running 50-turn sessions could afford roughly **13 GPT-5.5 sessions**, **23 Opus sessions**, or **212 Raptor mini sessions** per month — or mix and match. (Multiply session cost by 100 to convert to AI credits.)
105
104
 
106
105
  > Prices last verified: June 1, 2026. Always check the official pages for the latest rates:
107
106
  >
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "copilot-custom-endpoint",
3
- "version": "1.3.4",
3
+ "version": "1.3.6",
4
4
  "description": "Local proxies for VS Code Copilot custom endpoints — Kimi K2 & Qwen 3.x",
5
5
  "license": "MIT",
6
6
  "type": "module",
@@ -55,4 +55,4 @@
55
55
  "dependencies": {
56
56
  "dotenv": "^17.4.2"
57
57
  }
58
- }
58
+ }