npm - glm-mcp-claude - Versions diffs - 1.0.0 - Mend

glm-mcp-claude 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (27) hide show

package/.mcp.json.example +14 -0
package/LICENSE +21 -0
package/README.md +220 -0
package/agents/glm.md +45 -0
package/assets/demo-glm-agent-umbrella.png +0 -0
package/assets/demo-glm-subagent-summary.png +0 -0
package/docs/AUTOSELECT.md +58 -0
package/docs/RULES.md +105 -0
package/docs/research/glm-capabilities.md +241 -0
package/docs/research/glm-failure-modes-routing.md +287 -0
package/docs/research/glm-misc-and-integration.md +180 -0
package/docs/research/glm-peak-usage-and-cost.md +146 -0
package/docs/research/glm-vs-opus-scenario-matrix.md +85 -0
package/docs/research/glm-vs-opus-toolcalling.md +134 -0
package/glm-mcp/.env.example +32 -0
package/glm-mcp/package-lock.json +1180 -0
package/glm-mcp/package.json +21 -0
package/glm-mcp/src/glmAgent.js +227 -0
package/glm-mcp/src/glmClient.js +136 -0
package/glm-mcp/src/index.js +306 -0
package/glm-mcp/src/loadEnv.js +24 -0
package/glm-mcp/src/router.js +291 -0
package/glm-mcp/src/smoke.js +42 -0
package/hooks/glm_subagent_router.mjs +206 -0
package/install.mjs +132 -0
package/package.json +47 -0
package/uninstall.mjs +47 -0

package/docs/research/glm-misc-and-integration.md ADDED Viewed

@@ -0,0 +1,180 @@
+# GLM (Zhipu AI / Z.ai) in Claude Code — Misc & Integration Research
+**Researcher focus:** Practical/operational + integration knowledge for building a real MCP that delegates to GLM via Claude Code's Anthropic-compatible path.
+**Date compiled:** 2026-06-30.
+**Confidence note:** Endpoint and env-var facts below are corroborated across the official Z.ai docs and multiple independent write-ups and are high-confidence. Exact *default* model mappings, per-tier concurrency limits, and "GLM 5.2" naming evolve fast — flagged inline where uncertain. WebFetch against `docs.z.ai`/`zcode.z.ai` failed during research (the fetch backend was itself routed to a broken `glm-4.7`), so direct-doc quotes below are reconstructed from search snippets of those same official pages, not first-party fetches. **Verify the two or three load-bearing values against the live docs before shipping.**
+---
+## 0. TL;DR for the MCP builder
+- **Anthropic-compatible base URL (use this for Claude Code):** `https://api.z.ai/api/anthropic`
+  Messages endpoint resolves to `https://api.z.ai/api/anthropic/v1/messages`.
+- **Auth:** set `ANTHROPIC_AUTH_TOKEN` (NOT `ANTHROPIC_API_KEY`) to your Z.ai key. `AUTH_TOKEN` → `Authorization: Bearer <key>`; `API_KEY` → `x-api-key: <key>`. Gateways like Z.ai expect Bearer, so use `ANTHROPIC_AUTH_TOKEN`.
+- **Current flagship model ID:** `glm-5.2` (and the 1M-context variant `glm-5.2[1m]`). "GLM 5.2" the user mentioned is **real** — released June 13–16, 2026.
+- **Lightweight / cheap model:** `glm-4.5-air` (default Haiku mapping). Also `glm-5-turbo` exists as a faster GLM-5-class model.
+- **Biggest operational gotcha for a *subagent/MCP* use case:** an **undocumented low concurrency limit (reported as 1 in-flight request on paid tiers)**. Parallel/background/multi-agent fan-out is exactly what triggers "Too much concurrency" errors. This is the single most important finding for delegating work to GLM.
+---
+## 1. Exact integration details
+### 1.1 Base URLs (all three)
+| Purpose | Base URL |
+|---|---|
+| **Anthropic-compatible** (Claude Code, Cline, etc.) | `https://api.z.ai/api/anthropic` |
+| OpenAI-compatible — **general** | `https://api.z.ai/api/paas/v4` |
+| OpenAI-compatible — **Coding Plan only** | `https://api.z.ai/api/coding/paas/v4` |
+- **China-mainland / BigModel variant** (Zhipu's domestic brand `open.bigmodel.cn`): OpenAI-compatible general `https://open.bigmodel.cn/api/paas/v4`, coding-only `https://open.bigmodel.cn/api/coding/paas/v4`. The install script is hosted at `https://cdn.bigmodel.cn/install/claude_code_zai_env.sh`. `z.ai` is the **international** brand; `bigmodel.cn` is the **domestic (China)** brand of the same company. For non-China users, prefer the `api.z.ai` hosts.
+- **Important:** the Coding Plan subscription quota is keyed to the *coding* endpoints. The general `/api/paas/v4` endpoint is billed against prepaid balance / resource packages, **not** the Coding Plan, and "is not applicable to general API scenarios" vice-versa. Don't cross them.
+- Some older write-ups reference `https://open.z.ai/api/paas/v4`; treat as stale. If a request 404s, try the other host and re-check live docs.
+Sources: [docs.z.ai HTTP API](https://docs.z.ai/guides/develop/http/introduction), [Z.ai API complete guide](https://www.aimadetools.com/blog/z-ai-api-complete-guide/), [ClaudeLog: Z.AI in Claude Code](https://claudelog.com/faqs/how-to-use-z-ai-in-claude-code/).
+### 1.2 Headers (raw HTTP, if the MCP calls the API directly instead of through Claude Code)
+- **Anthropic-compatible (`/v1/messages`):**
+  - `Content-Type: application/json`
+  - `x-api-key: <ZAI_KEY>`  *(or `Authorization: Bearer <ZAI_KEY>` — both observed working)*
+  - `anthropic-version: 2023-06-01`
+- **OpenAI-compatible (`/chat/completions`):**
+  - `Content-Type: application/json`
+  - `Authorization: Bearer <ZAI_KEY>`
+  - body: `{"model":"glm-5.2","messages":[...]}`
+### 1.3 API key format
+- Pass the key directly as a bearer/`x-api-key` token for normal use.
+- The key internally has the shape `<id>.<secret>` (a dot-separated pair) and can be used to mint a JWT for JWT-auth flows, but **you generally do not need to do this** — the raw key works as a bearer token.
+- Common auth errors: `401/1000` "Authentication Failed" (wrong/revoked key, or a stray space/newline got pasted in — regenerate), `401/1003` "Authentication Token expired."
+Source: [How to get a Z.AI API key](https://developer.puter.com/tutorials/how-to-get-zai-glm-api-key/), [docs.z.ai HTTP API](https://docs.z.ai/guides/develop/http/introduction).
+### 1.4 Claude Code setup (the canonical method)
+Edit `~/.claude/settings.json` (only add/replace these keys — don't clobber the file):
+```json
+{
+  "env": {
+    "ANTHROPIC_AUTH_TOKEN": "your_zai_api_key",
+    "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
+    "API_TIMEOUT_MS": "3000000",
+    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "1000000",
+    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "glm-4.5-air",
+    "ANTHROPIC_DEFAULT_SONNET_MODEL": "glm-5.2[1m]",
+    "ANTHROPIC_DEFAULT_OPUS_MODEL": "glm-5.2[1m]"
+  }
+}
+```
+Notes:
+- `API_TIMEOUT_MS: 3000000` = 50 minutes. **Needed** — long 1M-context / Max-effort calls can take a long time to first token; the default timeout kills them mid-flight, producing confusing "connection error" failures.
+- `CLAUDE_CODE_AUTO_COMPACT_WINDOW: 1000000` pairs with the `[1m]` suffix to actually use the million-token window.
+- The `[1m]` suffix selects the 1M-context variant. If Claude Code says the `[1m]` model "does not exist," **upgrade Claude Code** to the latest version.
+- **Z.ai recommends NOT hardcoding model mappings.** If you delete the three `ANTHROPIC_DEFAULT_*` keys, the server applies its own current default mapping (so you auto-track the latest GLM). Hardcoding pins you to a model that may be deprecated. For an MCP, decide deliberately: explicit IDs = reproducible/predictable; deleted = auto-latest.
+- Use `ANTHROPIC_AUTH_TOKEN`, **not** `ANTHROPIC_API_KEY`. If both are set, Claude Code prefers the token. The token goes out as `Authorization: Bearer`, which Z.ai's gateway expects.
+- Optional: in `~/.claude.json`, set `"hasCompletedOnboarding": true` to skip the first-run onboarding/auth prompt.
+- macOS/Linux auto-installer: `curl -O "https://cdn.bigmodel.cn/install/claude_code_zai_env.sh" && bash ./claude_code_zai_env.sh` (**no Windows support** — Windows users must edit `settings.json` manually, which is the case for this MCP's host environment).
+Source: [docs.z.ai Claude Code guide](https://docs.z.ai/scenario-example/develop-tools/claude), [docs.z.ai model switching](https://docs.z.ai/devpack/latest-model), [larridin GLM-5.2 in Claude Code](https://larridin.com/blog/use-glm-5-2-claude-code-cut-costs-50), [aiengineerguide](https://aiengineerguide.com/til/glm-5-2-claude-code/).
+### 1.5 Confirmed / current model IDs
+| Model ID | Role | Notes |
+|---|---|---|
+| `glm-5.2` | Current flagship coding/agentic model | ~753B MoE, ~40B active, 1M context (1,048,576 tok), up to ~128K–131K output. Released Jun 2026. |
+| `glm-5.2[1m]` | 1M-context variant of 5.2 | Suffix form used in Claude Code mappings. |
+| `glm-5-turbo` | Faster/cheaper GLM-5-class | Marketed alongside 5.2 on the Coding Plan. |
+| `glm-4.7` | Prior-gen flagship | Still offered; cheaper-than-flagship general workhorse. Note: more rigid refusal behavior (see §4). |
+| `glm-4.6` | Older coding model | Widely documented, OpenAI slug `z-ai/glm-4.6`. |
+| `glm-4.5` | Older | First GLM gen to support Claude Code. |
+| `glm-4.5-air` | Lightweight | **Default Haiku mapping**; good for cheap background tasks. |
+**"GLM 5.2" disambiguation:** the user's "GLM 5.2" maps directly to model ID **`glm-5.2`** (or `glm-5.2[1m]` for full context). It is NOT a typo for 4.5/4.6 — GLM-5.2 is the genuine current flagship (open weights MIT-licensed on Hugging Face / ModelScope). Pricing on the standalone API: ~$1.40 / 1M input, ~$4.40 / 1M output.
+- Aggregator slugs differ from bare IDs: OpenRouter `z-ai/glm-5.2`; AI/ML API `zhipu/glm-4.6`; Mastra `zhipuai-coding-plan/glm-4.5-air`. **Use bare IDs (`glm-5.2`, `glm-4.5-air`) when talking to Z.ai directly.**
+Sources: [docs.z.ai GLM-5.2](https://docs.z.ai/guides/llm/glm-5.2), [Together AI GLM-5.2](https://www.together.ai/models/glm-52), [eigent GLM-5.2](https://www.eigent.ai/blog/glm-5-2), [zai-org/GLM-5 GitHub](https://github.com/zai-org/GLM-5).
+### 1.6 Quick verification
+- Curl test: `POST https://api.z.ai/api/anthropic/v1/messages` with your key → response JSON should show `"model":"glm-..."`.
+- In a Claude Code session: `/status` should show the settings source as your `~/.claude/settings.json` and the model as `glm-5.2` / `glm-5.2[1m]`. `/effort` switches thinking intensity (GLM-5.2 has **High** and **Max**; Z.ai recommends **Max** for coding).
+- Identity probe ("what model are you?") and watching that the first call hits `api.z.ai` (not `api.anthropic.com`) catch silent fallback/misconfiguration.
+---
+## 2. claude-code-router (CCR) — how it routes to GLM (brief)
+- **What it is:** an open-source local proxy (`musistudio/claude-code-router`, run via `ccr code`). Claude Code speaks Anthropic format only; CCR intercepts `/v1/messages`, converts to a unified OpenAI-style intermediate format, routes per-request to any provider, and converts the response back. Config: `~/.claude-code-router/config.json` with `Providers[]` + `Router{}` blocks.
+- **Routing categories:** `default`, `background` (cheap/fast), `think` (reasoning), `longContext` (past `longContextThreshold`, default 60000 tok), `webSearch`. Models referenced as `provider,model`. Supports **fallback chains** (auto-retry next model on HTTP error).
+- **GLM-specific quirk it fixes — reasoning:** GLM's `/chat/completions` has reasoning on by default but the model self-decides whether to think; Claude Code's heavy system prompt suppresses that, so GLM rarely reasons. CCR ships a small **`reasoning` transformer** (<40 lines) that explicitly signals GLM to think, restoring chain-of-thought for GLM-4.5/4.6. Other useful transformers: `enhancetool` (tolerance for malformed tool-call params, but disables streaming of tool calls), `cleancache` (strips `cache_control`).
+- **When to use CCR vs. native env-vars:** the native `ANTHROPIC_BASE_URL` route (§1.4) is simpler and is the official Z.ai path — fine for a single GLM target. CCR is worth it when you want **per-request-type routing** (e.g., GLM for default/background but Claude for `think`), **fallback between providers**, or to fix GLM's reasoning suppression. The project is now sponsored by Z.ai via the GLM Coding Plan.
+Sources: [musistudio/claude-code-router](https://github.com/musistudio/claude-code-router), [CCR transformers](https://musistudio.github.io/claude-code-router/docs/server/config/transformers/), [CCR routing](https://musistudio.github.io/claude-code-router/docs/server/config/routing/), [CCR blog (GLM reasoning)](https://musistudio.github.io/claude-code-router/blog/).
+---
+## 3. Data privacy / residency / retention (CRITICAL for proprietary code)
+- **Servers are in China.** Zhipu AI (international brand Z.ai) is Beijing-based. Latency ~100–200ms from EU/US.
+- **Chinese jurisdiction risk:** API users are subject to China's **National Intelligence Law**, which can compel Chinese companies to cooperate with state intelligence. The privacy policy's "compliance with applicable laws" clause is the hook for this. Independent coverage explicitly flags "China data risk" for API (vs. self-hosted) use.
+- **Entity List:** the US Bureau of Industry and Security added Zhipu AI to its **Entity List in January 2025** (AI / military-modernization rationale). Relevant for some corporate/government users' procurement and export-compliance posture.
+- **Stated retention (per Z.ai privacy policy / DPA):** For API, the company says it **does not store** the content you input/generate — processed in real time, not saved — **except** that "Customer Data other than that covered above" may be **temporarily stored** to provide the service or "in compliance with applicable laws." Account-level data (account info + inputs) is retained while the account exists.
+- **Practical guidance for an MCP handling proprietary code:**
+  - Treat anything sent to GLM cloud as potentially exposed to a foreign jurisdiction, regardless of the no-storage claim. Do **not** route secrets, credentials, or regulated/IP-sensitive code through the cloud API without legal sign-off.
+  - Consider a **sensitivity gate** in the MCP: only delegate non-sensitive tasks (boilerplate, scaffolding, public-API glue) to GLM; keep proprietary/regulated paths on a trusted provider or local model.
+  - **Self-hosting the open weights** (MIT-licensed, on HF/ModelScope) fully removes the China-server/jurisdiction issue — the recommended route for strict-privacy orgs (at the cost of substantial GPU infra for a ~753B MoE).
+- **Not legal advice** — have counsel assess against your specific compliance regime (GDPR, ITAR/EAR, sector rules).
+Sources: [Z.ai privacy policy](https://docs.z.ai/legal-agreement/privacy-policy), [TechTimes: China data risk](https://www.techtimes.com/articles/318543/20260617/glm-52-open-weights-live-top-coding-benchmark-api-use-carries-china-data-risk.htm), [SCMP on GLM-5.2 open-source](https://www.scmp.com/tech/tech-trends/article/3357115/zhipu-ais-stock-rockets-after-chinese-firm-makes-glm-52-open-source).
+---
+## 4. Known limitations, failure modes, quirks
+- **Tool-call streaming corruption:** during multi-step agentic editing, GLM responses intermittently emit **malformed / duplicated tool-call markers** in streamed text (non-deterministic, more frequent over long sessions), which can crash the agent. CCR's `enhancetool` transformer mitigates by tolerating malformed params (trade-off: tool calls stop streaming). On benchmarks GLM tool-calling is actually strong (GLM-4.5 ~90.6% success, edging Claude Sonnet 4's 89.5%), so this is a streaming/format issue, not a capability gap.
+- **Reasoning suppression in Claude Code:** GLM often **won't think** under Claude Code's system prompt unless explicitly nudged (the CCR reasoning-transformer problem, §2). If GLM seems "shallow," this is likely why. Use `/effort` → Max.
+- **Degenerate loops:** GLM-5 has been observed entering long identical-failing-call loops (e.g., 137 steps repeating a failing call with no argument adaptation). Note this is **not GLM-specific** — Claude Opus produced even longer empty-call trajectories in the same benchmark. Build loop/step caps into the MCP regardless of provider.
+- **Refusals:** generally *lighter* than ChatGPT; GLM-4.6 described as a "happy medium." **But GLM-4.7 specifically has more rigid, dominant refusal behavior** that can interfere. If using 4.7 as the workhorse, watch for spurious refusals; `glm-5.2` is reportedly better (fewer false positives, better multi-turn jailbreak resilience).
+- **Context degradation:** industry-wide degradation past ~100K tokens applies here too. Although 5.2 advertises 1M context, effective agentic reliability is best kept well under that; some sources note ~128K as the practical sweet spot for agentic loops. Don't assume 1M = 1M usable.
+- **Non-coding weakness:** more hallucination and lower reliability on non-coding tasks; weaker on open-ended reasoning / nuanced judgment and graduate-level science (GLM-4.6 ranked ~#15 on GPQA). Best for **well-scoped coding**, not the hardest multi-file SWE or open-ended reasoning — keep those on a stronger model.
+- **Self-hosted small variants** (`glm-4.7-flash` etc.) are notably flakier in Claude Code (context loss, tool errors) than the hosted flagship.
+Sources: [Cirra: GLM-4.6 tool calling](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis), [sglang issue #15721 (4.7 tool calling in Claude Code)](https://github.com/sgl-project/sglang/issues/15721), [ollama issue #13820 (4.7-flash)](https://github.com/ollama/ollama/issues/13820), [MindStudio 5.2 vs GPT-5.5 vs Opus](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows), ["When Refusals Fail" arXiv](https://arxiv.org/pdf/2512.02445).
+---
+## 5. Reliability / uptime / regional access
+- **General uptime:** considered reliable for single-session work; occasional slowdowns during **Chinese business hours**. No first-party public status/uptime page with historical incidents was found — monitor Z.ai's own channels.
+- **Latency:** ~100–200ms from EU/US (servers in China). For interactive Claude Code sessions this is reportedly not very noticeable.
+- **Regional timing upside:** Z.ai peak hours are 14:00–18:00 UTC+8 — i.e., early morning / overnight in EU/US, so Western users often hit off-peak naturally.
+- **THE concurrency limit (most important for a subagent MCP):** an **undocumented low concurrency cap — reported as 1 in-flight request even on paid (Pro) tiers** for GLM-4.7. A paying user could only consume ~4% of their 5-hour quota before hitting "Too much concurrency." **Multi-agent / background-task fan-out overwhelms it immediately.** Z.ai's plan pages don't publish concurrency numbers. **This directly constrains using GLM as a parallel delegated subagent** — design the MCP to serialize requests (a queue / mutex), add exponential backoff on "Too much concurrency"/429, and avoid spawning concurrent GLM calls. Confirm current limits with Z.ai support before relying on parallelism.
+- **Rate-limit/quota model:** quota is "prompts per 5 hours" + weekly caps, not classic RPM. Each "prompt" ≈ 15–20 model calls.
+Sources: [opencode issue #8618 (concurrency=1)](https://github.com/anomalyco/opencode/issues/8618), [docs.z.ai FAQ](https://docs.z.ai/devpack/faq), [Z.ai rate limits](https://z.ai/manage-apikey/rate-limits), [Z.ai API guide](https://www.aimadetools.com/blog/z-ai-api-complete-guide/).
+---
+## 6. Other operationally important notes (for GLM as a delegated subagent)
+- **Plan tiers & quota economics (approximate, USD/mo):** Lite ~$18 (~80 prompts/5h, ~400/wk), Pro ~$72 (~400/5h, ~2,000/wk), Max ~$160 (~1,600/5h, ~8,000/wk), plus monthly MCP web-search/reader allowances (100 / 1,000 / 4,000). A cheaper Lite tier near **$3/mo** has also been cited historically. **Pricing/tier names shift often — verify on [z.ai/subscribe](https://z.ai/subscribe).**
+- **Quota burn multiplier:** GLM-5.2 and GLM-5-turbo deduct at **3× during peak, 2× off-peak** (vs. 1× standard for older models). A **1× off-peak promo runs through end of Sept 2026.** Z.ai's own guidance: use GLM-5.2 for complex tasks, fall back to GLM-4.7 for routine work to conserve quota. For an MCP, route cheap/background work to `glm-4.5-air` or `glm-4.7` and reserve `glm-5.2[1m]` for hard tasks.
+- **Coding-plan quota only counts on the coding/Anthropic endpoints.** If the MCP accidentally hits the general `/api/paas/v4` endpoint, it bills prepaid balance instead of the plan — a silent cost leak.
+- **Drop-in compatibility breadth:** Z.ai is (per its own marketing) the main non-Anthropic vendor offering an Anthropic-compatible endpoint, so the same `ANTHROPIC_BASE_URL` trick works for Cline, OpenCode, Cursor, Kilo Code, etc. — the MCP's approach is portable.
+- **Settings isolation:** because GLM config is just env-vars in `~/.claude/settings.json`, it's global to Claude Code. To run GLM as a *delegated* path without hijacking the user's main Claude session, consider an **isolated settings dir / separate process** with its own `ANTHROPIC_BASE_URL`+token (cf. `MG-Cafe/claudecode-glm-stack` and `ankurkakroo2/claude-code-glm-setup`, which run GLM inside Claude Code with isolated settings + session indicators), or call the `/v1/messages` endpoint directly from the MCP rather than shelling out to a globally-reconfigured Claude Code.
+- **Windows caveat:** the official auto-install script is macOS/Linux only; on this Windows host, configure `settings.json` manually.
+Sources: [aipricing.guru GLM plan pricing](https://www.aipricing.guru/z-ai-subscription-pricing/), [z.ai/subscribe](https://z.ai/subscribe), [MG-Cafe/claudecode-glm-stack](https://github.com/MG-Cafe/claudecode-glm-stack), [ankurkakroo2/claude-code-glm-setup](https://github.com/ankurkakroo2/claude-code-glm-setup).
+---
+## Open items to verify against live docs before shipping the MCP
+1. **Current default model mapping** when `ANTHROPIC_DEFAULT_*` are omitted (changes when Z.ai promotes a new flagship). Check [docs.z.ai/devpack/latest-model](https://docs.z.ai/devpack/latest-model).
+2. **Exact concurrency limit per tier** (the "1" figure is a user report, not official). Confirm with Z.ai support.
+3. Whether `glm-5.2[1m]` vs `glm-5.2` is needed for your context sizes, and current output-token cap.
+4. Live pricing/tier names and the 1× off-peak promo end date.

package/docs/research/glm-peak-usage-and-cost.md ADDED Viewed

@@ -0,0 +1,146 @@
+# GLM (Zhipu AI / Z.ai) — When to Use vs. Avoid: Peak Hours, Cost & Throttling
+**Research date:** 2026-06-30
+**Scope:** When is it most cost- and time-ideal to lean on GLM (GLM-5.2 / GLM-4.7 / "GLM Coding Plan", which the user calls "GLM 5.2") inside Claude Code as a cheaper alternative to Anthropic Opus — and when to avoid it.
+> **Reliability warning.** GLM pricing, quotas, and the peak/off-peak rules change *frequently* and are heavily promotional. Several numbers below are reported by third-party guides and individual users, not all confirmed on official pages (and the official pages are JS-rendered and could not be fetched directly during this research). Treat specific numbers as "last-known, verify before relying." Where a figure is uncertain it is explicitly flagged. **Always confirm live at [z.ai/subscribe](https://z.ai/subscribe) and [docs.z.ai/devpack/usage-policy](https://docs.z.ai/devpack/usage-policy).**
+---
+## TL;DR — the decision rules that matter
+1. **GLM bills the flagship model (GLM-5.2 / GLM-5-Turbo) at a time-of-day multiplier: ~3x quota during peak hours, ~2x off-peak** — currently **1x off-peak under a promo through ~end of Sept 2026**. ([docs.z.ai](https://docs.z.ai/devpack/usage-policy), [aipricing.guru](https://www.aipricing.guru/z-ai-subscription-pricing/))
+2. **Peak window is most commonly cited as 14:00–18:00 China time (UTC+8).** That is roughly **02:00–06:00 US Eastern / 06:00–10:00 UTC** — i.e. the **middle of the night in the Americas**, which is *good* for Western users. (A separate Z.ai promo described the restricted window as "2–6 AM ET," which is consistent with the UTC+8 afternoon window. The exact billing-peak definition is **not fully confirmed** and the two phrasings don't perfectly align — see "Uncertainties.")
+3. **Quota is metered in PROMPTS per 5-hour rolling window + a weekly cap, not tokens or RPM.** Heavy/agentic use hits the prompt ceiling, not a token budget. ([docs.z.ai FAQ](https://docs.z.ai/devpack/faq))
+4. **Concurrency is the real-world choke point.** Limits are dynamic (Max > Pro > Lite) and *not published as numbers*; users have reported effectively **1 in-flight request** on lower tiers, breaking multi-agent/subagent workflows. Concurrency is raised during off-peak. ([opencode #8618](https://github.com/anomalyco/opencode/issues/8618), [docs.z.ai usage-policy](https://docs.z.ai/devpack/usage-policy))
+5. **GLM is much cheaper but slower and less reliable than Opus.** Among the slowest frontier coding models in interactive use (~50–100 tok/s on standard servers), with documented uptime wobbles under load. Best as a *complement* to Opus, not a replacement. ([aitoolanalysis](https://aitoolanalysis.com/glm-coding-plan-review/), [cirra.ai](https://cirra.ai/articles/glm-4-6-vs-claude-sonnet-comparison))
+---
+## 1. Pricing & subscription tiers
+### GLM Coding Plan (subscription — the relevant product for Claude Code)
+Flat monthly subscription that exposes an OpenAI-compatible endpoint usable inside Claude Code, Cline, Roo Code, etc. All tiers include the **same models** (GLM-5.2 flagship, GLM-5-Turbo, GLM-4.7, GLM-4.5-air); tiers differ by **quota, MCP allowance, and concurrency/priority**, not model access. ([z.ai/subscribe](https://z.ai/subscribe), [docs.z.ai/devpack/overview](https://docs.z.ai/devpack/overview))
+| Tier | Std monthly (mid-2026)* | Prompts / 5h | Prompts / week | MCP calls/mo | Concurrency guidance |
+|------|------------------------|--------------|----------------|--------------|----------------------|
+| **Lite** | ~$18 (promo ~$12.60) | ~80 | ~400 | 100 | 1 project at a time |
+| **Pro** | ~$72 (promo ~$50.40) | ~400 | ~2,000 | 1,000 | 1–2 projects |
+| **Max** | ~$160 (promo ~$112) | ~1,600 | ~8,000 | 4,000 | 2+ projects |
+| **Team** | custom | higher shared | — | — | org-level |
+\* **Pricing has moved a lot.** Earlier 2026 tiers were Lite **$10** / Pro **$30** / Max **$80** (post-Feb 2026 reset; original launch had a $3/mo promo, now gone). Some sources still quote Pro at $15/mo. Billing discounts ~10% monthly / 20% quarterly / 30% yearly; a ~30% intro promo has also been live. **Verify the live number.** ([aipricing.guru](https://www.aipricing.guru/z-ai-subscription-pricing/), [codingplan.run](https://codingplan.run/plans/glm-coding-plan), [distk.in](https://distk.in/blog/glm-coding-plan-pricing-guide-2026.html), [vibecoding.app](https://vibecoding.app/blog/zhipu-ai-glm-pricing-2026))
+**Restriction:** the Coding Plan only works inside officially supported coding tools — it is *not* a general API-key replacement. ([grokipedia](https://grokipedia.com/page/GLM_Coding_Plan))
+### Pay-as-you-go API token pricing (for comparison)
+| Model | Input $/M | Output $/M | Source |
+|-------|-----------|-----------|--------|
+| GLM-4.6 (list) | $0.60 | $2.20 | [docs.z.ai/guides/overview/pricing](https://docs.z.ai/guides/overview/pricing) |
+| GLM-4.6 (OpenRouter) | $0.43 | $1.74 | [openrouter](https://openrouter.ai/z-ai/glm-4.6) |
+| GLM-4.7 (OpenRouter) | $0.40 | $1.75 | [openrouter](https://openrouter.ai/z-ai/glm-4.7) |
+| GLM-5.2 (approx) | ~$1.40 | ~$4.40 | [pricepertoken](https://pricepertoken.com/pricing-page/provider/z-ai) |
+### Anthropic Claude (for the cost gap)
+| Model | Input $/M | Output $/M |
+|-------|-----------|-----------|
+| Claude Opus 4.8 / 4.7 | $5.00 | $25.00 |
+| Claude Sonnet 4.6 | $3.00 | $15.00 |
+| Claude Haiku 4.5 | $1.00 | $5.00 |
+Source: [platform.claude.com/docs/pricing](https://platform.claude.com/docs/en/about-claude/pricing).
+**Cost gap:** On raw tokens GLM-4.7 (~$0.40/$1.75) is roughly **~10x cheaper input and ~14x cheaper output than Opus** ($5/$25). GLM-5.2 (~$1.40/$4.40) is ~3.5x/5.7x cheaper than Opus. Via the subscription, Z.ai claims effective cost ~1% of standard API rates for heavy users. GLM also uses ~15% fewer tokens/task than GLM-4.5, partly offsetting slower speed. ([cirra.ai](https://cirra.ai/articles/glm-4-6-vs-claude-sonnet-comparison), [docs.z.ai FAQ](https://docs.z.ai/devpack/faq))
+---
+## 2. Peak vs. off-peak — the core timing rule
+**Mechanism:** GLM-5.2 and GLM-5-Turbo (the "Opus-class" flagships) consume your prompt quota at a **multiplier based on time of day**:
+- **Peak hours: ~3x** quota burn
+- **Off-peak: ~2x** normally — **currently 1x under a promo valid through ~end of September 2026**
+Older models (GLM-4.7, GLM-4.5-air) do **not** carry this premium multiplier — they burn quota at the base rate. ([docs.z.ai/devpack/usage-policy](https://docs.z.ai/devpack/usage-policy), [aipricing.guru](https://www.aipricing.guru/z-ai-subscription-pricing/), [Ivan Fioravanti / X](https://x.com/ivanfioravanti/status/2043685076186120442))
+**Peak window (timezone-critical):**
+- Most-cited: **14:00–18:00 China Standard Time (UTC+8)**.
+- That converts to **≈06:00–10:00 UTC**, **≈02:00–06:00 US Eastern**, **≈23:00–03:00 US Pacific (prev night)**, **≈07:00–11:00 UK**.
+- A Z.ai promo separately described its restricted window as **"2–6 AM ET"** (consistent with the UTC+8 afternoon). ([Z.ai / X](https://x.com/Zai_org/status/2033233961669783600))
+**Implication for a US/European Claude Code user:** your normal working day **lands almost entirely in GLM's off-peak window** → you currently pay **1x** (promo) and avoid the 3x peak surcharge. This is the single biggest reason GLM is attractive *right now* for Western-timezone devs. The exception is late-night US-East work (roughly 2–6 AM ET) which hits peak.
+---
+## 3. Quotas, rate limits & concurrency
+- **Quota unit = prompts, not tokens.** One "prompt" = one user query, internally fanning out to ~15–20 model calls. Measured per **5-hour rolling window** plus a **weekly cap** (weekly counter starts at purchase, resets on a 7-day cycle). ([docs.z.ai FAQ](https://docs.z.ai/devpack/faq))
+- **No published RPM limit.** The constraint is prompts/cycle, not requests-per-minute.
+- **Concurrency limits are dynamic and unpublished**, ordered Max > Pro > Lite; "recommended concurrent projects" = 1 (Lite) / 1–2 (Pro) / 2+ (Max). Raised during off-peak. ([docs.z.ai/devpack/usage-policy](https://docs.z.ai/devpack/usage-policy))
+- **Real-world concurrency complaint (important):** a Pro user reported an *undocumented* effective concurrent limit of **~1 in-flight request**, hitting "Too much concurrency" after using only ~4% of the 5h quota — making subagent / parallel-agent workflows fail. ([opencode #8618](https://github.com/anomalyco/opencode/issues/8618))
+**Takeaway:** for **parallel / multi-agent / subagent** workflows, GLM lower tiers can be unusable regardless of remaining quota. Use Opus for fan-out work, or budget Max tier + run during off-peak when concurrency is lifted.
+---
+## 4. Latency & reliability
+- **Speed:** Among the **slowest** frontier coding models for interactive use. ~**50–100 tok/s** on standard optimized servers; noticeably slower than Opus/Grok. Prompt loading / input processing also adds latency, especially on local quantized variants. The exception is **Cerebras-hosted GLM-4.7 at ~1,000–1,700 tok/s** — but that is a different deployment, not the standard Z.ai Coding Plan endpoint. ([aitoolanalysis](https://aitoolanalysis.com/glm-coding-plan-review/), [cerebras.ai](https://www.cerebras.ai/blog/glm-4-7))
+- **Reliability:** Documented instability under load — after the GLM-5 launch, traffic spiked ~10x, the service was unstable for days, new subscriptions were briefly capped, and Z.ai issued a public apology. GLM-4.6 hallucinates / produces erroneous code (off-by-one, edge-case misses) somewhat more than Claude; Claude leads on deep debugging and tricky reasoning. GLM-4.7 improved stability and tool-use consistency. ([cirra.ai](https://cirra.ai/articles/glm-4-6-vs-claude-sonnet-comparison), [medium/leucopsis](https://medium.com/@leucopsis/glm-4-6-review-0600e9425c73))
+- **Data-residency note:** the cloud API routes through servers in China — a consideration for sensitive code. ([aitoolanalysis](https://aitoolanalysis.com/glm-coding-plan-review/))
+---
+## 5. Recommendation table — when to use GLM heavily / lightly / avoid
+Times shown in **US Eastern** with **UTC+8** in parentheses (peak = 14:00–18:00 UTC+8 ≈ 02:00–06:00 ET).
+| Condition | Recommendation | Why |
+|-----------|----------------|-----|
+| **Western working hours (≈08:00–24:00 ET = off-peak in UTC+8)** | **USE GLM HEAVILY** for routine/bulk/boilerplate coding | Off-peak = currently 1x quota (promo) on flagship; cheapest possible. |
+| **Late night US-East ~02:00–06:00 ET (= 14:00–18:00 UTC+8 peak)** | **AVOID flagship GLM-5.2 / use lightly** | 3x quota burn on GLM-5.2/Turbo; concurrency tightened. Switch to GLM-4.7 (no multiplier) or to Opus. |
+| **Routine edits, refactors, tests, scaffolding (any time)** | **USE GLM** (prefer GLM-4.7 to dodge the multiplier) | Cheap; GLM-4.7 has no peak surcharge; quality is adequate. |
+| **Hard debugging, tricky reasoning, architecture, correctness-critical code** | **AVOID GLM → use Opus** | GLM hallucinates/errs more; Opus more reliable for deep work. |
+| **Parallel / multi-agent / subagent fan-out** | **AVOID GLM on Lite/Pro → use Opus** (or Max off-peak) | Effective concurrency can be ~1; subagent workflows fail. |
+| **Latency-sensitive / tight interactive loop** | **Prefer Opus**; GLM only if cost dominates | GLM among slowest models; ~50–100 tok/s. |
+| **Promo ends (after ~Sept 2026), off-peak reverts to 2x** | **Re-evaluate**; shift more flagship load to GLM-4.7, keep Opus for hard tasks | Off-peak flagship cost doubles; GLM-4.7 stays base-rate. |
+| **Big context / long-running autonomous loop where speed matters less** | **USE GLM** | Token efficiency + low cost outweigh slowness when not waiting on it. |
+| **Sensitive / proprietary code with data-residency concerns** | **AVOID GLM cloud API** | Routes through China-based servers. |
+**Codified cost rule of thumb:** If you must run the flagship **GLM-5.2 during peak (3x)**, its effective subscription cost-per-prompt roughly triples — at that point Opus's reliability/speed often wins. **During off-peak (currently 1x), GLM is dramatically cheaper than Opus** and is the default choice for non-critical work. **GLM-4.7 sidesteps the multiplier entirely** and is the safest "always-on cheap" pick.
+---
+## Uncertainties / verify-before-relying
+- **Exact peak window** — 14:00–18:00 UTC+8 is the most-cited figure but not confirmed on a fetched official page; the "2–6 AM ET" promo phrasing is from Z.ai's X account and roughly aligns. The *billing*-peak definition vs. *promo-availability* window may differ. **Confirm on [docs.z.ai/devpack/usage-policy](https://docs.z.ai/devpack/usage-policy).**
+- **Off-peak 1x promo end date** — reported "end of September 2026"; earlier promos said "end of April." Dates shift.
+- **Tier prices** — range across sources ($10/$30/$80 vs $18/$72/$160 vs Pro $15). Region-, cycle-, and promo-dependent.
+- **Concurrency numbers** — not officially published; "~1 in-flight" is a user report on one tier/tool.
+- **GLM-5.2 token pricing** (~$1.40/$4.40) is approximate from an aggregator.
+- Official Z.ai pages are JS-rendered; WebFetch could not load them directly, so the above leans on WebSearch summaries and third-party guides.
+## Sources
+- Z.ai subscribe / plans: https://z.ai/subscribe
+- Z.ai dev docs overview: https://docs.z.ai/devpack/overview
+- Z.ai usage policy: https://docs.z.ai/devpack/usage-policy
+- Z.ai FAQ: https://docs.z.ai/devpack/faq
+- Z.ai API pricing: https://docs.z.ai/guides/overview/pricing
+- AI Pricing Guru: https://www.aipricing.guru/z-ai-subscription-pricing/
+- BuyGLM guide: https://buyglm.com/guides/glm-coding-plan-guide
+- Distk guide: https://distk.in/blog/glm-coding-plan-pricing-guide-2026.html
+- codingplan.run: https://codingplan.run/plans/glm-coding-plan
+- vibecoding.app: https://vibecoding.app/blog/zhipu-ai-glm-pricing-2026
+- pricepertoken (Z-ai): https://pricepertoken.com/pricing-page/provider/z-ai
+- OpenRouter GLM-4.6: https://openrouter.ai/z-ai/glm-4.6
+- OpenRouter GLM-4.7: https://openrouter.ai/z-ai/glm-4.7
+- Cirra GLM-4.6 vs Sonnet: https://cirra.ai/articles/glm-4-6-vs-claude-sonnet-comparison
+- AI Tool Analysis review: https://aitoolanalysis.com/glm-coding-plan-review/
+- Cerebras GLM-4.7 speed: https://www.cerebras.ai/blog/glm-4-7
+- opencode concurrency bug #8618: https://github.com/anomalyco/opencode/issues/8618
+- Ivan Fioravanti / X (multiplier): https://x.com/ivanfioravanti/status/2043685076186120442
+- Z.ai / X (peak window promo): https://x.com/Zai_org/status/2033233961669783600
+- Anthropic pricing: https://platform.claude.com/docs/en/about-claude/pricing

package/docs/research/glm-vs-opus-scenario-matrix.md ADDED Viewed

@@ -0,0 +1,85 @@
+# GLM vs. Opus: Developer Task Scenario Matrix
+**Purpose:** A routing reference for deciding, per concrete developer task type, whether to delegate to **GLM** (Zhipu/Z.ai, ~10x cheaper) or keep on **Anthropic Claude Opus**.
+**Date:** 2026-06-30
+**Models in scope:** GLM-5.2 (current GLM coding flagship) vs. Claude Opus 4.8 (current practical Claude comparison point).
+---
+## Decision framework
+Two forces drive every call:
+1. **Quality parity.** On *coding* benchmarks GLM-5.2 now lands at ~95-100% of Opus 4.8 for a fraction of the price (Terminal-Bench 2.1: 81.0 vs Opus 85.0; SWE-bench Pro: 62.1, edging some closed frontier models). So for well-specified code where the compiler/tests catch errors, GLM is a near-peer. ([danilchenko.dev](https://www.danilchenko.dev/posts/glm-5-2-review/), [Semgrep](https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/))
+2. **Cost of being wrong.** The dominant cost is *retry rate*, not token price: "A cheaper model that requires human review on 15% of outputs costs more per completed task than an expensive model with a 3% review rate." ([morphllm](https://www.morphllm.com/comparisons/claude-code-alternatives)) So favor GLM where errors are cheap and self-evident; favor Opus where a wrong answer cascades silently or is expensive to detect.
+**GLM strengths** (grounded): frontend/visual code generation, boilerplate, reliable tool-calling, well-specified single-purpose work, algorithmic codegen, multilingual robustness, and ~10x lower price. Holds long context well for code-tracing even at ~400K in practice.
+**GLM weaknesses** (grounded): ~28% hallucination on factual recall (dangerous for obscure APIs recalled from memory), open-ended multi-step reasoning and high-stakes planning, nuanced/constraint-dense instruction-following, intricate debugging, weaker on systems-language edge cases, and instruction-fidelity erosion above ~64K context. ([danilchenko.dev](https://www.danilchenko.dev/posts/glm-5-2-review/), [MindStudio](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows), [Medium/GLM-4.6 review](https://medium.com/@leucopsis/glm-4-6-review-0600e9425c73))
+**Fit weight scale:** `+2` = strongly GLM-favored; `+1` = GLM-leaning; `0` = toss-up / cost breaks the tie toward GLM; `-1` = Opus-leaning; `-2` = Opus; `-3` = strongly Opus.
+---
+## Scenario matrix
+| Task type | Engine | Fit weight | One-line reason |
+|---|---|---|---|
+| Unit test generation | GLM | +2 | Well-specified, low-ambiguity, errors caught by the test runner itself — ideal cheap-model work. |
+| Integration / e2e test writing | GLM | +1 | Mostly boilerplate-shaped, but cross-service flow reasoning adds a little risk; verify the assertions. |
+| Database schema migrations | OPUS | -2 | Irreversible, stateful, silent data-loss risk — a wrong answer is expensive and hard to detect. |
+| SQL query writing / optimization | GLM | +1 | Writing is easy GLM territory; optimization on complex plans leans Opus, so verify EXPLAIN output. |
+| Regex writing | GLM | +2 | Classic single-purpose codegen; quick to test against examples, cheap to be wrong. |
+| Data pipeline / ETL code | GLM | +1 | Well-specified transform code suits GLM; watch edge-case/null handling and schema drift. |
+| Infrastructure-as-code (Terraform, K8s, Docker) | OPUS | -1 | Boilerplate-ish but mistakes hit prod infra/cost/security; review plans, lean Opus for non-trivial. |
+| CI/CD pipeline config (GitHub Actions, etc.) | GLM | +1 | Templated, YAML-shaped, fast feedback loop on failure — cheap and easy to iterate. |
+| Code review of a diff | OPUS | -2 | Requires subtle reasoning to catch what's *absent*; GLM's hallucination + miss rate is costly here. |
+| Small/local refactor (one file/function) | GLM | +2 | Bounded scope, near-parity quality at single-file refactors — prime GLM use case. |
+| Large cross-cutting refactor (many files) | OPUS | -3 | Long-horizon, multi-step planning + context fidelity past 64K — exactly GLM's weak spot. |
+| i18n / translation of UI strings | GLM | +2 | Multilingual robustness is a GLM strength; low-stakes, easily spot-checked. |
+| Documentation / README / docstrings | GLM | +2 | Low-ambiguity prose-over-code, minimal retry risk, large volume — best cost/quality ratio. |
+| CLI / shell scripting / automation | GLM | +1 | Single-purpose scripting suits GLM; review destructive ops (rm, deletes) before running. |
+| Jupyter notebook / exploratory data analysis | GLM | +1 | Iterative, self-correcting via cell output; cheap to retry, fits exploratory loops. |
+| ML model training code | GLM | 0 | Boilerplate (data loaders, loops) is GLM-fine; subtle correctness (loss, shapes) needs review — toss-up. |
+| Performance optimization of existing code | OPUS | -2 | Needs deep reasoning about hot paths and trade-offs; GLM trails on this analysis. |
+| Third-party API integration (unfamiliar docs) | OPUS | -2 | High hallucination on obscure APIs from memory; risky unless you paste real docs into context. |
+| Type errors / linting fixes | GLM | +2 | Mechanical, compiler/linter is ground truth — wrong answers are caught instantly and cheaply. |
+| Dependency upgrades / migration to new lib versions | OPUS | -1 | Subtle breaking-change reasoning + possibly-obscure new APIs; verify against changelogs. |
+| Greenfield prototyping of small app/feature | GLM | +2 | Fast, polished, especially frontend; low stakes, throwaway-friendly — GLM shines. |
+| Boilerplate / scaffolding / config files | GLM | +2 | The canonical cheap-model task: templated, well-specified, trivially verified. |
+| Frontend UI components / styling | GLM | +2 | GLM's single most-praised strength — visually polished output needing less manual cleanup. |
+| Algorithmic / competitive-programming problems | GLM | +1 | Strong algorithmic codegen with self-checkable I/O; hardest problems still lean Opus. |
+| Security-sensitive code (auth, crypto, validation) | OPUS | -3 | Errors are silent, catastrophic, hard to detect — never optimize for cost here. |
+| Systems programming (Rust/Go/C concurrency, memory) | OPUS | -2 | Weaker on systems-language edge cases and concurrency reasoning; compiler helps but doesn't catch logic races. |
+---
+## Six-line pattern summary
+1. **GLM wins the well-specified middle:** boilerplate, scaffolding, frontend/UI, tests, regex, docs, i18n, type/lint fixes — bounded tasks where the compiler, linter, or test runner is ground truth and being wrong is cheap and obvious.
+2. **Opus wins where errors are silent or expensive:** security/crypto, schema migrations, code review, performance work — places a wrong answer cascades or hides, so the ~10x price is cheap insurance against retry/incident cost.
+3. **Opus wins long-horizon reasoning:** large cross-cutting refactors and multi-step planning hit GLM's known weak spots (instruction-fidelity erosion >64K, open-ended multi-step reasoning).
+4. **The obscure-API trap:** GLM's ~28% factual-recall hallucination makes unfamiliar third-party integrations and dependency migrations risky *unless you paste the real docs in-context* — then it's fine.
+5. **Decision rule:** route by retry cost, not token price — GLM where output is fast to verify and cheap to redo; Opus where verification is hard or failure is costly.
+6. **Always verify GLM output** and escalate to Opus on the first sign of low quality; the borderline cases (ML training code, IaC, SQL optimization) are the ones to watch most closely.
+---
+## Uncertainty flags
+- **Benchmark provenance:** several GLM-5.x scores were originally vendor-self-reported; independent corroboration (e.g. Semgrep's security tests) exists but is partial. Treat exact percentages as directional.
+- **Long-context behavior is contested:** one source flags instruction-fidelity erosion >64K; another reports clean handling of 400K-token monorepo traces. Real performance likely depends heavily on task type (retrieval/trace vs. constraint-dense generation).
+- **Fast-moving field:** model versions and pricing shift monthly (Opus 4.6/4.7/4.8 all appear across 2026 sources). Re-validate before hardcoding any cost assumption.
+- **Security caveat:** Semgrep found GLM-5.2 *beating* Claude on one IDOR-detection benchmark — so the blanket "Opus for security" rule is about cost-of-being-wrong/risk posture, not a claim GLM is incapable. For *writing* security-critical code, the conservative routing still holds.
+### Sources
+- [Serenities AI — GLM-5.1 vs Opus coding benchmark](https://serenitiesai.com/articles/glm-5-1-zhipu-coding-benchmark-claude-opus-comparison-2026)
+- [Semgrep — GLM 5.2 beats Claude in cyber benchmarks](https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/)
+- [danilchenko.dev — GLM-5.2 review](https://www.danilchenko.dev/posts/glm-5-2-review/)
+- [MindStudio — GLM 5.2 vs Opus 4.8 agentic workflows](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows)
+- [Morph — Claude Code alternatives / retry-rate economics](https://www.morphllm.com/comparisons/claude-code-alternatives)
+- [Morph — Best AI model for coding June 2026](https://www.morphllm.com/best-ai-model-for-coding)
+- [Medium — GLM-4.6 review (frontend/tool-calling strengths)](https://medium.com/@leucopsis/glm-4-6-review-0600e9425c73)
+- [IntuitionLabs — GLM-4.6 vs Sonnet & GPT-5](https://intuitionlabs.ai/articles/glm-4-6-open-source-coding-model)

package/docs/research/glm-vs-opus-toolcalling.md ADDED Viewed

@@ -0,0 +1,134 @@
+# GLM (Zhipu / Z.ai) vs Claude Opus: Tool-Calling & Agentic Behavior
+**Purpose:** Resolve the apparent contradiction — GLM scores *well* on tool-calling/function-calling benchmarks yet often performs *worse* in sustained multi-step agentic loops — so a router can decide whether to delegate a coding subtask to GLM (cheap) or keep it on Opus.
+**Research date:** 2026-06-30. Sources are cited inline. Vendor claims vs independent results are flagged.
+---
+## TL;DR (the resolution of the contradiction)
+The contradiction is **mostly real but mostly fixable, and the cause is state-handling, not raw capability.**
+1. **One-shot schema adherence:** GLM is genuinely excellent. It matches or beats Opus on BFCL and tau-bench. Clean JSON, refuses unknown tools, minimal argument hallucination.
+2. **Sustained agentic loops:** GLM degrades — but the dominant observed failure mode is **NOT** "GLM is dumb at planning." It is that **GLM requires its `reasoning_content` (thinking blocks) to be fed back into the conversation on every turn** ("Preserved Thinking"). When a harness drops that field, GLM **loses its plan after each tool call and falls into infinite loops** — redoing work, corrupting files, spawning redundant subagents.
+3. **"Plan-then-act" framing:** Independent analysis confirms GLM's default tool-use pattern is more "Thinking/Planning → Calling tools as planned → Synthesizing" (front-loaded planning) than Opus's turn-by-turn interleave. This makes GLM more brittle when downstream steps depend on earlier *tool results* it didn't anticipate during the upfront plan.
+4. **Opus interleaves natively.** Claude 4+ reasons *between* tool calls (interleaved thinking), adapting mid-execution. On Opus 4.6+ adaptive mode it is automatic. This is the structural reason Opus holds up over long loops.
+5. **Routing rule of thumb:** GLM is fine for *one tool call or a short, independent fan-out of calls with a clean schema*. Keep Opus for *long, dependent, MCP-heavy agentic loops* — **especially in any harness whose `reasoning_content` passthrough you have not verified.** Within Claude Code (which does preserve thinking), GLM is far more usable for multi-turn work than in third-party harnesses.
+---
+## 1. How GLM-4.7 / GLM-5.x handle multi-turn tool use
+### The hybrid reasoning model
+GLM is not purely interleaved *or* purely plan-then-execute. Z.ai documents a **hybrid** that supports several modes: thinking, non-thinking, **interleaved reasoning**, **planned-before-response**, and **planned-before-tool-call** ([GLM-4.5 GitHub](https://github.com/zai-org/GLM-4.5), [Z.AI Thinking Mode docs](https://docs.z.ai/guides/capabilities/thinking-mode)). Interleaved thinking has been *supported* since GLM-4.5 — GLM can think between tool calls and after receiving tool results.
+GLM-4.7 adds:
+- **Interleaved Thinking** — "thinks before every response and tool calling."
+- **Preserved Thinking** — "in coding agent scenarios, the model automatically retains all thinking blocks across multi-turn conversations, reusing existing reasoning instead of re-deriving from scratch."
+- **Turn-level / per-turn thinking control.**
+- Thinking is **enabled by default** in GLM-4.7+ (different from GLM-4.6's hybrid default).
+([GLM-4.7 README](https://huggingface.co/zai-org/GLM-4.7/blob/main/README.md))
+### The catch: Preserved Thinking is a hard dependency, and harnesses break it
+This is the heart of the "worse at agentic loops" observation. GLM via Z.AI **requires `reasoning_content` to be passed back to the API on every assistant turn**, using `thinking: {type: "enabled", clear_thinking: false}` and appending `reasoning_content` into each assistant message. If the harness **drops `reasoning_content`**, GLM loses its reasoning state and **loops**.
+Concrete independent evidence:
+- **Goose issue #7363:** GLM-4.7/GLM-5 "lose all reasoning context after every single action and spiral into infinite loops — redoing completed work, corrupting files, and spawning redundant subagents." The *same task* completed in **Claude Code in ~2 minutes**, but in Goose took **40+ tool calls and 40+ minutes with no completion** ([goose#7363](https://github.com/aaif-goose/goose/issues/7363)).
+- **oh-my-pi issue #517:** GLM-5 "gets stuck repeating the same actions" after **2–5 messages/tool calls**; reasoning resets between turns; root cause is the harness dropping `reasoning_content`. **GLM-4.7 is NOT affected** by the same loop ([oh-my-pi#517](https://github.com/can1357/oh-my-pi/issues/517)). This is important: GLM-5's *stronger* reasoning made it *more* dependent on state preservation, so it breaks in harnesses that GLM-4.7 tolerated.
+- **ClawsBench (cited via secondary):** GLM-5 reportedly entered a **137-step loop of identical failing calls with zero argument adaptation**, with the actual plan only appearing at the final step — a severe variant of the same stop-and-replan / state-loss pattern.
+**The fix** (per [Cerebras GLM-4.7 migration guide](https://www.cerebras.ai/blog/glm-4-7-migration-guide)): set `clear_thinking: false` for agent loops / multi-step plans / coding sessions so internal state carries across calls; use `clear_thinking: true` only for one-off calls, batch jobs, or when you see drift.
+### Plan-then-act vs interleave (the observed default)
+Independent technical analysis of GLM-4.6 tool calling describes the default trace as: **"Thinking/Planning… Calling tool(s) as planned… Synthesizing final answer"** — i.e., **explicit planning before execution rather than interleaved improvisation** ([Cirra: GLM-4.6 Tool Calling & MCP analysis](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis)). So even when the *capability* to interleave exists, GLM's behavioral tendency skews toward front-loaded plans. That is exactly what hurts in loops where step N's correct action only becomes knowable after step N-1's tool *result*.
+### Known issues summary
+- **Reasoning suppressed/lost during tool use** → caused by harness dropping `reasoning_content`, not the model. Fixable.
+- **Stop-and-replan / infinite loops** → same root cause; worse on GLM-5 than GLM-4.7.
+- **Streaming tool-call corruption** → not strongly documented as a GLM-specific defect; MoE latency variability noted as a general caveat ([Cirra](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis)).
+---
+## 2. How Opus handles interleaved reasoning + tool use (and why it helps loops)
+Claude 4 models support **interleaved thinking**: reason *between* tool calls, not just before them — "think, call a tool, see the result, think again, proceed" ([Anthropic adaptive thinking docs](https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking), [Grokipedia: Interleaved thinking](https://grokipedia.com/page/Interleaved_thinking_Claude_AI)).
+Why it matters for long loops:
+- **Mid-execution adaptability.** Without it, Claude commits to a plan upfront; with it, Claude adapts after each tool result. This is decisive when downstream decisions depend on what earlier calls returned.
+- **Automatic on Opus 4.6+ adaptive mode** (no beta header). On older Claude 4 models, enable via `interleaved-thinking-2025-05-14`. *Caveat:* Opus 4.6 **manual** mode does **not** support interleaved thinking — use adaptive mode for agentic work.
+- **Thinking budget can exceed `max_tokens`** within a turn (whole context window available), and a `task_budget` beta (`task-budgets-2026-03-13`) lets the model pace itself across an *entire* loop. Both are built for long-horizon agentic runs.
+Anthropic explicitly recommends adaptive thinking for "multi-step tool use, complex coding tasks, and long-horizon agent loops." The practical upshot, mirrored by the Goose data point, is that **Claude Code preserves thinking state for you**, so Opus rarely hits the state-loss loop that bites GLM in third-party harnesses.
+---
+## 3. Single-shot schema adherence vs sustained orchestration — clearly distinguished
+| Dimension | Single tool call / structured extraction | Sustained agentic orchestration (many dependent calls) |
+|---|---|---|
+| **What's tested** | Correct function name, correct/typed args, valid JSON, refuse unknown tools | Carrying intent across N turns, adapting to tool *results*, not looping, recovering from errors |
+| **GLM** | **Strong.** "Boring magic": strict schema compliance, won't fabricate extra params, JSON-native, explicit errors, "zero hallucination in tool calls" ([Cirra](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis)) | **Fragile in practice.** Plan-then-act tendency + Preserved-Thinking dependency → loops/state loss when harness imperfect ([goose#7363](https://github.com/aaif-goose/goose/issues/7363), [oh-my-pi#517](https://github.com/can1357/oh-my-pi/issues/517)) |
+| **Opus** | Strong (slightly lower raw BFCL than GLM, see §4) | **Strong & robust** via native interleaved thinking + harness-managed state |
+| **Bench proxy** | BFCL v3 | tau/tau²-bench, SWE-bench Verified/Pro, Terminal Bench |
+GLM may be *fully fine* at the former and *weak/unreliable* at the latter. The benchmarks that conflate the two are what create the illusion of contradiction.
+---
+## 4. Benchmarks that separate the two (with numbers vs Opus)
+### Function-calling / schema benchmarks (GLM looks great)
+- **BFCL v3 (GLM technical report, vendor):** GLM-4.5 **77.8** vs Claude Opus 4 **74.4**, Claude Sonnet 4 **75.2** — GLM best overall ([GLM-4.5 paper](https://arxiv.org/pdf/2508.06471)).
+- **BFCL v3 live leaderboard (independent, ~Jun 2026):** GLM-4.5 **76.7%**, Claude Opus 4.7 **76.6%**, Gemini 3.1 Flash Lite **76.5%** — essentially tied at the top ([pricepertoken BFCL v3](https://pricepertoken.com/leaderboards/benchmark/bfcl-v3)).
+### Agentic tool-use benchmarks (close; GLM competitive, Opus edges some)
+- **tau-bench (GLM-4.5 report, vendor):** Retail GLM **79.7** vs Opus 4 **81.4**; Airline GLM **60.4** vs Opus 4 **59.6** — basically on par ([GLM-4.5 paper](https://arxiv.org/pdf/2508.06471)).
+- **τ²-bench (GLM-4.7 README, vendor):** GLM-4.7 **87.4%** (GLM-4.6 75.2%). Strong, but vendor-reported, no head-to-head Opus number in same table ([GLM-4.7 README](https://huggingface.co/zai-org/GLM-4.7/blob/main/README.md)).
+### Agentic SWE benchmarks (Opus leads, gap narrowing)
+- **SWE-bench Verified (GLM-4.7 README, vendor):** **73.8%** (+5.8 over GLM-4.6); SWE-bench Multilingual 66.7%; Terminal Bench 2.0 41% ([GLM-4.7 README](https://huggingface.co/zai-org/GLM-4.7/blob/main/README.md)).
+- **SWE-bench Pro ([morphllm leaderboard](https://www.morphllm.com/swe-bench-pro)):**
+  - Claude **Opus 4.8** 69.2% (vendor aggregate, tuned scaffold)
+  - Claude **Opus 4.7** 64.3% (vendor aggregate)
+  - **GLM-5.2** **62.1%** (third-party measured; "Z.ai published no SWE-bench number at launch")
+  - Claude Opus 4.6 51.9% on *Scale standardized* public set — note the huge gap vs vendor aggregates, driven by scaffold/harness differences.
+- **GLM-4.6 vs Claude Sonnet 4, CC-Bench (vendor, multi-turn coding):** ~**48.6% win rate** (near parity), with ~15% fewer tokens than GLM-4.5 ([Cirra](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis), [IntuitionLabs](https://intuitionlabs.ai/articles/glm-4-6-open-source-coding-model)). Z.ai itself conceded GLM-4.6 "still lags Claude Sonnet 4.5 in coding."
+**Interpretation:** GLM **wins/ties pure function-calling (BFCL)**, **ties tau-bench**, and **trails Opus on agentic SWE** — and the SWE gap is where dependent, long-horizon orchestration is actually stressed. **Strong caveat:** SWE numbers are extremely scaffold-sensitive (Opus 4.6 swings 51.9% → 69.2% by harness), and many GLM figures are vendor-reported. Treat absolute numbers as directional, not exact.
+### Naming note
+"GLM-5.2" exists: it is the latest iterative release in the GLM-5 family (after GLM-5, GLM-5-Turbo, GLM-5.1), MoE 744B total / 40B active, 1M context, MIT-licensed, released ~mid-June 2026, ~$1.40/$4.40 per M tokens vs Opus ~$5/$25 ([trendingtopics](https://www.trendingtopics.eu/glm-5-2-chinas-zhipu-ai-beats-even-googles-top-models-with-its-new-open-llm/), [SCMP](https://www.scmp.com/tech/tech-trends/article/3357115/zhipu-ais-stock-rockets-after-chinese-firm-makes-glm-52-open-source)). The agentic loop/state-loss issues are documented for **GLM-4.7 and GLM-5**; expect GLM-5.2 to inherit the same Preserved-Thinking dependency (uncertain — not yet independently confirmed for 5.2 specifically).
+---
+## 5. Concrete routing rules (GLM vs Opus, with why)
+Fit weight scale: **-3 = strongly Opus … 0 = neutral … +2 = GLM-favored.**
+| # | Scenario | Route | Weight | Why |
+|---|---|---|---|---|
+| 1 | Single structured extraction / one function call, clean schema | **GLM** | +2 | GLM's schema adherence is best-in-class; cheap; no loop risk in one shot |
+| 2 | Short fan-out of *independent* tool calls (no cross-dependency) | **GLM** | +1 | No state-carry needed; plan-then-act is fine when steps don't depend on each other |
+| 3 | Boilerplate/CRUD/scaffolding codegen, no long tool loop | **GLM** | +2 | Self-contained, well-specified; ~10× cheaper; quality near-parity |
+| 4 | Docs / summarization / local refactor, few or no tool calls | **GLM** | +2 | Plays to GLM strengths; minimal orchestration |
+| 5 | Algorithmic codegen with a couple of verification calls | **GLM** | +1 | Short, mostly self-contained; verify output |
+| 6 | MCP-heavy task with many dependent tool calls | **Opus** | -3 | Dependent chains need interleaved adapt-to-result; GLM plan-then-act + loop risk |
+| 7 | Long-horizon agentic loop (20+ turns, build-on-prior-state) | **Opus** | -3 | Preserved-Thinking fragility → loops; Opus interleaves + harness preserves state |
+| 8 | Running in a 3rd-party harness with **unverified** `reasoning_content` passthrough | **Opus** | -3 | This is the exact trigger for GLM infinite loops (goose#7363, oh-my-pi#517) |
+| 9 | Multi-step debugging where each fix depends on prior tool output | **Opus** | -2 | Needs mid-execution replanning on results; GLM tends to lose the thread |
+| 10 | Subtle/long debugging, large multi-step refactor across many files | **Opus** | -3 | Long + complex; policy + capability both favor Opus |
+| 11 | Multi-turn coding *inside Claude Code* (state preserved), medium length | **GLM** then verify | 0 | Claude Code preserves thinking, so GLM is usable; still verify, escalate on drift |
+| 12 | Security-sensitive / proprietary / parallel-agent / latency-critical | **Opus** | -3 | Project policy hard rule regardless of capability |
+**Operational guardrails:**
+- If delegating an agentic task to GLM, ensure `clear_thinking: false` and that the harness re-sends `reasoning_content`. If you can't guarantee that, treat it as scenario #8 → Opus.
+- GLM-5/5.2 are *more* sensitive to dropped reasoning state than GLM-4.7. Bias newer GLM toward Opus for loops unless state handling is verified.
+- Always verify GLM output on anything past a single call; retry once, then escalate to Opus (per project policy).
+---
+## Confidence & caveats
+- **High confidence:** the loop/state-loss root cause (`reasoning_content` passthrough) — multiple independent harness bug reports converge.
+- **Medium confidence:** the "plan-then-act vs interleave" behavioral characterization — supported by one strong independent analysis (Cirra) plus vendor mode docs.
+- **Lower confidence / directional only:** absolute benchmark deltas — scaffold-sensitive and partly vendor-reported. GLM-5.2-specific agentic-loop behavior is extrapolated, not independently confirmed.