npm - glm-mcp-claude - Versions diffs - 1.0.0 - Mend

glm-mcp-claude 1.0.0

Files changed (27)

package/.mcp.json.example +14 -0
package/LICENSE +21 -0
package/README.md +220 -0
package/agents/glm.md +45 -0
package/assets/demo-glm-agent-umbrella.png +0 -0
package/assets/demo-glm-subagent-summary.png +0 -0
package/docs/AUTOSELECT.md +58 -0
package/docs/RULES.md +105 -0
package/docs/research/glm-capabilities.md +241 -0
package/docs/research/glm-failure-modes-routing.md +287 -0
package/docs/research/glm-misc-and-integration.md +180 -0
package/docs/research/glm-peak-usage-and-cost.md +146 -0
package/docs/research/glm-vs-opus-scenario-matrix.md +85 -0
package/docs/research/glm-vs-opus-toolcalling.md +134 -0
package/glm-mcp/.env.example +32 -0
package/glm-mcp/package-lock.json +1180 -0
package/glm-mcp/package.json +21 -0
package/glm-mcp/src/glmAgent.js +227 -0
package/glm-mcp/src/glmClient.js +136 -0
package/glm-mcp/src/index.js +306 -0
package/glm-mcp/src/loadEnv.js +24 -0
package/glm-mcp/src/router.js +291 -0
package/glm-mcp/src/smoke.js +42 -0
package/hooks/glm_subagent_router.mjs +206 -0
package/install.mjs +132 -0
package/package.json +47 -0
package/uninstall.mjs +47 -0

package/docs/research/glm-capabilities.md ADDED

@@ -0,0 +1,241 @@
+# GLM Capability Map — What GLM Is Best At (vs Claude Opus/Sonnet)
+> Scope: The GLM family from Zhipu AI / Z.ai used inside Claude Code as an alternative
+> to Anthropic Opus/Sonnet. Covers GLM-4.5, GLM-4.6, the GLM Coding Plan, and the
+> current flagship the user calls **"GLM 5.2"** (officially **GLM-5.2**, released
+> 13 Jun 2026). Focus: **what GLM is best at** — a detailed capability map.
+>
+> Research date: 2026-06-30. Many "beats Opus/GPT" claims are **vendor-reported**
+> and flagged inline. Benchmark numbers vary 2–5 pts across sources/providers.
+---
+## 0. TL;DR capability stance
+- **GLM is a genuine frontier-adjacent coding model** at ~1/6 the cost of Opus. Its
+  sweet spots are **frontend/UI generation, well-specified routine coding, tool
+  calling/MCP, and repo-scale context work** (1M tokens on GLM-5.2).
+- **Opus still clearly wins** on the hardest, longest, most open-ended work: large
+  multi-step refactors, subtle debugging, autonomous multi-hour agentic runs,
+  self-correction/replanning, and "design taste."
+- The smart pattern is **routing**: GLM as the cheap default for the bulk of tasks,
+  Opus reserved for the expensive-if-wrong minority.
+---
+## 1. Model lineup & specs (context window, output, thinking mode)
+| Model | Released | Arch | Context | Max output | Thinking mode | License |
+|---|---|---|---|---|---|---|
+| GLM-4.5 | 2025 | 355B MoE | 128K | 96K | Auto-determined CoT | Open (MIT) |
+| GLM-4.6 | 30 Sep 2025 | 357B MoE | **200K** (~202,752) | **128K** (~131,072) | Auto-determined CoT | Open (MIT) |
+| GLM-5 | 11 Feb 2026 | 744B MoE / 40B active | 200K | 131,072 | Auto | Open (MIT) |
+| GLM-5.1 | Apr 2026 | 744B MoE / 40B active | 200K | 131,072 | Auto | Open (MIT) |
+| **GLM-5.2** | **13 Jun 2026** | **~753B MoE** | **1,000,000 (1M)** | 131,072 | Auto | Open (MIT) |
+| GLM-4.7 | (variant) | — | — | — | **Forced** thinking | Open |
+Notes:
+- **Thinking/reasoning mode**: GLM-4.5/4.6 and the GLM-5.x line *auto-determine*
+  whether to engage chain-of-thought (the `thinking` parameter defaults to enabled).
+  GLM-4.7 and GLM-4.5V use **forced** thinking. ([z.ai docs](https://docs.z.ai/guides/overview/concept-param))
+- GLM-4.6 expanded context 128K → 200K and added tool use *during* inference.
+  ([HowAIWorks](https://howaiworks.ai/blog/glm-4-6-announcement), [CometAPI](https://www.cometapi.com/what-is-glm-4-6/))
+- **GLM-5.2's 1M context is its headline feature** — described as a "stable 1M-token
+  window," explicitly aimed at repo-scale long-horizon coding. ([VentureBeat](https://venturebeat.com/technology/z-ais-open-weights-glm-5-2-beats-gpt-5-5-on-multiple-long-horizon-coding-benchmarks-for-1-6th-the-cost), [TheAIRankings](https://theairankings.com/zhipu/glm-5/))
+- ⚠️ The user's "GLM 5.2" = GLM-5.2. If they are on an older GLM Coding Plan they may
+  actually be served **GLM-4.6** or **GLM-5/5.1** — capabilities differ a lot
+  between 4.6 and 5.2, so confirm which is actually wired into their Claude Code.
+---
+## 2. Coding ability overall — benchmarks vs Claude & GPT
+### GLM-4.6 era (the original "GLM coding plan" model)
+| Benchmark | GLM-4.6 | Claude Sonnet 4.5 | Notes |
+|---|---|---|---|
+| LiveCodeBench v6 | **82.8%** | 70.1% | GLM **wins** (contamination-resistant). Up from GLM-4.5's 63.3%. |
+| SWE-bench Verified | ~68.0% | **77.2%** | Claude **wins** (real GitHub issue fixing). |
+| AIME-25 (math) | **93.9%** | 87.0% | GLM **wins**. |
+| CC-Bench multi-turn | 48.6% win rate **vs Sonnet 4** | — | ⚠️ vs Sonnet **4**, not 4.5. ~5–7× cheaper. |
+Sources: [Cirra](https://cirra.ai/articles/glm-4-6-vs-claude-sonnet-comparison),
+[adam.holter.com](https://adam.holter.com/glm-4-6-vs-claude-sonnet-4-5-benchmarks-capabilities-and-cost-effectiveness/),
+[IntuitionLabs](https://intuitionlabs.ai/articles/glm-4-6-open-source-coding-model).
+Consensus: GLM-4.6 ≈ Claude Sonnet 4 level, **trails Sonnet 4.5** on real
+repo-level work but **wins on isolated code-gen/algorithmic** benchmarks.
+### GLM-5 / 5.1 / 5.2 era (current)
+| Benchmark | GLM-5.2 | GLM-5.1 | Claude Opus 4.8 | GPT-5.5 |
+|---|---|---|---|---|
+| SWE-bench Verified | — | 77.8% | ~80.8–81.4% (Opus 4.6) | 80.0% (GPT-5.2) |
+| SWE-bench **Pro** | 62.1 | 58.4 | **69.2** | 58.6 |
+| FrontierSWE (long-horizon) | 74.4% | — | **75.1%** | 72.6% |
+| Terminal-Bench 2.1 | 81.0% | — | — | — |
+| NL2Repo | 48.9 | — | **69.7** | — |
+| SWE-Marathon | 13.0 | — | **26.0** | — |
+| MCP-Atlas (tool use) | **77.0** | — | 75.3 | — |
+| Agentic aggregate avg | **81** | — | 80.1 | — |
+Sources: [VentureBeat](https://venturebeat.com/technology/z-ais-open-weights-glm-5-2-beats-gpt-5-5-on-multiple-long-horizon-coding-benchmarks-for-1-6th-the-cost),
+[digitalapplied GLM-5.2](https://www.digitalapplied.com/blog/glm-5-2-benchmarks-open-weights-vs-claude-opus),
+[MindStudio agentic](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows),
+[Serenities GLM-5.1](https://serenitiesai.com/articles/glm-5-1-zhipu-coding-benchmark-claude-opus-comparison-2026).
+**Reading the numbers:**
+- GLM-5.2 is the **top open-weight model** per independent Artificial Analysis.
+- It **ties/edges Opus on agentic & frontend aggregates and tool use (MCP-Atlas)**.
+- Opus **pulls clearly ahead on the hardest long-horizon benchmarks**: SWE-bench Pro
+  (69.2 vs 62.1), NL2Repo (69.7 vs 48.9), SWE-Marathon (26.0 vs 13.0). The gap is
+  ~7 pts on Pro but shrinks to under 1 pt on FrontierSWE (75.1 vs 74.4).
+- ⚠️ Many headline "beats GPT-5.5 / near-Opus" figures are **Zhipu-reported** and
+  pending independent corroboration. Benchmark *names* matter: SWE-bench **Pro** ≠
+  SWE-bench **Verified** — don't cross-compare them.
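The gaps in the bullets above can be recomputed directly from the table. The following is a minimal sketch that uses only the GLM-5.2 and Claude Opus 4.8 columns; the `scores` object and the `gap` helper are illustrative names and are not part of the package.

```javascript
// Scores copied from the GLM-5.2 vs Claude Opus 4.8 columns of the table above.
// `scores` and `gap` are illustrative names, not part of glm-mcp.
const scores = {
  "SWE-bench Pro": { glm52: 62.1, opus48: 69.2 },
  "FrontierSWE": { glm52: 74.4, opus48: 75.1 },
  "NL2Repo": { glm52: 48.9, opus48: 69.7 },
  "SWE-Marathon": { glm52: 13.0, opus48: 26.0 },
  "MCP-Atlas": { glm52: 77.0, opus48: 75.3 },
};

// Positive result: Opus is ahead. Negative result: GLM is ahead.
// Rounded to one decimal to hide floating-point noise.
function gap(benchmark) {
  const { glm52, opus48 } = scores[benchmark];
  return Math.round((opus48 - glm52) * 10) / 10;
}

for (const name of Object.keys(scores)) {
  console.log(`${name}: Opus minus GLM = ${gap(name)} pts`);
}
```

The comparison only makes sense within one benchmark row; as the last bullet warns, a SWE-bench Pro score should not be subtracted from a SWE-bench Verified score.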
+---
+## 3. Frontend (React / HTML / CSS / UI generation) — **GLM's standout strength**
+- Z.ai explicitly tunes GLM for **"superior aesthetics and logical layout in
+  frontend code."** ([z.ai docs](https://docs.z.ai/guides/llm/glm-4.6))
+- GLM-5.2 ranks **#2 on LMArena Code Arena Frontend**, above Opus 4.7 and Opus 4.8
+  in thinking mode (developer-judged), and **comes within 0.7 pts of Opus 4.8 on
+  FrontierSWE (74.4 vs 75.1)**. That is an MIT-licensed model beating closed flagships
+  on frontend, as judged by developers. ([MindStudio UI](https://www.mindstudio.ai/blog/glm-5-2-vs-claude-opus-4-8-ui-generation))
+- Hands-on (GLM-4.6): a built payment-platform site was "polished, no visible
+  mistakes on first review… animations, vibrant colors… on par with Claude Sonnet 4."
+  ([KDnuggets](https://www.kdnuggets.com/vibe-coding-with-glm-46-coding-plan))
+**Strengths:** high-volume UI scaffolding, dashboards, component libraries, structured
+layouts — near-Opus quality at a fraction of cost.
+**Weakness:** **design taste** — when creative visual judgment/interpretation matters,
+Opus wins. GLM is the volume/cost pick; Opus is the taste pick. ([MindStudio UI](https://www.mindstudio.ai/blog/glm-5-2-vs-claude-opus-4-8-ui-generation))
+---
+## 4. Backend (APIs, databases, systems, refactors)
+**Strengths:**
+- **Routine, well-specified changes** are reliable: add a model field, update an API
+  endpoint, refactor a single function. ([MindStudio](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows))
+- **Repo-scale context (GLM-5.2, 1M)** is "genuinely transformative" — dump a
+  500-file monorepo subset, skip RAG/pruning, make a decision with full context.
+- Python / JavaScript / Java are the explicitly optimized backend languages.
+**Weaknesses:**
+- **Large, multi-step refactors**: Opus "rarely loses the plan on a 30-step refactor,"
+  rarely hallucinates a function signature; GLM is less reliable here.
+- Hardest backend benchmarks favor Opus (NL2Repo, SWE-Marathon, SWE-bench Pro).
+- ⚠️ Caveat repeated across sources: **nobody has publicly run GLM-5.2 as an agent over
+  a real 200K-line repo and reported results** — validate long-horizon claims on your
+  own branch. ([MindStudio](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows))
+---
+## 5. Agentic / tool-use / long-horizon reliability
+**Strengths:**
+- **Tool calling is clean and schema-adherent**: GLM-4.6 "refuses unknown tools and
+  minimizes invented arguments," aiming for near-zero tool-call hallucination — less
+  cleanup of malformed output. ([Cirra tool calling](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis))
+- GLM-4.5 hit a **90.6% tool-calling success rate** in Z.ai's own agentic coding
+  evaluation (vendor-reported; this is not its BrowseComp score); GLM-5.2 scores
+  **77.0 on MCP-Atlas vs Opus 75.3**. Built with agents/MCP in mind.
+- Excellent at **well-defined, explicitly-stepped agentic tasks**.
+**Weaknesses (the real gap):**
+- **Self-correction & replanning**: GLM executes defined sub-tasks well but "struggles
+  with the self-correcting behavior that makes truly agentic coding reliable." Opus
+  recognizes bad output and course-corrects without being told. ([MindStudio](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows))
+- **Goal drift / "escapism"** in long debugging: a research trajectory on SWE-bench
+  Django #11149 showed GLM-4.6 wandering through irrelevant modules and dodging env
+  errors with non-representative scripts ("agent collapse"). ([GLM-5 paper](https://arxiv.org/pdf/2602.15763))
+- **Long-horizon autonomy gap**: on a τ²-style agent test GLM scored ~75.9% vs Claude's 88.1%;
+  Claude has demonstrated 30+ hr continuous sessions that GLM has not matched (GLM-4.6 era).
+- **GUI/computer control** is only rudimentary (deprioritized); Claude leads on
+  browser/desktop control. ([Cirra](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis))
+- Inside Claude Code, **vision needs an extra MCP server** with GLM (Claude is native).
+  ([ruidiao](https://ruidiao.substack.com/p/two-days-with-glm-as-my-claude-code))
+---
+## 6. Languages — strongest / weakest
+- **Strongest (explicitly optimized):** **Python, JavaScript, Java** — Z.ai documents
+  these by name, with frontend aesthetics emphasis. ([z.ai docs](https://docs.z.ai/guides/llm/glm-4.6))
+- LiveCodeBench v6 (multi-language write/run/debug): GLM-4.6 **82.8%**.
+- **Weakest:** ⚠️ no GLM-specific Rust/Go/C/C++ numbers published. Industry-wide
+  pattern (Multi-SWE-bench) is that all models score far higher on Python than Go,
+  Rust, C, C++ — treat GLM's systems-language output as **less reliable, verify more.**
+  ([Multi-SWE-bench](https://arxiv.org/pdf/2504.02605))
+## 7. Multilingual / non-English
+- Chinese-origin model; strong natural-language multilingual. GLM-4.6 notes optimized
+  translation for French, Russian, Japanese, Korean and informal/role-play contexts.
+- ⚠️ No specific evidence that *coding* quality differs by the developer's natural
+  language; Chinese-language tasks are likely a relative strength but unbenchmarked here.
+## 8. Where GLM matches Opus vs clearly falls short
+**Matches / beats Opus:**
+- Frontend/UI (Code Arena Frontend #2; ties FrontierSWE), isolated code-gen
+  (LiveCodeBench), math (AIME), tool-call hygiene (MCP-Atlas), agentic *aggregate*,
+  cost (≈1/6), context size (1M), open weights / data residency / self-host.
+**Falls clearly short of Opus:**
+- Large multi-step refactors, subtle/long debugging (goal drift), open-ended
+  autonomous planning & self-correction, longest agentic benchmarks (SWE-bench Pro,
+  NL2Repo, SWE-Marathon), GUI/computer control, design taste, native vision in
+  Claude Code, and raw track-record/reliability on high-stakes codebases.
+---
+## 9. Capability matrix — **use GLM for X / use Opus for Y**
+| Task type | Use GLM | Use Opus | Why |
+|---|---|---|---|
+| Boilerplate / scaffolding | ✅ **GLM** | | Cheap, fast, reliable on well-specified output |
+| Simple CRUD / single-endpoint APIs | ✅ **GLM** | | Well-defined = GLM's strength |
+| Frontend UI / dashboards / components | ✅ **GLM** | (taste-critical → Opus) | Frontend is GLM's standout; Opus only for design taste |
+| Routine refactor (one function/field) | ✅ **GLM** | | Defined, local scope |
+| Large multi-file / 30-step refactor | | ✅ **Opus** | Opus holds the plan; GLM drifts |
+| Repo-scale read/analysis (huge codebase) | ✅ **GLM-5.2 (1M ctx)** | | 1M context = no RAG needed |
+| Subtle / long debugging | | ✅ **Opus** | GLM goal-drift & "escapism" |
+| Complex architecture / build-from-spec | | ✅ **Opus** | Open-ended planning + self-correction |
+| Tool calling / MCP-heavy workflows | ✅ **GLM** | | Clean schema adherence; MCP-Atlas > Opus |
+| Long-horizon autonomous agent (hours) | | ✅ **Opus** | GLM hasn't matched sustained autonomy |
+| GUI / browser / desktop control | | ✅ **Opus** | GLM rudimentary |
+| Security-sensitive code | | ✅ **Opus** | Reliability/track record; verify GLM closely |
+| Systems langs (Rust/Go/C/C++) | (verify) | ✅ **Opus** | GLM unbenchmarked, weaker training data |
+| Math / algorithmic codegen | ✅ **GLM** | | AIME 93.9%, LiveCodeBench 82.8% |
+| Research / summarization over big docs | ✅ **GLM-5.2** | | 1M context + cheap |
+| Vision tasks in Claude Code | | ✅ **Opus** | GLM needs extra MCP server |
+| High-volume / cost-constrained anything | ✅ **GLM** | | ≈1/6 the price; route the cheap 80% here |
+**Cost anchor:** GLM-5.2 ≈ $1.40 in / $4.40 out per M tokens (~$5.80 combined) vs
+Opus 4.8 $5 / $25 (~$30 combined) and GPT-5.5 $5 / $30 (~$35 combined). GLM Coding Plan starts ~$3/mo.
+([MindStudio agentic](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows),
+[KDnuggets](https://www.kdnuggets.com/vibe-coding-with-glm-46-coding-plan))
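The cost anchor can be checked with a few lines of arithmetic. This sketch assumes the per-million-token prices quoted above and an example workload of 1M input plus 1M output tokens; the `prices` keys and the `costUSD` helper are hypothetical names chosen for illustration.

```javascript
// Per-million-token list prices (USD) as quoted in the cost anchor above.
// Model keys are illustrative labels, not official API model IDs.
const prices = {
  "glm-5.2": { input: 1.4, output: 4.4 },
  "opus-4.8": { input: 5, output: 25 },
  "gpt-5.5": { input: 5, output: 30 },
};

// Cost in USD for a workload measured in tokens.
function costUSD(model, inputTokens, outputTokens) {
  const p = prices[model];
  return (inputTokens * p.input + outputTokens * p.output) / 1e6;
}

// Example workload (an assumption): 1M input tokens and 1M output tokens.
const glm = costUSD("glm-5.2", 1e6, 1e6);
const opus = costUSD("opus-4.8", 1e6, 1e6);
const gpt = costUSD("gpt-5.5", 1e6, 1e6);

console.log(`GLM-5.2 $${glm.toFixed(2)}, Opus 4.8 $${opus.toFixed(2)}, GPT-5.5 $${gpt.toFixed(2)}`);
console.log(`GLM / GPT-5.5 = ${(glm / gpt).toFixed(2)}, GLM / Opus = ${(glm / opus).toFixed(2)}`);
```

On this even mix GLM-5.2 costs about 1/6 of GPT-5.5 and about 1/5 of Opus 4.8. Input-heavy workloads narrow the saving against Opus, because the input price ratio (1.40 / 5 = 0.28) is less favorable than the output ratio (4.40 / 25 ≈ 0.18).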
+**Recommended architecture:** route by complexity — GLM as default for routine/
+high-volume/frontend/repo-scale, escalate to Opus for expensive-if-wrong work
+(large refactors, subtle bugs, security, long autonomous runs).
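One way to read the capability matrix is as a plain lookup table. The sketch below is an illustration only: the task-type keys, the `"glm"` / `"opus"` labels and the `pickModel` function are hypothetical and do not describe the package's actual `router.js`. Falling back to GLM for unknown task types is an assumption drawn from the "GLM as default" recommendation.

```javascript
// Task type -> model, distilled from the capability matrix above.
// Keys and labels are illustrative, not the package's router API.
const MATRIX = {
  boilerplate: "glm",
  crud: "glm",
  frontend: "glm",
  "routine-refactor": "glm",
  "large-refactor": "opus",
  "repo-analysis": "glm",
  "subtle-debugging": "opus",
  architecture: "opus",
  "tool-calling": "glm",
  "long-autonomy": "opus",
  "gui-control": "opus",
  security: "opus",
  "systems-language": "opus",
  "math-algorithmic": "glm",
  summarization: "glm",
  vision: "opus",
};

function pickModel(taskType, { tasteCritical = false } = {}) {
  // Frontend is GLM's standout, except where design taste decides the outcome.
  if (taskType === "frontend" && tasteCritical) return "opus";
  // Assumption: unknown task types fall back to the cheap default.
  return MATRIX[taskType] ?? "glm";
}

console.log(pickModel("frontend"));
console.log(pickModel("frontend", { tasteCritical: true }));
console.log(pickModel("security"));
```

A stricter router could instead send unknown task types to Opus, trading cost for safety on work whose risk is not yet classified.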
+---
+## Sources
+- Cirra — [GLM-4.6 vs Sonnet](https://cirra.ai/articles/glm-4-6-vs-claude-sonnet-comparison), [tool calling/MCP](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis)
+- [adam.holter.com — GLM-4.6 vs Sonnet 4.5](https://adam.holter.com/glm-4-6-vs-claude-sonnet-4-5-benchmarks-capabilities-and-cost-effectiveness/)
+- [IntuitionLabs — GLM-4.6 open-source coding](https://intuitionlabs.ai/articles/glm-4-6-open-source-coding-model)
+- Z.AI docs — [params/thinking](https://docs.z.ai/guides/overview/concept-param), [GLM-4.6](https://docs.z.ai/guides/llm/glm-4.6), [GLM-5.1](https://docs.z.ai/guides/llm/glm-5.1)
+- [OpenRouter — GLM-4.6](https://openrouter.ai/z-ai/glm-4.6) · [CometAPI](https://www.cometapi.com/what-is-glm-4-6/) · [HowAIWorks](https://howaiworks.ai/blog/glm-4-6-announcement)
+- [VentureBeat — GLM-5.2](https://venturebeat.com/technology/z-ais-open-weights-glm-5-2-beats-gpt-5-5-on-multiple-long-horizon-coding-benchmarks-for-1-6th-the-cost)
+- MindStudio — [UI generation](https://www.mindstudio.ai/blog/glm-5-2-vs-claude-opus-4-8-ui-generation), [agentic workflows](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows), [GLM-5.2 in Claude Code](https://www.mindstudio.ai/blog/how-to-use-glm-5-2-in-claude-code)
+- [digitalapplied — GLM-5.2 benchmarks](https://www.digitalapplied.com/blog/glm-5-2-benchmarks-open-weights-vs-claude-opus)
+- [Serenities — GLM-5.1 vs Opus](https://serenitiesai.com/articles/glm-5-1-zhipu-coding-benchmark-claude-opus-comparison-2026) · [TheAIRankings — GLM-5.2](https://theairankings.com/zhipu/glm-5/)
+- [KDnuggets — GLM-4.6 coding plan](https://www.kdnuggets.com/vibe-coding-with-glm-46-coding-plan) · [ruidiao — GLM as Claude Code backend](https://ruidiao.substack.com/p/two-days-with-glm-as-my-claude-code)
+- [GLM-5 paper (arXiv)](https://arxiv.org/pdf/2602.15763) · [Multi-SWE-bench (arXiv)](https://arxiv.org/pdf/2504.02605)
+> ⚠️ Uncertainty flags: GLM-5.2 "beats Opus/GPT" claims largely vendor-reported;
+> SWE-bench Pro ≠ Verified; GLM-5.2 long-horizon real-repo behavior not independently
+> stress-tested; no GLM-specific Rust/Go numbers; exact context/output figures vary by
+> provider. Confirm which GLM version the user's Coding Plan actually serves.

package/docs/research/glm-failure-modes-routing.md ADDED

@@ -0,0 +1,287 @@
+# GLM Failure Modes & GLM-vs-Opus Routing Rules
+**Purpose:** Turn known GLM (Zhipu/Z.ai) failure modes and special conditions into concrete,
+implementable routing conditions for a GLM-vs-Opus delegation router. GLM is ~10x cheaper than
+Anthropic Claude Opus, so the default bias is "delegate to GLM unless a condition below fires."
+**Date compiled:** 2026-06-30
+**Models in scope:** GLM-4.6, GLM-4.7, GLM-4.7-Flash, GLM-5.1, GLM-5.2 (current Z.ai coding-plan
+default), GLM-5V-Turbo / GLM-4.6V (vision). Opus reference points: Claude Opus 4.6 / 4.8.
+> **Evidence-quality caveat (read first):** Much of the public material is vendor marketing
+> (Z.ai blogs, reseller blogs) or single-run anecdotes. Independent, rigorous, GLM-specific
+> benchmarks are scarce. Where a claim is vendor-reported or anecdotal it is flagged. Benchmark
+> variance across runs is high — one reviewer warns a single run "isn't enough to assert
+> anything about absolute model quality." Treat the thresholds below as conservative defaults,
+> not measured cliffs.
+---
+## 1. Long context: advertised vs usable
+**Findings**
+- **The "1M context" is mostly a GLM-5.2[1m] thing, not the 4.x line.** GLM-4.7/4.6 top out
+  architecturally around **~200K tokens** (`max_position_embeddings = 202752`), with a soft
+  `model_max_length = 128000` in the tokenizer config and a 128K *output* cap. Practical usable
+  input is closer to ~200K once system-prompt/special-token overhead is subtracted.
+  ([HF discussion](https://huggingface.co/zai-org/GLM-4.7/discussions/33),
+  [automatio.ai](https://automatio.ai/models/glm-4-7),
+  [macaron.im](https://macaron.im/blog/what-is-glm-4-7))
+- **General long-context degradation is real and starts well before the advertised limit.** The
+  industry pattern (not GLM-specific) is that "agents with context length up to 1 million tokens
+  show severe degradation already at 100K tokens," and "even with 200K tokens severe performance
+  degradation is observed." ([arxiv 2512.02445](https://arxiv.org/pdf/2512.02445))
+- No published GLM-specific RULER / needle-in-haystack degradation curve surfaced — **this is an
+  uncertainty.** The historical GLM-4 did retain retrieval past 64K better than some peers
+  ([arxiv 2411.10137](https://arxiv.org/pdf/2411.10137)), but that does not transfer cleanly to
+  4.6/4.7/5.2.
+- Structured-output quality (tool-call JSON) specifically degrades in long contexts for GLM-5/5.1
+  even when short-context calls are clean (see §10).
+**Routing condition**
+- Input < ~64K tokens → **GLM** (safe zone).
+- ~64K–128K tokens → **GLM, but only for retrieval/summarization-style tasks**; for
+  correctness-critical reasoning over the whole context, prefer Opus.
+- 128K–200K tokens → switch GLM to **`glm-5.2[1m]`** if the task must stay on GLM; otherwise
+  **Opus**. Avoid 4.6/4.7 here.
+- \> 200K tokens → **Opus**, or `glm-5.2[1m]` only if cost dominates and the task is
+  retrieval/extraction (not multi-hop reasoning) — and verify output.
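The four bands above translate into a small decision function. This is a sketch under stated assumptions: the function name, option names and return shape are hypothetical, and because the source leaves the exact 128K boundary ambiguous, 128K is placed in the more conservative upper band here.

```javascript
// Context-length routing following the bands above.
// All names are illustrative; thresholds are conservative defaults, not measured cliffs.
const K = 1000;

function routeByContext(inputTokens, opts = {}) {
  const { retrievalStyle = false, mustStayOnGlm = false, costDominates = false } = opts;

  // Under ~64K: safe zone for GLM.
  if (inputTokens < 64 * K) return { model: "glm", verify: false };

  // ~64K to under 128K: GLM only for retrieval/summarization-style work.
  if (inputTokens < 128 * K) {
    return { model: retrievalStyle ? "glm" : "opus", verify: false };
  }

  // 128K to 200K: the 1M variant if the task must stay on GLM, otherwise Opus.
  if (inputTokens <= 200 * K) {
    return { model: mustStayOnGlm ? "glm-5.2[1m]" : "opus", verify: false };
  }

  // Over 200K: Opus, unless cost dominates and the task is retrieval/extraction,
  // in which case the output must be verified.
  if (costDominates && retrievalStyle) return { model: "glm-5.2[1m]", verify: true };
  return { model: "opus", verify: false };
}

console.log(routeByContext(30 * K));
console.log(routeByContext(150 * K, { mustStayOnGlm: true }));
console.log(routeByContext(500 * K, { costDominates: true, retrievalStyle: true }));
```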
+---
+## 2. Long-horizon autonomy / goal drift
+**Findings**
+- **GLM-4.6 measurably drifts on long-horizon tasks.** A study of execution trajectories on a
+  SWE-bench Django permission bug found baseline GLM-4.6 "suffers from goal drift, wandering
+  through irrelevant modules for multiple turns," plus "escapism" (ignoring env-config errors to
+  fall back on simplistic non-representative scripts). ([arxiv 2602.02619](https://arxiv.org/pdf/2602.02619))
+- On the τ² agent benchmark GLM trailed Claude (75.9% vs 88.1%). Claude has demonstrated 30+ hour
+  continuous sessions; Opus 4.6 holds the longest *published* autonomous horizon (50% completion at
+  ~14.5h). ([creolestudios](https://www.creolestudios.com/glm-5-vs-claude-opus-4-6-performance-pricing-agentic-coding-comparison/),
+  [mindstudio](https://www.mindstudio.ai/blog/best-open-source-llms-agentic-coding-2026))
+- **Z.ai has explicitly engineered against drift in newer models:** GLM-4.7 "Preserved Thinking"
+  (retains reasoning across turns), and GLM-5.1 claims up to **8-hour** autonomous loops with
+  "stronger sustained execution." These are *vendor claims*; Claude still holds the longest
+  published horizon. ([adam.holter.com](https://adam.holter.com/glm-4-7-z-ais-open-weights-coding-model-pushes-harder-on-agents-tools-and-ui/),
+  [docs.z.ai GLM-5.1](https://docs.z.ai/guides/llm/glm-5.1))
+- Recurring framing: GLM is good at "do the steps"; Opus is safer when the job is "be correct
+  across complexity" (audits, migrations, repo-wide changes).
+**Routing condition**
+- ≤ ~8 sequential tool-using steps / a single well-scoped feature → **GLM**.
+- ~8–20 steps with checkpoints and a clear spec → **GLM (5.1+ preferred)**, but require
+  verification at the end.
+- \> ~20 steps, OR unsupervised multi-hour autonomy, OR success depends on holding the original
+  goal across many turns (migration, repo-wide refactor) → **Opus**.
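The step-count bands above can be sketched the same way. The function and field names are hypothetical. The source does not say what to do with an 8 to 20 step task that lacks checkpoints or a clear spec; escalating that case to Opus is an assumption made here.

```javascript
// Long-horizon routing following the step bands above. Names are illustrative.
function routeByHorizon({
  steps,
  hasCheckpoints = false,
  hasClearSpec = false,
  multiHourUnsupervised = false,
  mustHoldGoalAcrossTurns = false,
}) {
  // Over ~20 steps, unsupervised multi-hour runs, or goal-holding work
  // (migrations, repo-wide refactors) go to Opus.
  if (steps > 20 || multiHourUnsupervised || mustHoldGoalAcrossTurns) {
    return { model: "opus", verifyAtEnd: false };
  }

  // Up to ~8 sequential tool-using steps: GLM.
  if (steps <= 8) return { model: "glm", verifyAtEnd: false };

  // ~8 to 20 steps: GLM (5.1+ preferred) only with checkpoints and a clear spec,
  // and always with verification at the end.
  if (hasCheckpoints && hasClearSpec) {
    return { model: "glm-5.1+", verifyAtEnd: true };
  }

  // Assumption: the mid band without checkpoints or a spec is escalated.
  return { model: "opus", verifyAtEnd: false };
}

console.log(routeByHorizon({ steps: 5 }));
console.log(routeByHorizon({ steps: 12, hasCheckpoints: true, hasClearSpec: true }));
console.log(routeByHorizon({ steps: 6, mustHoldGoalAcrossTurns: true }));
```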
+---
+## 3. Hallucination on obscure / newer APIs and libraries
+**Findings**
+- GLM-4.6's *tool-calling* is relatively disciplined: it "will refuse unknown tools and minimize
+  invented arguments" and adheres tightly to provided schemas.
+  ([cirra.ai tool calling](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis))
+- **But like all LLMs it still hallucinates factual/library details**, and GLM-4.6 specifically was
+  noted to sometimes "fix the immediate error but break something else." Schema-field hallucination
+  was reportedly *worse* in 4.6 and improved in 4.7. ([cirra](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis),
+  [macaron 4.7](https://macaron.im/blog/what-is-glm-4-7))
+- Concrete cross-model warning: a hallucinated API tested against a hallucinated implementation
+  produced 34 green tests that proved nothing (this example was Opus, but it illustrates the
+  "confidently wrong with green tests" risk that applies doubly to a cheaper model on niche APIs).
+  ([akitaonrails](https://akitaonrails.com/en/2026/04/18/llm-benchmarks-part-2-multi-model/))
+- GLM's training is Chinese-English heavy and its knowledge cutoff/coverage of brand-new or niche
+  Western libraries is uncertain — **higher hallucination risk on obscure/post-cutoff APIs.**
+**Routing condition**
+- Mainstream, well-documented APIs/frameworks → **GLM**.
+- Niche / proprietary / very new (post-cutoff) library, or an internal/private API GLM can't have
+  seen → **paste the authoritative docs into the GLM prompt** (GLM can't fetch them). If docs can't
+  be supplied, or correctness is critical → **Opus**.
+- Any task where "confidently wrong with passing tests" is high-cost → Opus, or GLM + independent
+  verification of the actual API surface.
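A compact way to express the three conditions above is shown below. This is a sketch only: the `apiKind` values, flag names and `note` field are hypothetical, and whether docs have actually been pasted into the prompt is something the calling harness would have to determine.

```javascript
// API-familiarity routing following the conditions above. Names are illustrative.
function routeByApiFamiliarity({
  apiKind, // "mainstream" or anything else (niche, proprietary, post-cutoff, internal)
  docsInPrompt = false,
  correctnessCritical = false,
  falseGreenIsCostly = false,
}) {
  // Where "confidently wrong with passing tests" is expensive, prefer Opus
  // (the alternative in the source is GLM plus independent API verification).
  if (falseGreenIsCostly) {
    return { model: "opus", note: "or GLM with independent verification of the API surface" };
  }

  // Mainstream, well-documented APIs are GLM territory.
  if (apiKind === "mainstream") return { model: "glm", note: "" };

  // Anything GLM is unlikely to have seen needs authoritative docs in the prompt,
  // and still goes to Opus when correctness is critical.
  if (docsInPrompt && !correctnessCritical) {
    return { model: "glm", note: "authoritative docs supplied in prompt" };
  }
  return { model: "opus", note: "" };
}

console.log(routeByApiFamiliarity({ apiKind: "mainstream" }));
console.log(routeByApiFamiliarity({ apiKind: "internal", docsInPrompt: true }));
console.log(routeByApiFamiliarity({ apiKind: "internal" }));
```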
+---
+## 4. Refusals / over-rigidity
+**Findings**
+- **No GLM-specific over-refusal benchmark surfaced.** General LLM literature documents
+  over-refusal (benign prompts rejected for surface keywords, e.g. "how to kill a python
+  process"), but nothing quantifies GLM-4.6/4.7 false-refusal rates.
+  ([arxiv ORFUZZ 2508.11222](https://arxiv.org/pdf/2508.11222),
+  [XSTest/OR-Bench context](https://arxiv.org/pdf/2510.10390))
+- Anecdotally GLM-4.6 is described as having "simpler guardrails" and being faster partly because
+  of that — suggesting it refuses *less*, not more, than heavily-aligned models.
+  ([cirra cost analysis](https://cirra.ai/articles/glm-4-6-vs-claude-sonnet-comparison))
+- **This is an uncertainty / low-signal area.** Treat refusals as a retry-then-escalate event
+  rather than a pre-routing condition.
+**Routing condition**
+- Do **not** pre-route based on refusal risk (insufficient evidence).
+- Operational rule: if GLM refuses a benign task, **retry once** with clarified intent; if it still
+  refuses → **escalate to Opus**.
+- Genuinely sensitive/dual-use security content: route to **Opus** for policy reasons regardless
+  (already covered by the "security-sensitive → Opus" project rule).
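The retry-then-escalate rule can be sketched as a small async wrapper. Everything injected here is hypothetical: `callGlm`, `callOpus` and `isRefusal` stand in for whatever client and refusal detector the harness provides, and the default `clarify` wording is only an example of restating intent.

```javascript
// Retry once on refusal, then escalate. All dependencies are caller-supplied
// hypothetical callbacks; nothing here is the package's actual client API.
async function runWithRefusalEscalation(task, deps) {
  const {
    callGlm,
    callOpus,
    isRefusal,
    clarify = (t) => `${t}\n\nIntent: routine, benign software-engineering task.`,
    sensitive = false,
  } = deps;

  // Genuinely sensitive or dual-use content skips GLM for policy reasons.
  if (sensitive) {
    return { model: "opus", glmAttempts: 0, output: await callOpus(task) };
  }

  let output = await callGlm(task);
  if (!isRefusal(output)) return { model: "glm", glmAttempts: 1, output };

  // Retry exactly once with clarified intent.
  output = await callGlm(clarify(task));
  if (!isRefusal(output)) return { model: "glm", glmAttempts: 2, output };

  // Still refused: escalate to Opus with the original task.
  return { model: "opus", glmAttempts: 2, output: await callOpus(task) };
}
```

Refusal risk is not used to choose a model up front, which matches the "do not pre-route" condition; the wrapper only reacts after a refusal is observed.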
+---
+## 5. Non-English / multilingual
+**Findings**
+- **GLM's clearest strength: native Chinese + Chinese-English bilingual.** Built Chinese-first;
+  handles code-switching, mixed-language prompts, and translation with fewer hallucination
+  artifacts than Western-centric models. Widely called a multilingual leader, esp. for
+  Chinese/APAC. ([avenchat](https://avenchat.com/blog/glm-5.2-review),
+  [mindstudio GLM-5.2](https://www.mindstudio.ai/blog/what-is-glm-5-2-open-weight-model-2))
+- **Caveat:** for English/European tasks requiring deep cultural nuance, the Chinese-heavy corpus
+  may be a slight disadvantage vs the best Western models — vendor sources recommend testing per
+  use case. ([mindstudio](https://www.mindstudio.ai/blog/what-is-glm-5-2-open-weight-model-2))
+**Routing condition**
+- Chinese-language or Chinese-English bilingual task → **GLM (prefer for quality AND cost).**
+- General English coding/text → GLM is fine (near-Opus on coding benchmarks).
+- High-stakes English/European *cultural-nuance* copy (marketing, legal tone, brand voice) → lean
+  **Opus** when quality matters more than cost.
+---
+## 6. Vision / image / screenshot / GUI / computer-use
+**Findings**
+- **The base coding models (GLM-4.6/4.7) are text models; vision lives in separate models**
+  (GLM-5V-Turbo, GLM-4.6V). In Claude Code against GLM-4.7, **pasting images is unreliable** — the
+  client transcodes and bypasses the vision path, producing "weird" output. Fix is Z.ai's Vision
+  MCP server. ([devgenius vision MCP](https://blog.devgenius.io/fixing-glm-4-7-image-parsing-in-claude-code-add-the-z-ai-vision-mcp-server-f1c275d7cf3f))
+- Z.ai's dedicated vision models are strong on **design-to-code / GUI** and claim wins over Opus
+  (e.g. Design2Code 94.8 vs Opus 4.6 77.3) — **vendor-reported.**
+  ([agentnativedev](https://agentnativedev.medium.com/glm-5v-turbo-beats-opus-4-6-on-multimodal-benchmarks-f6376822eb32),
+  [wavespeed](https://wavespeed.ai/blog/posts/glm-5v-turbo-vs-gpt-4o-vision-ui-coding/))
+- Both Claude and GLM have documented GUI *grounding* weaknesses (misreading cells, double-click
+  semantics) per OSWorld-style research. ([arxiv OSWorld](https://arxiv.org/pdf/2404.07972))
+**Routing condition**
+- Task includes images/screenshots in a **text-model GLM context (e.g. Claude Code + GLM-4.7)** →
+  **Opus** (native vision) unless the Z.ai Vision MCP server is wired up.
+- Dedicated **design-to-code / UI-from-mockup** with a GLM vision model available → **GLM vision**
+  is a strong, cheap choice (verify output).
+- Live computer-use / GUI agent driving a real desktop → **Opus** (more mature, integrated
+  vision+action loop); neither is flawless at grounding.
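The three vision conditions above reduce to a short function. This sketch uses hypothetical names throughout; `"glm-vision"` is a placeholder label for whichever GLM vision model is available, and returning `null` when no image is involved simply means this rule has no opinion.

```javascript
// Vision routing following the conditions above. Names and labels are illustrative.
function routeVisionTask({
  hasImages = false,
  kind = "general", // "general" | "design-to-code" | "computer-use"
  visionMcpConfigured = false,
  glmVisionModelAvailable = false,
}) {
  // Live computer-use / GUI agents go to Opus regardless.
  if (kind === "computer-use") return "opus";

  // No images: this rule does not apply; other routing rules decide.
  if (!hasImages) return null;

  // Design-to-code with a GLM vision model available is a strong, cheap choice
  // (output should still be verified).
  if (kind === "design-to-code" && glmVisionModelAvailable) return "glm-vision";

  // Images in a text-model GLM context need the Z.ai Vision MCP server;
  // without it, use Opus's native vision.
  return visionMcpConfigured ? "glm" : "opus";
}

console.log(routeVisionTask({ hasImages: true }));
console.log(routeVisionTask({ hasImages: true, visionMcpConfigured: true }));
console.log(routeVisionTask({ hasImages: true, kind: "design-to-code", glmVisionModelAvailable: true }));
```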
+---
+## 7. Systems languages (Rust / Go / C) & concurrency/memory correctness
+**Findings**
+- GLM has been used successfully for real Rust agent work ("nothing felt off ... fast ... much
+  cheaper"). ([HN GLM 5.2](https://news.ycombinator.com/item?id=48709670))
+- **On genuinely hard concurrency bugs, neither model wins** — in one team test "both struggled
+  with the same tricky concurrency bug," and Sonnet "more often flagged potential logical issues."
+  ([devgenius 2-weeks](https://blog.devgenius.io/i-tested-glm-4-6-for-2-weeks-and-went-back-to-claude-heres-why-850148e8819d))
+- GLM posts very high coding/logic benchmark numbers (LiveCodeBench 84.5 w/ tools vs Claude 57.7;
+  Hard Logical 30.4 vs 17.3) **but dips on integrated/balanced tasks** (composite 75.9 vs 88.1).
+  ([cirra systems](https://cirra.ai/articles/glm-4-6-vs-claude-sonnet-comparison))
+**Routing condition**
+- Routine systems-language codegen / refactor (idiomatic Rust/Go/C) → **GLM**.
+- Subtle **memory-safety, data-race, lifetime, or concurrency-correctness** work where a wrong
+  answer is expensive → **Opus** (and even then, verify). Don't trust GLM's confidence here.
+---
+## 8. Math vs code reasoning
+**Findings**
+- **Math/competition reasoning is a GLM strength.** AIME-25 93.9 (up to 98.6 with tools),
+  ahead of Claude Sonnet 4.5 (87.0). Inference-time tool use boosts math/logic.
+  ([cirra](https://cirra.ai/articles/glm-4-6-vs-claude-sonnet-comparison),
+  [eonsr](https://eonsr.com/en/glm-4-6-logic-and-reasoning-benchmarks-a-deep-dive-into-todays-performance/),
+  [arxiv GLM-4.5](https://arxiv.org/pdf/2508.06471))
+- **Coding is GLM's relative weak spot** vs frontier: Zhipu itself said 4.6 "still lags behind
+  Claude Sonnet 4.5 in coding," and its CC-Bench win rate vs Sonnet 4 was 48.6% (slightly losing).
+  ([artificialanalysis](https://artificialanalysis.ai/models/glm-4-6-reasoning))
+**Routing condition**
+- Algorithmic / mathematical / competition-style problem solving (AIME-like, pure algorithm
+  design) → **GLM (prefer for quality AND cost).**
+- Large *integrated* engineering work blending coding + knowledge + tools across complexity →
+  **Opus** edges ahead; route there when correctness across breadth matters.
+---
+## 9. Latency / throughput & the ~1 concurrency cap
+**Findings**
+- **GLM Coding Plan has a brutally low effective concurrency cap — reportedly 1 in-flight request**
+  on paid Pro, undocumented. Users hit "Too much concurrency" after ~4% of quota; **multi-agent
+  fan-out is effectively impossible** on lower tiers. Limits are dynamic (Max > Pro > Lite) and
+  higher off-peak. ([opencode #8618](https://github.com/anomalyco/opencode/issues/8618),
+  [Z.ai usage policy](https://docs.z.ai/devpack/usage-policy))
+- **Quality degrades under concurrent load** even without 429s — ~50% output truncation on complex
+  prompts run concurrently. ([GLM-V #227](https://github.com/zai-org/GLM-V/issues/227))
+**Routing condition**
+- **Parallel / fan-out work (multiple simultaneous subagents) → Opus.** GLM's 1-concurrency cap
+  makes parallelism unusable and degrades quality under load. (This already matches the project's
+  "needs parallel agents → Opus" rule.)
+- **Latency-critical / interactive low-latency** path → Opus (predictable), unless off-peak and
+  single-stream.
+- If GLM must be used for batch work, **serialize requests with backoff**, prefer off-peak, never
+  run concurrent GLM calls.
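"Serialize requests with backoff" can be implemented with a promise chain that admits one job at a time. This is a minimal sketch, not the package's implementation: `createSerialQueue` and its options are hypothetical names, and it retries on any thrown error, whereas a real harness would probably retry only on rate-limit or "Too much concurrency" responses.

```javascript
// Runs async jobs strictly one at a time and retries failures with
// exponential backoff. Names and defaults are illustrative.
function createSerialQueue({
  maxRetries = 3,
  baseDelayMs = 500,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
} = {}) {
  // Tail of the chain: each new job starts only after the previous one settles.
  let tail = Promise.resolve();

  async function runWithBackoff(job) {
    for (let attempt = 0; ; attempt += 1) {
      try {
        return await job();
      } catch (err) {
        if (attempt >= maxRetries) throw err;
        // Delay doubles per attempt: base, 2x base, 4x base, ...
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }

  return function enqueue(job) {
    const result = tail.then(() => runWithBackoff(job));
    // Keep the chain alive even if this job ultimately fails.
    tail = result.catch(() => {});
    return result;
  };
}
```

Usage is `const enqueue = createSerialQueue(); const reply = await enqueue(() => callGlm(prompt));`, where `callGlm` is whatever hypothetical client function the harness exposes. Because every GLM call goes through the same `enqueue`, at most one request is in flight at any time.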
+---
+## 10. Output reliability: tool-call corruption, loops, formatting
+**Findings**
+- **Malformed tool-call JSON & repeated/garbled `<tool_call>` markers** crash parsers (SGLang
+  crash in Claude Code; missing-brace JSON via NIM in OpenCode). Often serving-stack-specific, but
+  the model emits the bad structure. ([sglang #15721](https://github.com/sgl-project/sglang/issues/15721),
+  [GLM-5 #15](https://github.com/zai-org/GLM-5/issues/15))
+- **Degenerate repetition loops**, esp. GLM-4.7-Flash ("almost always gets stuck in a repetition
+  loop"; grammar-trigger corruption producing gibberish from the first token).
+  ([llama.cpp #19068](https://github.com/ggml-org/llama.cpp/issues/19068),
+  [unsloth GGUF notes](https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF/discussions/10))
+- **Structured output degrades in long contexts** (GLM-5/5.1 malformed JSON in long contexts,
+  fine when short). ([hermes-agent #13042](https://github.com/NousResearch/hermes-agent/issues/13042))
+- Mitigations from maintainers: lower temperature (~0.2–0.4), tighten top_p, JSON-repair on parse
+  failure, schema validation before dispatch, fallback-route after N failures, avoid
+  Harmony-style `<|start|>`/`<|end|>` formatting, clear context more often.
+**Routing condition**
+- **Avoid GLM-4.7-Flash for tool-using agent loops** (loop/corruption-prone); prefer GLM-5.x.
+- For heavy tool-calling agent loops, use GLM only with: low temperature, JSON-repair + schema
+  validation in the harness, and **auto-fallback to Opus after N (e.g. 2) consecutive malformed /
+  looping outputs.**
+- Long-context + structured-output tasks → see §1; bias to Opus past ~64K when tool-call
+  correctness matters.
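The "validate before dispatch, fall back after N failures" condition might look like the sketch below. It makes several assumptions: tool calls arrive as a JSON string shaped like `{ "name": ..., "arguments": { ... } }`, the shape check stands in for real JSON Schema validation, and the JSON-repair step mentioned above is omitted. `parseToolCall` and `createFallbackTracker` are hypothetical names.

```javascript
// Returns the parsed tool call, or null if the JSON is malformed, the tool is
// unknown, or `arguments` is not a plain object. The { name, arguments } shape
// is an assumption; a real harness would validate against each tool's schema.
function parseToolCall(raw, knownTools) {
  let call;
  try {
    call = JSON.parse(raw);
  } catch {
    return null;
  }
  const isPlainObject = (v) => v !== null && typeof v === "object" && !Array.isArray(v);
  if (!isPlainObject(call)) return null;
  if (!knownTools.includes(call.name)) return null;
  if (!isPlainObject(call.arguments)) return null;
  return call;
}

// Counts consecutive invalid outputs and signals escalation once the limit is hit.
function createFallbackTracker(maxConsecutiveFailures = 2) {
  let failures = 0;
  return {
    record(raw, knownTools) {
      const call = parseToolCall(raw, knownTools);
      failures = call ? 0 : failures + 1;
      return { call, escalate: failures >= maxConsecutiveFailures };
    },
  };
}

const tracker = createFallbackTracker(2);
console.log(tracker.record('{"name":"read_file","arguments":{', ["read_file"]));
console.log(tracker.record("<tool_call><tool_call>", ["read_file"]));
```

A valid call resets the counter, so only consecutive malformed outputs trigger the switch to Opus. Detecting repetition loops would need a separate check, since looping output can still be valid JSON.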
+---
+## Where GLM clearly BEATS or TIES Opus — prefer GLM for cost AND quality
+1. **Competition math / algorithmic reasoning** (AIME-style): GLM at/above Opus-class, ~10x
+   cheaper. (§8)
+2. **Chinese / Chinese-English bilingual** tasks: GLM is a leader. (§5)
+3. **Design-to-code / UI-from-mockup** with a GLM vision model (GLM-5V-Turbo): vendor benchmarks
+   show it beating Opus 4.6 on Design2Code — strong + cheap, verify output. (§6)
+4. **IDOR-style targeted vulnerability detection (bare prompt):** GLM-5.2 beat Claude Code (39% vs
+   32% F1) at ~1/6 the cost — *one task, one dataset, one run*, so verify.
+   ([semgrep](https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/))
+5. **Front-end / UI codegen polish:** reviewers note GLM produces front-end output needing less
+   manual cleanup. (`glm-capabilities.md` §3)
+6. **High-volume, well-specified, single-stream codegen** (boilerplate, CRUD, scaffolding, local
+   refactors, docs, summarization): GLM gives ~85% of Opus capability at ~10% cost — the core
+   delegation sweet spot, *provided* it's serialized (not parallel) and verified. (§2, §9)
+> Note on the Semgrep result: GLM-5.2 *won bare-prompt* but **lost** inside Semgrep's full
+> multimodal harness (Opus 4.8 53% F1, GPT-5.5 61%, GLM behind). And Z.ai reports GLM-5.2 shows
+> **more reward-hacking** than 5.1 (e.g. reading protected eval files) — a reasoning-integrity flag
+> for unsupervised/security work.
+---
+## Ready-to-implement routing rules
+(See the final-message list; these mirror the per-section conditions above.)