glm-mcp-claude 1.0.0

@@ -0,0 +1,241 @@
1
+ # GLM Capability Map — What GLM Is Best At (vs Claude Opus/Sonnet)
2
+
3
+ > Scope: The GLM family from Zhipu AI / Z.ai used inside Claude Code as an alternative
4
+ > to Anthropic Opus/Sonnet. Covers GLM-4.5, GLM-4.6, the GLM Coding Plan, and the
5
+ > current flagship the user calls **"GLM 5.2"** (officially **GLM-5.2**, released
6
+ > 13 Jun 2026). Focus: **what GLM is best at** — a detailed capability map.
7
+ >
8
+ > Research date: 2026-06-30. Many "beats Opus/GPT" claims are **vendor-reported**
9
+ > and flagged inline. Benchmark numbers vary 2–5 pts across sources/providers.
10
+
11
+ ---
12
+
13
+ ## 0. TL;DR capability stance
14
+
15
+ - **GLM is a genuine frontier-adjacent coding model** at ~1/6 the cost of Opus. Its
16
+ sweet spots are **frontend/UI generation, well-specified routine coding, tool
17
+ calling/MCP, and repo-scale context work** (1M tokens on GLM-5.2).
18
+ - **Opus still clearly wins** on the hardest, longest, most open-ended work: large
19
+ multi-step refactors, subtle debugging, autonomous multi-hour agentic runs,
20
+ self-correction/replanning, and "design taste."
21
+ - The smart pattern is **routing**: GLM as the cheap default for the bulk of tasks,
22
+ Opus reserved for the expensive-if-wrong minority.
23
+
24
+ ---
25
+
26
+ ## 1. Model lineup & specs (context window, output, thinking mode)
27
+
28
+ | Model | Released | Arch | Context | Max output | Thinking mode | License |
29
+ |---|---|---|---|---|---|---|
30
+ | GLM-4.5 | 2025 | 355B MoE | 128K | 96K | Auto-determined CoT | Open (MIT) |
31
+ | GLM-4.6 | 30 Sep 2025 | 357B MoE | **200K** (~202,752) | **128K** (~131,072) | Auto-determined CoT | Open (MIT) |
32
+ | GLM-5 | 11 Feb 2026 | 744B MoE / 40B active | 200K | 131,072 | Auto | Open (MIT) |
33
+ | GLM-5.1 | Apr 2026 | 744B MoE / 40B active | 200K | 131,072 | Auto | Open (MIT) |
34
+ | **GLM-5.2** | **13 Jun 2026** | **~753B MoE** | **1,000,000 (1M)** | 131,072 | Auto | Open (MIT) |
35
+ | GLM-4.7 | (variant) | — | — | — | **Forced** thinking | Open |
36
+
37
+ Notes:
38
+ - **Thinking/reasoning mode**: GLM-4.5/4.6 and the GLM-5.x line *auto-determine*
39
+ whether to engage chain-of-thought (the `thinking` parameter defaults to enabled).
40
+ GLM-4.7 and GLM-4.5V use **forced** thinking. ([z.ai docs](https://docs.z.ai/guides/overview/concept-param))
41
+ - GLM-4.6 expanded context 128K → 200K and added tool use *during* inference.
42
+ ([HowAIWorks](https://howaiworks.ai/blog/glm-4-6-announcement), [CometAPI](https://www.cometapi.com/what-is-glm-4-6/))
43
+ - **GLM-5.2's 1M context is its headline feature** — described as a "stable 1M-token
44
+ window," explicitly aimed at repo-scale long-horizon coding. ([VentureBeat](https://venturebeat.com/technology/z-ais-open-weights-glm-5-2-beats-gpt-5-5-on-multiple-long-horizon-coding-benchmarks-for-1-6th-the-cost), [TheAIRankings](https://theairankings.com/zhipu/glm-5/))
45
+ - ⚠️ The user's "GLM 5.2" = GLM-5.2. If they are on an older GLM Coding Plan they may
46
+ actually be served **GLM-4.6** or **GLM-5/5.1** — capabilities differ a lot
47
+ between 4.6 and 5.2, so confirm which is actually wired into their Claude Code.
48
+
49
+ ---
50
+
51
+ ## 2. Coding ability overall — benchmarks vs Claude & GPT
52
+
53
+ ### GLM-4.6 era (the original "GLM coding plan" model)
54
+ | Benchmark | GLM-4.6 | Claude Sonnet 4.5 | Notes |
55
+ |---|---|---|---|
56
+ | LiveCodeBench v6 | **82.8%** | 70.1% | GLM **wins** (contamination-resistant). Up from GLM-4.5's 63.3%. |
57
+ | SWE-bench Verified | ~68.0% | **77.2%** | Claude **wins** (real GitHub issue fixing). |
58
+ | AIME-25 (math) | **93.9%** | 87.0% | GLM **wins**. |
59
+ | CC-Bench multi-turn | 48.6% win rate **vs Sonnet 4** | — | ⚠️ vs Sonnet **4**, not 4.5. ~5–7× cheaper. |
60
+
61
+ Sources: [Cirra](https://cirra.ai/articles/glm-4-6-vs-claude-sonnet-comparison),
62
+ [adam.holter.com](https://adam.holter.com/glm-4-6-vs-claude-sonnet-4-5-benchmarks-capabilities-and-cost-effectiveness/),
63
+ [IntuitionLabs](https://intuitionlabs.ai/articles/glm-4-6-open-source-coding-model).
64
+ Consensus: GLM-4.6 ≈ Claude Sonnet 4 level, **trails Sonnet 4.5** on real
65
+ repo-level work but **wins on isolated code-gen/algorithmic** benchmarks.
66
+
67
+ ### GLM-5 / 5.1 / 5.2 era (current)
68
+ | Benchmark | GLM-5.2 | GLM-5.1 | Claude Opus 4.8 | GPT-5.5 |
69
+ |---|---|---|---|---|
70
+ | SWE-bench Verified | — | 77.8% | ~80.8–81.4% (Opus 4.6) | 80.0% (GPT-5.2) |
71
+ | SWE-bench **Pro** | 62.1 | 58.4 | **69.2** | 58.6 |
72
+ | FrontierSWE (long-horizon) | 74.4% | — | **75.1%** | 72.6% |
73
+ | Terminal-Bench 2.1 | 81.0% | — | — | — |
74
+ | NL2Repo | 48.9 | — | **69.7** | — |
75
+ | SWE-Marathon | 13.0 | — | **26.0** | — |
76
+ | MCP-Atlas (tool use) | **77.0** | — | 75.3 | — |
77
+ | Agentic aggregate avg | **81** | — | 80.1 | — |
78
+
79
+ Sources: [VentureBeat](https://venturebeat.com/technology/z-ais-open-weights-glm-5-2-beats-gpt-5-5-on-multiple-long-horizon-coding-benchmarks-for-1-6th-the-cost),
80
+ [digitalapplied GLM-5.2](https://www.digitalapplied.com/blog/glm-5-2-benchmarks-open-weights-vs-claude-opus),
81
+ [MindStudio agentic](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows),
82
+ [Serenities GLM-5.1](https://serenitiesai.com/articles/glm-5-1-zhipu-coding-benchmark-claude-opus-comparison-2026).
83
+
84
+ **Reading the numbers:**
85
+ - GLM-5.2 is the **top open-weight model** per independent Artificial Analysis.
86
+ - It **ties/edges Opus on agentic & frontend aggregates and tool use (MCP-Atlas)**.
87
+ - Opus **pulls clearly ahead on the hardest long-horizon benchmarks**: SWE-bench Pro
88
+ (69.2 vs 62.1), NL2Repo (69.7 vs 48.9), SWE-Marathon (26.0 vs 13.0). The gap is
89
+ ~7 pts on Pro and 13–21 pts on SWE-Marathon and NL2Repo; only on FrontierSWE (74.4 vs 75.1) does it shrink to under 1 pt.
90
+ - ⚠️ Many headline "beats GPT-5.5 / near-Opus" figures are **Zhipu-reported** and
91
+ pending independent corroboration. Benchmark *names* matter: SWE-bench **Pro** ≠
92
+ SWE-bench **Verified** — don't cross-compare them.
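The gaps quoted in this list are plain subtraction over the table. A minimal sketch, assuming the GLM-5.2 and Opus 4.8 columns are taken at face value; the `SCORES` dict and both function names are illustrative and do not come from any of the cited sources:

```python
# Scores copied from the GLM-5 / 5.1 / 5.2 table: (GLM-5.2, Claude Opus 4.8).
SCORES = {
    "SWE-bench Pro": (62.1, 69.2),
    "FrontierSWE": (74.4, 75.1),
    "NL2Repo": (48.9, 69.7),
    "SWE-Marathon": (13.0, 26.0),
    "MCP-Atlas": (77.0, 75.3),
}


def opus_lead(benchmark: str) -> float:
    """Return Opus minus GLM in points; a negative value means GLM is ahead."""
    glm, opus = SCORES[benchmark]
    return round(opus - glm, 1)


def rank_by_gap() -> list:
    """List (benchmark, lead) pairs from the largest Opus lead to the largest GLM lead."""
    pairs = [(name, opus_lead(name)) for name in SCORES]
    return sorted(pairs, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, lead in rank_by_gap():
        print(f"{name:15s} {lead:+.1f}")
```

Sorting by lead makes the pattern visible: the widest Opus margins are on NL2Repo and SWE-Marathon, FrontierSWE is within a point, and MCP-Atlas is the one row here where GLM is ahead.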
93
+
94
+ ---
95
+
96
+ ## 3. Frontend (React / HTML / CSS / UI generation) — **GLM's standout strength**
97
+
98
+ - Z.ai explicitly tunes GLM for **"superior aesthetics and logical layout in
99
+ frontend code."** ([z.ai docs](https://docs.z.ai/guides/llm/glm-4.6))
100
+ - GLM-5.2 ranks **#2 on LMArena Code Arena Frontend** — above Opus 4.7 and Opus 4.8
101
+ in thinking mode (developer-judged), and **trails Opus 4.8 by under a point on FrontierSWE
102
+ (74.4 vs 75.1)**. That puts an MIT-licensed model ahead of closed flagships on frontend, as judged by
103
+ devs. ([MindStudio UI](https://www.mindstudio.ai/blog/glm-5-2-vs-claude-opus-4-8-ui-generation))
104
+ - Hands-on (GLM-4.6): a built payment-platform site was "polished, no visible
105
+ mistakes on first review… animations, vibrant colors… on par with Claude Sonnet 4."
106
+ ([KDnuggets](https://www.kdnuggets.com/vibe-coding-with-glm-46-coding-plan))
107
+
108
+ **Strengths:** high-volume UI scaffolding, dashboards, component libraries, structured
109
+ layouts — near-Opus quality at a fraction of cost.
110
+ **Weakness:** **design taste** — when creative visual judgment/interpretation matters,
111
+ Opus wins. GLM is the volume/cost pick; Opus is the taste pick. ([MindStudio UI](https://www.mindstudio.ai/blog/glm-5-2-vs-claude-opus-4-8-ui-generation))
112
+
113
+ ---
114
+
115
+ ## 4. Backend (APIs, databases, systems, refactors)
116
+
117
+ **Strengths:**
118
+ - **Routine, well-specified changes** are reliable: add a model field, update an API
119
+ endpoint, refactor a single function. ([MindStudio](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows))
120
+ - **Repo-scale context (GLM-5.2, 1M)** is "genuinely transformative" — dump a
121
+ 500-file monorepo subset, skip RAG/pruning, make a decision with full context.
122
+ - Python / JavaScript / Java are the explicitly optimized backend languages.
123
+
124
+ **Weaknesses:**
125
+ - **Large, multi-step refactors**: Opus "rarely loses the plan on a 30-step refactor,"
126
+ rarely hallucinates a function signature; GLM is less reliable here.
127
+ - Hardest backend benchmarks favor Opus (NL2Repo, SWE-Marathon, SWE-bench Pro).
128
+ - ⚠️ Caveat repeated across sources: **nobody has publicly run GLM-5.2 as an agent over
129
+ a real 200K-line repo and reported results** — validate long-horizon claims on your
130
+ own branch. ([MindStudio](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows))
131
+
132
+ ---
133
+
134
+ ## 5. Agentic / tool-use / long-horizon reliability
135
+
136
+ **Strengths:**
137
+ - **Tool calling is clean and schema-adherent**: GLM-4.6 "refuses unknown tools and
138
+ minimizes invented arguments," aiming for near-zero tool-call hallucination — less
139
+ cleanup of malformed output. ([Cirra tool calling](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis))
140
+ - GLM-4.5 reported a **90.6% tool-calling success rate**; GLM-5.2 scores **77.0 on MCP-Atlas
141
+ vs 75.3 for Opus**. Both were built with agents/MCP in mind.
142
+ - Excellent at **well-defined, explicitly-stepped agentic tasks**.
143
+
144
+ **Weaknesses (the real gap):**
145
+ - **Self-correction & replanning**: GLM executes defined sub-tasks well but "struggles
146
+ with the self-correcting behavior that makes truly agentic coding reliable." Opus
147
+ recognizes bad output and course-corrects without being told. ([MindStudio](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows))
148
+ - **Goal drift / "escapism"** in long debugging: a research trajectory on SWE-bench
149
+ Django #11149 showed GLM-4.6 wandering through irrelevant modules and dodging env
150
+ errors with non-representative scripts ("agent collapse"). ([GLM-5 paper](https://arxiv.org/pdf/2602.15763))
151
+ - **Long-horizon autonomy gap**: on a τ²-style agent test, GLM scored ~75.9% vs Claude's 88.1%;
152
+ Claude has also demonstrated 30+ hr continuous sessions that GLM has not matched (GLM-4.6 era).
153
+ - **GUI/computer control** is only rudimentary (deprioritized); Claude leads on
154
+ browser/desktop control. ([Cirra](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis))
155
+ - Inside Claude Code, **vision needs an extra MCP server** with GLM (Claude is native).
156
+ ([ruidiao](https://ruidiao.substack.com/p/two-days-with-glm-as-my-claude-code))
157
+
158
+ ---
159
+
160
+ ## 6. Languages — strongest / weakest
161
+
162
+ - **Strongest (explicitly optimized):** **Python, JavaScript, Java** — Z.ai documents
163
+ these by name, with frontend aesthetics emphasis. ([z.ai docs](https://docs.z.ai/guides/llm/glm-4.6))
164
+ - LiveCodeBench v6 (multi-language write/run/debug): GLM-4.6 **82.8%**.
165
+ - **Weakest:** ⚠️ no GLM-specific Rust/Go/C/C++ numbers published. Industry-wide
166
+ pattern (Multi-SWE-bench) is that all models score far higher on Python than Go,
167
+ Rust, C, C++ — treat GLM's systems-language output as **less reliable, verify more.**
168
+ ([Multi-SWE-bench](https://arxiv.org/pdf/2504.02605))
169
+
170
+ ## 7. Multilingual / non-English
171
+
172
+ - Chinese-origin model; strong natural-language multilingual. GLM-4.6 notes optimized
173
+ translation for French, Russian, Japanese, Korean and informal/role-play contexts.
174
+ - ⚠️ No specific evidence that *coding* quality differs by the developer's natural
175
+ language; Chinese-language tasks are likely a relative strength but unbenchmarked here.
176
+
177
+ ## 8. Where GLM matches Opus vs clearly falls short
178
+
179
+ **Matches / beats Opus:**
180
+ - Frontend/UI (Code Arena Frontend #2; ties FrontierSWE), isolated code-gen
181
+ (LiveCodeBench), math (AIME), tool-call hygiene (MCP-Atlas), agentic *aggregate*,
182
+ cost (≈1/6), context size (1M), open weights / data residency / self-host.
183
+
184
+ **Falls clearly short of Opus:**
185
+ - Large multi-step refactors, subtle/long debugging (goal drift), open-ended
186
+ autonomous planning & self-correction, longest agentic benchmarks (SWE-bench Pro,
187
+ NL2Repo, SWE-Marathon), GUI/computer control, design taste, native vision in
188
+ Claude Code, and raw track-record/reliability on high-stakes codebases.
189
+
190
+ ---
191
+
192
+ ## 9. Capability matrix — **use GLM for X / use Opus for Y**
193
+
194
+ | Task type | Use GLM | Use Opus | Why |
195
+ |---|---|---|---|
196
+ | Boilerplate / scaffolding | ✅ **GLM** | | Cheap, fast, reliable on well-specified output |
197
+ | Simple CRUD / single-endpoint APIs | ✅ **GLM** | | Well-defined = GLM's strength |
198
+ | Frontend UI / dashboards / components | ✅ **GLM** | (taste-critical → Opus) | Frontend is GLM's standout; Opus only for design taste |
199
+ | Routine refactor (one function/field) | ✅ **GLM** | | Defined, local scope |
200
+ | Large multi-file / 30-step refactor | | ✅ **Opus** | Opus holds the plan; GLM drifts |
201
+ | Repo-scale read/analysis (huge codebase) | ✅ **GLM-5.2 (1M ctx)** | | 1M context = no RAG needed |
202
+ | Subtle / long debugging | | ✅ **Opus** | GLM goal-drift & "escapism" |
203
+ | Complex architecture / build-from-spec | | ✅ **Opus** | Open-ended planning + self-correction |
204
+ | Tool calling / MCP-heavy workflows | ✅ **GLM** | | Clean schema adherence; MCP-Atlas > Opus |
205
+ | Long-horizon autonomous agent (hours) | | ✅ **Opus** | GLM hasn't matched sustained autonomy |
206
+ | GUI / browser / desktop control | | ✅ **Opus** | GLM rudimentary |
207
+ | Security-sensitive code | | ✅ **Opus** | Reliability/track record; verify GLM closely |
208
+ | Systems langs (Rust/Go/C/C++) | (verify) | ✅ **Opus** | No GLM-specific numbers; all models score lower here than on Python |
209
+ | Math / algorithmic codegen | ✅ **GLM** | | AIME 93.9%, LiveCodeBench 82.8% |
210
+ | Research / summarization over big docs | ✅ **GLM-5.2** | | 1M context + cheap |
211
+ | Vision tasks in Claude Code | | ✅ **Opus** | GLM needs extra MCP server |
212
+ | High-volume / cost-constrained anything | ✅ **GLM** | | ≈1/6 the price; route the cheap 80% here |
213
+
214
+ **Cost anchor:** GLM-5.2 ≈ $1.40 in / $4.40 out per M tokens (~$5.80 combined) vs
215
+ Opus 4.8 $5 / $25 (~$30) and GPT-5.5 $5 / $30 (~$35). GLM Coding Plan starts ~$3/mo.
216
+ ([MindStudio agentic](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows),
217
+ [KDnuggets](https://www.kdnuggets.com/vibe-coding-with-glm-46-coding-plan))
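To make the cost anchor concrete, here is a small sketch of the arithmetic. It assumes the per-million-token prices quoted above and a simple split where a fixed share of token volume goes to GLM and the rest to Opus; the `Price` class, the model keys, and both function names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per million tokens."""
    input_per_m: float
    output_per_m: float


# Prices copied from the cost anchor; the dictionary keys are illustrative labels.
PRICES = {
    "glm-5.2": Price(1.40, 4.40),
    "opus-4.8": Price(5.00, 25.00),
    "gpt-5.5": Price(5.00, 30.00),
}


def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one job on one model."""
    price = PRICES[model]
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


def blended_cost(glm_share: float, input_tokens: int, output_tokens: int) -> float:
    """Cost when `glm_share` of the volume runs on GLM-5.2 and the rest on Opus 4.8."""
    if not 0.0 <= glm_share <= 1.0:
        raise ValueError("glm_share must be between 0 and 1")
    glm = job_cost("glm-5.2", input_tokens, output_tokens)
    opus = job_cost("opus-4.8", input_tokens, output_tokens)
    return glm_share * glm + (1.0 - glm_share) * opus


if __name__ == "__main__":
    print(job_cost("opus-4.8", 1_000_000, 1_000_000))
    print(blended_cost(0.8, 1_000_000, 1_000_000))
```

For 1M input plus 1M output tokens, all-Opus costs $30 under these prices, while sending 80% of the volume to GLM brings the total to roughly $10.64.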
218
+
219
+ **Recommended architecture:** route by complexity — GLM as default for routine/
220
+ high-volume/frontend/repo-scale, escalate to Opus for expensive-if-wrong work
221
+ (large refactors, subtle bugs, security, long autonomous runs).
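One way to encode the matrix as a lookup is sketched below. The task-type labels are hypothetical names for the matrix rows, and defaulting unknown task types to GLM is an assumption drawn from the "GLM as default" stance rather than something the sources specify:

```python
from enum import Enum


class Model(str, Enum):
    GLM = "glm"
    OPUS = "opus"


# Rows the capability matrix assigns to Opus. Labels are illustrative only.
OPUS_TASKS = frozenset({
    "large_refactor",
    "subtle_debugging",
    "architecture",
    "long_autonomous_agent",
    "gui_control",
    "security_sensitive",
    "systems_language",
    "vision",
})


def route_task(task_type: str, taste_critical: bool = False) -> Model:
    """Pick a model for a task type; anything not listed falls through to GLM."""
    if task_type in OPUS_TASKS:
        return Model.OPUS
    # Frontend work goes to GLM unless design taste is the deciding factor.
    if task_type == "frontend_ui" and taste_critical:
        return Model.OPUS
    return Model.GLM


if __name__ == "__main__":
    for task in ("crud", "frontend_ui", "large_refactor"):
        print(task, "->", route_task(task).value)
```

Keeping the escalation list explicit means every Opus call traces back to a named row of the matrix, which makes the routing easy to audit when costs are reviewed.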
222
+
223
+ ---
224
+
225
+ ## Sources
226
+ - Cirra — [GLM-4.6 vs Sonnet](https://cirra.ai/articles/glm-4-6-vs-claude-sonnet-comparison), [tool calling/MCP](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis)
227
+ - [adam.holter.com — GLM-4.6 vs Sonnet 4.5](https://adam.holter.com/glm-4-6-vs-claude-sonnet-4-5-benchmarks-capabilities-and-cost-effectiveness/)
228
+ - [IntuitionLabs — GLM-4.6 open-source coding](https://intuitionlabs.ai/articles/glm-4-6-open-source-coding-model)
229
+ - Z.AI docs — [params/thinking](https://docs.z.ai/guides/overview/concept-param), [GLM-4.6](https://docs.z.ai/guides/llm/glm-4.6), [GLM-5.1](https://docs.z.ai/guides/llm/glm-5.1)
230
+ - [OpenRouter — GLM-4.6](https://openrouter.ai/z-ai/glm-4.6) · [CometAPI](https://www.cometapi.com/what-is-glm-4-6/) · [HowAIWorks](https://howaiworks.ai/blog/glm-4-6-announcement)
231
+ - [VentureBeat — GLM-5.2](https://venturebeat.com/technology/z-ais-open-weights-glm-5-2-beats-gpt-5-5-on-multiple-long-horizon-coding-benchmarks-for-1-6th-the-cost)
232
+ - MindStudio — [UI generation](https://www.mindstudio.ai/blog/glm-5-2-vs-claude-opus-4-8-ui-generation), [agentic workflows](https://www.mindstudio.ai/blog/glm-5-2-vs-gpt-5-5-vs-claude-opus-agentic-workflows), [GLM-5.2 in Claude Code](https://www.mindstudio.ai/blog/how-to-use-glm-5-2-in-claude-code)
233
+ - [digitalapplied — GLM-5.2 benchmarks](https://www.digitalapplied.com/blog/glm-5-2-benchmarks-open-weights-vs-claude-opus)
234
+ - [Serenities — GLM-5.1 vs Opus](https://serenitiesai.com/articles/glm-5-1-zhipu-coding-benchmark-claude-opus-comparison-2026) · [TheAIRankings — GLM-5.2](https://theairankings.com/zhipu/glm-5/)
235
+ - [KDnuggets — GLM-4.6 coding plan](https://www.kdnuggets.com/vibe-coding-with-glm-46-coding-plan) · [ruidiao — GLM as Claude Code backend](https://ruidiao.substack.com/p/two-days-with-glm-as-my-claude-code)
236
+ - [GLM-5 paper (arXiv)](https://arxiv.org/pdf/2602.15763) · [Multi-SWE-bench (arXiv)](https://arxiv.org/pdf/2504.02605)
237
+
238
+ > ⚠️ Uncertainty flags: GLM-5.2 "beats Opus/GPT" claims largely vendor-reported;
239
+ > SWE-bench Pro ≠ Verified; GLM-5.2 long-horizon real-repo behavior not independently
240
+ > stress-tested; no GLM-specific Rust/Go numbers; exact context/output figures vary by
241
+ > provider. Confirm which GLM version the user's Coding Plan actually serves.
@@ -0,0 +1,287 @@
1
+ # GLM Failure Modes & GLM-vs-Opus Routing Rules
2
+
3
+ **Purpose:** Turn known GLM (Zhipu/Z.ai) failure modes and special conditions into concrete,
4
+ implementable routing conditions for a GLM-vs-Opus delegation router. GLM is ~10x cheaper than
5
+ Anthropic Claude Opus, so the default bias is "delegate to GLM unless a condition below fires."
6
+
7
+ **Date compiled:** 2026-06-30
8
+ **Models in scope:** GLM-4.6, GLM-4.7, GLM-4.7-Flash, GLM-5.1, GLM-5.2 (current Z.ai coding-plan
9
+ default), GLM-5V-Turbo / GLM-4.6V (vision). Opus reference points: Claude Opus 4.6 / 4.8.
10
+
11
+ > **Evidence-quality caveat (read first):** Much of the public material is vendor marketing
12
+ > (Z.ai blogs, reseller blogs) or single-run anecdotes. Independent, rigorous, GLM-specific
13
+ > benchmarks are scarce. Where a claim is vendor-reported or anecdotal it is flagged. Benchmark
14
+ > variance across runs is high — one reviewer warns a single run "isn't enough to assert
15
+ > anything about absolute model quality." Treat the thresholds below as conservative defaults,
16
+ > not measured cliffs.
17
+
18
+ ---
19
+
20
+ ## 1. Long context: advertised vs usable
21
+
22
+ **Findings**
23
+ - **The "1M context" is mostly a `glm-5.2[1m]` feature, not a 4.x one.** GLM-4.7/4.6 top out
24
+ architecturally at **~200K tokens** (`max_position_embeddings = 202752`), with a soft
25
+ `model_max_length = 128000` in the tokenizer config and a 128K *output* cap. Practical usable
26
+ input is closer to 200K than 202,752 once system-prompt/special-token overhead is subtracted.
27
+ ([HF discussion](https://huggingface.co/zai-org/GLM-4.7/discussions/33),
28
+ [automatio.ai](https://automatio.ai/models/glm-4-7),
29
+ [macaron.im](https://macaron.im/blog/what-is-glm-4-7))
30
+ - **General long-context degradation is real and starts well before the advertised limit.** The
31
+ industry pattern (not GLM-specific) is that "agents with context length up to 1 million tokens
32
+ show severe degradation already at 100K tokens," and "even with 200K tokens severe performance
33
+ degradation is observed." ([arxiv 2512.02445](https://arxiv.org/pdf/2512.02445))
34
+ - No published GLM-specific RULER / needle-in-haystack degradation curve surfaced — **this is an
35
+ uncertainty.** The historical GLM-4 did retain retrieval past 64K better than some peers
36
+ ([arxiv 2411.10137](https://arxiv.org/pdf/2411.10137)), but that does not transfer cleanly to
37
+ 4.6/4.7/5.2.
38
+ - Structured-output quality (tool-call JSON) specifically degrades in long contexts for GLM-5/5.1
39
+ even when short-context calls are clean (see §10).
40
+
41
+ **Routing condition**
42
+ - Input < ~64K tokens → **GLM** (safe zone).
43
+ - ~64K–128K tokens → **GLM, but only for retrieval/summarization-style tasks**; for
44
+ correctness-critical reasoning over the whole context, prefer Opus.
45
+ - 128K–200K tokens → switch GLM to **`glm-5.2[1m]`** if the task must stay on GLM; otherwise
46
+ **Opus**. Avoid 4.6/4.7 here.
47
+ - \> 200K tokens → **Opus**, or `glm-5.2[1m]` only if cost dominates and the task is
48
+ retrieval/extraction (not multi-hop reasoning) — and verify output.
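The four bands above could be sketched as a single function. The exact cutoffs are assumptions (the text gives them as approximate), and the function name, task labels, and returned strings are hypothetical:

```python
# Task labels are illustrative; the bands follow the routing condition above.
MID_BAND_GLM_TASKS = frozenset({"retrieval", "summarization"})
HUGE_BAND_GLM_TASKS = frozenset({"retrieval", "extraction"})


def route_by_context(
    input_tokens: int,
    task: str = "reasoning",
    must_stay_on_glm: bool = False,
    cost_dominates: bool = False,
) -> str:
    """Return a model label for a request of the given input size."""
    if input_tokens < 64_000:
        return "glm"  # safe zone
    if input_tokens < 128_000:
        # GLM only for retrieval/summarization-style work in this band.
        return "glm" if task in MID_BAND_GLM_TASKS else "opus"
    if input_tokens <= 200_000:
        # Avoid 4.6/4.7 here; use the 1M variant only if GLM is mandatory.
        return "glm-5.2[1m]" if must_stay_on_glm else "opus"
    # Above 200K: GLM only when cost dominates and the task is retrieval/extraction.
    if cost_dominates and task in HUGE_BAND_GLM_TASKS:
        return "glm-5.2[1m]"  # caller should still verify the output
    return "opus"


if __name__ == "__main__":
    print(route_by_context(100_000, task="summarization"))
    print(route_by_context(500_000, task="extraction", cost_dominates=True))
```

Token counts would come from whatever tokenizer the harness already uses; since the thresholds are conservative defaults rather than measured cliffs, they are worth keeping configurable.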
49
+
50
+ ---
51
+
52
+ ## 2. Long-horizon autonomy / goal drift
53
+
54
+ **Findings**
55
+ - **GLM-4.6 measurably drifts on long-horizon tasks.** A study of execution trajectories on a
56
+ SWE-bench Django permission bug found baseline GLM-4.6 "suffers from goal drift, wandering
57
+ through irrelevant modules for multiple turns," plus "escapism" (ignoring env-config errors to
58
+ fall back on simplistic non-representative scripts). ([arxiv 2602.02619](https://arxiv.org/pdf/2602.02619))
59
+ - On the τ² agent benchmark GLM trailed Claude (75.9% vs 88.1%). Claude has demonstrated 30+ hour
60
+ continuous sessions; Opus 4.6 holds the longest *published* autonomous horizon (50% completion at
61
+ ~14.5h). ([creolestudios](https://www.creolestudios.com/glm-5-vs-claude-opus-4-6-performance-pricing-agentic-coding-comparison/),
62
+ [mindstudio](https://www.mindstudio.ai/blog/best-open-source-llms-agentic-coding-2026))
63
+ - **Z.ai has explicitly engineered against drift in newer models:** GLM-4.7 "Preserved Thinking"
64
+ (retains reasoning across turns), and GLM-5.1 claims up to **8-hour** autonomous loops with
65
+ "stronger sustained execution." These are *vendor claims*; Claude still holds the longest
66
+ published horizon. ([adam.holter.com](https://adam.holter.com/glm-4-7-z-ais-open-weights-coding-model-pushes-harder-on-agents-tools-and-ui/),
67
+ [docs.z.ai GLM-5.1](https://docs.z.ai/guides/llm/glm-5.1))
68
+ - Recurring framing: GLM is good at "do the steps"; Opus is safer when the job is "be correct
69
+ across complexity" (audits, migrations, repo-wide changes).
70
+
71
+ **Routing condition**
72
+ - ≤ ~8 sequential tool-using steps / a single well-scoped feature → **GLM**.
73
+ - ~8–20 steps with checkpoints and a clear spec → **GLM (5.1+ preferred)**, but require
74
+ verification at the end.
75
+ - \> ~20 steps, OR unsupervised multi-hour autonomy, OR success depends on holding the original
76
+ goal across many turns (migration, repo-wide refactor) → **Opus**.
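A sketch of the step-count rule follows. The text does not say what to do with an 8–20 step task that lacks checkpoints or a clear spec, so sending it to Opus is an assumption made here; the `Decision` type and all parameter names are hypothetical:

```python
from typing import NamedTuple


class Decision(NamedTuple):
    model: str
    verify_at_end: bool


def route_by_horizon(
    steps: int,
    checkpoints: bool = False,
    clear_spec: bool = False,
    multi_hour_unsupervised: bool = False,
    must_hold_goal: bool = False,
) -> Decision:
    """Route on the expected number of sequential tool-using steps."""
    # Any of these conditions alone sends the task to Opus.
    if steps > 20 or multi_hour_unsupervised or must_hold_goal:
        return Decision("opus", False)
    if steps <= 8:
        return Decision("glm", False)
    # 8-20 steps: GLM (5.1+ preferred) only with checkpoints and a clear spec.
    if checkpoints and clear_spec:
        return Decision("glm", True)
    # Assumption: without those guard rails, treat the mid band as Opus work.
    return Decision("opus", False)


if __name__ == "__main__":
    print(route_by_horizon(5))
    print(route_by_horizon(12, checkpoints=True, clear_spec=True))
    print(route_by_horizon(30))
```

Estimating `steps` up front is itself a judgment call; a planner's step list or a count from similar past tasks would be a reasonable proxy.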
77
+
78
+ ---
79
+
80
+ ## 3. Hallucination on obscure / newer APIs and libraries
81
+
82
+ **Findings**
83
+ - GLM-4.6's *tool-calling* is relatively disciplined: it "will refuse unknown tools and minimize
84
+ invented arguments" and adheres tightly to provided schemas.
85
+ ([cirra.ai tool calling](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis))
86
+ - **But like all LLMs it still hallucinates factual/library details**, and GLM-4.6 specifically was
87
+ noted to sometimes "fix the immediate error but break something else." Schema-field hallucination
88
+ was reportedly *worse* in 4.6 and improved in 4.7. ([cirra](https://cirra.ai/articles/glm-4-6-tool-calling-mcp-analysis),
89
+ [macaron 4.7](https://macaron.im/blog/what-is-glm-4-7))
90
+ - Concrete cross-model warning: a hallucinated API tested against a hallucinated implementation
91
+ produced 34 green tests that proved nothing (this example was Opus, but it illustrates the
92
+ "confidently wrong with green tests" risk that applies doubly to a cheaper model on niche APIs).
93
+ ([akitaonrails](https://akitaonrails.com/en/2026/04/18/llm-benchmarks-part-2-multi-model/))
94
+ - GLM's training is Chinese-English heavy and its knowledge cutoff/coverage of brand-new or niche
95
+ Western libraries is uncertain — **higher hallucination risk on obscure/post-cutoff APIs.**
96
+
97
+ **Routing condition**
98
+ - Mainstream, well-documented APIs/frameworks → **GLM**.
99
+ - Niche / proprietary / very new (post-cutoff) library, or an internal/private API GLM can't have
100
+ seen → **paste the authoritative docs into the GLM prompt** (GLM can't fetch them). If docs can't
101
+ be supplied, or correctness is critical → **Opus**.
102
+ - Any task where "confidently wrong with passing tests" is high-cost → Opus, or GLM + independent
103
+ verification of the actual API surface.
104
+
105
+ ---
106
+
107
+ ## 4. Refusals / over-rigidity
108
+
109
+ **Findings**
110
+ - **No GLM-specific over-refusal benchmark surfaced.** General LLM literature documents
111
+ over-refusal (benign prompts rejected for surface keywords, e.g. "how to kill a python
112
+ process"), but nothing quantifies GLM-4.6/4.7 false-refusal rates.
113
+ ([arxiv ORFUZZ 2508.11222](https://arxiv.org/pdf/2508.11222),
114
+ [XSTest/OR-Bench context](https://arxiv.org/pdf/2510.10390))
115
+ - Anecdotally GLM-4.6 is described as having "simpler guardrails" and being faster partly because
116
+ of that — suggesting it refuses *less*, not more, than heavily-aligned models.
117
+ ([cirra cost analysis](https://cirra.ai/articles/glm-4-6-vs-claude-sonnet-comparison))
118
+ - **This is an uncertainty / low-signal area.** Treat refusals as a retry-then-escalate event
119
+ rather than a pre-routing condition.
120
+
121
+ **Routing condition**
122
+ - Do **not** pre-route based on refusal risk (insufficient evidence).
123
+ - Operational rule: if GLM refuses a benign task, **retry once** with clarified intent; if it still
124
+ refuses → **escalate to Opus**.
125
+ - Genuinely sensitive/dual-use security content: route to **Opus** for policy reasons regardless
126
+ (already covered by the "security-sensitive → Opus" project rule).
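The retry-then-escalate rule can be sketched as a small wrapper. All four callables are supplied by the caller and are hypothetical: `call_glm` and `call_opus` stand in for whatever client the harness uses, and `is_refusal` is whatever refusal detector is available:

```python
from typing import Callable, Tuple


def run_with_refusal_escalation(
    task: str,
    call_glm: Callable[[str], str],
    call_opus: Callable[[str], str],
    is_refusal: Callable[[str], bool],
    clarify: Callable[[str], str],
) -> Tuple[str, str]:
    """Try GLM, retry once with clarified intent, then escalate to Opus.

    Returns (model_used, output).
    """
    first = call_glm(task)
    if not is_refusal(first):
        return "glm", first

    # Exactly one retry, with the intent spelled out.
    second = call_glm(clarify(task))
    if not is_refusal(second):
        return "glm", second

    return "opus", call_opus(task)


if __name__ == "__main__":
    model, output = run_with_refusal_escalation(
        "how to kill a python process",
        call_glm=lambda prompt: "I can't help with that.",
        call_opus=lambda prompt: "Use os.kill or the kill command.",
        is_refusal=lambda text: text.startswith("I can't"),
        clarify=lambda prompt: prompt + " (benign: stopping my own stuck script)",
    )
    print(model, "->", output)
```

Capping the loop at a single retry keeps the extra GLM spend bounded, which matters because refusals are treated here as a rare runtime event rather than something to route around in advance.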
127
+
128
+ ---
129
+
130
+ ## 5. Non-English / multilingual
131
+
132
+ **Findings**
133
+ - **GLM's clearest strength: native Chinese + Chinese-English bilingual.** Built Chinese-first;
134
+ handles code-switching, mixed-language prompts, and translation with fewer hallucination
135
+ artifacts than Western-centric models. Widely called a multilingual leader, esp. for
136
+ Chinese/APAC. ([avenchat](https://avenchat.com/blog/glm-5.2-review),
137
+ [mindstudio GLM-5.2](https://www.mindstudio.ai/blog/what-is-glm-5-2-open-weight-model-2))
138
+ - **Caveat:** for English/European tasks requiring deep cultural nuance, the Chinese-heavy corpus
139
+ may be a slight disadvantage vs the best Western models — vendor sources recommend testing per
140
+ use case. ([mindstudio](https://www.mindstudio.ai/blog/what-is-glm-5-2-open-weight-model-2))
141
+
142
+ **Routing condition**
143
+ - Chinese-language or Chinese-English bilingual task → **GLM (prefer for quality AND cost).**
144
+ - General English coding/text → GLM is fine (near-Opus on coding benchmarks).
145
+ - High-stakes English/European *cultural-nuance* copy (marketing, legal tone, brand voice) → lean
146
+ **Opus** when quality matters more than cost.
147
+
148
+ ---
149
+
150
+ ## 6. Vision / image / screenshot / GUI / computer-use
151
+
152
+ **Findings**
153
+ - **The base coding models (GLM-4.6/4.7) are text models; vision lives in separate models**
154
+ (GLM-5V-Turbo, GLM-4.6V). In Claude Code against GLM-4.7, **pasting images is unreliable** — the
155
+ client transcodes and bypasses the vision path, producing "weird" output. Fix is Z.ai's Vision
156
+ MCP server. ([devgenius vision MCP](https://blog.devgenius.io/fixing-glm-4-7-image-parsing-in-claude-code-add-the-z-ai-vision-mcp-server-f1c275d7cf3f))
157
+ - Z.ai's dedicated vision models are strong on **design-to-code / GUI** and claim wins over Opus
158
+ (e.g. Design2Code 94.8 vs Opus 4.6 77.3) — **vendor-reported.**
159
+ ([agentnativedev](https://agentnativedev.medium.com/glm-5v-turbo-beats-opus-4-6-on-multimodal-benchmarks-f6376822eb32),
160
+ [wavespeed](https://wavespeed.ai/blog/posts/glm-5v-turbo-vs-gpt-4o-vision-ui-coding/))
161
+ - Both Claude and GLM have documented GUI *grounding* weaknesses (misreading cells, double-click
162
+ semantics) per OSWorld-style research. ([arxiv OSWorld](https://arxiv.org/pdf/2404.07972))
163
+
164
+ **Routing condition**
165
+ - Task includes images/screenshots in a **text-model GLM context (e.g. Claude Code + GLM-4.7)** →
166
+ **Opus** (native vision) unless the Z.ai Vision MCP server is wired up.
167
+ - Dedicated **design-to-code / UI-from-mockup** with a GLM vision model available → **GLM vision**
168
+ is a strong, cheap choice (verify output).
169
+ - Live computer-use / GUI agent driving a real desktop → **Opus** (more mature, integrated
170
+ vision+action loop); neither is flawless at grounding.
171
+
172
+ ---
173
+
174
+ ## 7. Systems languages (Rust / Go / C) & concurrency/memory correctness
175
+
176
+ **Findings**
177
+ - GLM has been used successfully for real Rust agent work ("nothing felt off ... fast ... much
178
+ cheaper"). ([HN GLM 5.2](https://news.ycombinator.com/item?id=48709670))
179
+ - **On genuinely hard concurrency bugs, neither model wins** — in one team test "both struggled
180
+ with the same tricky concurrency bug," and Sonnet "more often flagged potential logical issues."
181
+ ([devgenius 2-weeks](https://blog.devgenius.io/i-tested-glm-4-6-for-2-weeks-and-went-back-to-claude-heres-why-850148e8819d))
182
+ - GLM posts very high coding/logic benchmark numbers (LiveCodeBench 84.5 w/ tools vs Claude 57.7;
183
+ Hard Logical 30.4 vs 17.3) **but dips on integrated/balanced tasks** (composite 75.9 vs 88.1).
184
+ ([cirra systems](https://cirra.ai/articles/glm-4-6-vs-claude-sonnet-comparison))
185
+
186
+ **Routing condition**
187
+ - Routine systems-language codegen / refactor (idiomatic Rust/Go/C) → **GLM**.
188
+ - Subtle **memory-safety, data-race, lifetime, or concurrency-correctness** work where a wrong
189
+ answer is expensive → **Opus** (and even then, verify). Don't trust GLM's confidence here.
190
+
191
+ ---
192
+
193
+ ## 8. Math vs code reasoning
194
+
195
+ **Findings**
196
+ - **Math/competition reasoning is a GLM strength.** AIME-25 93.9 (up to 98.6 with tools),
197
+ ahead of Claude Sonnet 4.5 (87.0). Inference-time tool use boosts math/logic.
198
+ ([cirra](https://cirra.ai/articles/glm-4-6-vs-claude-sonnet-comparison),
199
+ [eonsr](https://eonsr.com/en/glm-4-6-logic-and-reasoning-benchmarks-a-deep-dive-into-todays-performance/),
200
+ [arxiv GLM-4.5](https://arxiv.org/pdf/2508.06471))
201
+ - **Coding is GLM's relative weak spot** vs frontier — Zhipu itself said 4.6 "still lags behind
202
+ Claude Sonnet 4.5 in coding," and its CC-Bench win rate vs the older Sonnet 4 was 48.6% (near parity).
203
+ ([artificialanalysis](https://artificialanalysis.ai/models/glm-4-6-reasoning))
204
+
205
+ **Routing condition**
206
+ - Algorithmic / mathematical / competition-style problem solving (AIME-like, pure algorithm
207
+ design) → **GLM (prefer for quality AND cost).**
208
+ - Large *integrated* engineering work blending coding + knowledge + tools across complexity →
209
+ **Opus** edges ahead; route there when correctness across breadth matters.
210
+
211
+ ---
212
+
213
+ ## 9. Latency / throughput & the ~1 concurrency cap
214
+
215
+ **Findings**
216
+ - **GLM Coding Plan has a brutally low effective concurrency cap — reportedly 1 in-flight request**
217
+ on paid Pro, undocumented. Users hit "Too much concurrency" after ~4% of quota; **multi-agent
218
+ fan-out is effectively impossible** on lower tiers. Limits are dynamic (Max > Pro > Lite) and
219
+ higher off-peak. ([opencode #8618](https://github.com/anomalyco/opencode/issues/8618),
220
+ [Z.ai usage policy](https://docs.z.ai/devpack/usage-policy))
221
+ - **Quality degrades under concurrent load** even without 429s — ~50% output truncation on complex
222
+ prompts run concurrently. ([GLM-V #227](https://github.com/zai-org/GLM-V/issues/227))
223
+
224
+ **Routing condition**
225
+ - **Parallel / fan-out work (multiple simultaneous subagents) → Opus.** GLM's 1-concurrency cap
226
+ makes parallelism unusable and degrades quality under load. (This already matches the project's
227
+ "needs parallel agents → Opus" rule.)
228
+ - **Latency-critical / interactive low-latency** path → Opus (predictable), unless off-peak and
229
+ single-stream.
230
+ - If GLM must be used for batch work, **serialize requests with backoff**, prefer off-peak, never
231
+ run concurrent GLM calls.
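"Serialize requests with backoff" might look like the sketch below: a lock keeps at most one request in flight, and a rejected request is retried with exponentially growing delays. The `TooMuchConcurrency` exception and the `SerializedGLM` class are hypothetical; a real client would map its own rate-limit error onto the exception:

```python
import threading
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TooMuchConcurrency(Exception):
    """Hypothetical error the caller raises on a concurrency or 429 rejection."""


class SerializedGLM:
    """Allow one in-flight request at a time and back off on rejection."""

    def __init__(
        self,
        max_retries: int = 4,
        base_delay: float = 1.0,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_retries < 0:
            raise ValueError("max_retries must be >= 0")
        self._lock = threading.Lock()
        self._max_retries = max_retries
        self._base_delay = base_delay
        self._sleep = sleep  # injectable so tests do not actually wait

    def call(self, request: Callable[[], T]) -> T:
        # The lock is held across retries so no other thread can start a request.
        with self._lock:
            attempt = 0
            while True:
                try:
                    return request()
                except TooMuchConcurrency:
                    if attempt >= self._max_retries:
                        raise
                    self._sleep(self._base_delay * (2 ** attempt))
                    attempt += 1


if __name__ == "__main__":
    client = SerializedGLM(max_retries=2, base_delay=0.01)
    print(client.call(lambda: "response"))
```

Holding the lock during backoff is deliberate: releasing it would let another caller jump in and trigger the same rejection. Adding random jitter to the delay would be a reasonable extension if several processes share one plan.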
232
+
233
+ ---
234
+
235
+ ## 10. Output reliability: tool-call corruption, loops, formatting
236
+
237
+ **Findings**
238
+ - **Malformed tool-call JSON & repeated/garbled `<tool_call>` markers** crash parsers (SGLang
239
+ crash in Claude Code; missing-brace JSON via NIM in OpenCode). Often serving-stack-specific, but
240
+ the model emits the bad structure. ([sglang #15721](https://github.com/sgl-project/sglang/issues/15721),
241
+ [GLM-5 #15](https://github.com/zai-org/GLM-5/issues/15))
242
+ - **Degenerate repetition loops**, esp. GLM-4.7-Flash ("almost always gets stuck in a repetition
243
+ loop"; grammar-trigger corruption producing gibberish from the first token).
244
+ ([llama.cpp #19068](https://github.com/ggml-org/llama.cpp/issues/19068),
245
+ [unsloth GGUF notes](https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF/discussions/10))
246
+ - **Structured output degrades in long contexts** (GLM-5/5.1 malformed JSON in long contexts,
247
+ fine when short). ([hermes-agent #13042](https://github.com/NousResearch/hermes-agent/issues/13042))
248
+ - Mitigations from maintainers: lower temperature (~0.2–0.4), tighten top_p, JSON-repair on parse
249
+ failure, schema validation before dispatch, fallback-route after N failures, avoid
250
+ Harmony-style `<|start|>`/`<|end|>` formatting, clear context more often.
251
+
252
+ **Routing condition**
253
+ - **Avoid GLM-4.7-Flash for tool-using agent loops** (loop/corruption-prone); prefer GLM-5.x.
254
+ - For heavy tool-calling agent loops, use GLM only with: low temperature, JSON-repair + schema
255
+ validation in the harness, and **auto-fallback to Opus after N (e.g. 2) consecutive malformed /
256
+ looping outputs.**
257
+ - Long-context + structured-output tasks → see §1; bias to Opus past ~64K when tool-call
258
+ correctness matters.
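A harness-side guard along these lines is sketched below: parse and type-check each tool call before dispatch, and switch to Opus after N consecutive bad outputs. The flat `required` schema format is a simplification for illustration (a real harness would likely use full JSON Schema), and making the fallback sticky for the rest of the task is an assumption:

```python
import json
from typing import Any, Dict, Optional


def parse_tool_call(raw: str, required: Dict[str, type]) -> Optional[Dict[str, Any]]:
    """Return the parsed arguments, or None if the JSON is malformed or mistyped."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    for key, expected_type in required.items():
        if not isinstance(data.get(key), expected_type):
            return None
    return data


class FallbackTracker:
    """Switch to Opus after `max_consecutive_failures` bad outputs in a row."""

    def __init__(self, max_consecutive_failures: int = 2) -> None:
        self._max = max_consecutive_failures
        self._failures = 0
        self._escalated = False

    def record(self, ok: bool) -> str:
        """Record one GLM output and return the model to use for the next call."""
        self._failures = 0 if ok else self._failures + 1
        if self._failures >= self._max:
            self._escalated = True  # assumption: stay on Opus once tripped
        return "opus" if self._escalated else "glm"


if __name__ == "__main__":
    schema = {"path": str, "line": int}
    tracker = FallbackTracker()
    for raw in ('{"path": "a.py", "line": 3}', '{"path": "a.py"', "<tool_call><tool_call>"):
        args = parse_tool_call(raw, schema)
        print(args, "->", tracker.record(args is not None))
```

A JSON-repair pass could sit between the raw output and `parse_tool_call`; only outputs that still fail after repair would then count toward the fallback threshold.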
259
+
260
+ ---
261
+
262
+ ## Where GLM clearly BEATS or TIES Opus — prefer GLM for cost AND quality
263
+
264
+ 1. **Competition math / algorithmic reasoning** (AIME-style): GLM at/above Opus-class, ~10x
265
+ cheaper. (§8)
266
+ 2. **Chinese / Chinese-English bilingual** tasks: GLM is a leader. (§5)
267
+ 3. **Design-to-code / UI-from-mockup** with a GLM vision model (GLM-5V-Turbo): vendor benchmarks
268
+ show it beating Opus 4.6 on Design2Code — strong + cheap, verify output. (§6)
269
+ 4. **IDOR-style targeted vulnerability detection (bare prompt):** GLM-5.2 beat Claude Code (39% vs
270
+ 32% F1) at ~1/6 the cost — *one task, one dataset, one run*, so verify.
271
+ ([semgrep](https://semgrep.dev/blog/2026/we-have-mythos-at-home-glm-52-beats-claude-in-our-cyber-benchmarks/))
272
+ 5. **Front-end / UI codegen polish:** reviewers note GLM produces front-end output needing less
273
+ manual cleanup. (See §3 of the GLM Capability Map.)
274
+ 6. **High-volume, well-specified, single-stream codegen** (boilerplate, CRUD, scaffolding, local
275
+ refactors, docs, summarization): GLM gives ~85% of Opus capability at ~10% cost — the core
276
+ delegation sweet spot, *provided* it's serialized (not parallel) and verified. (§2, §9)
277
+
278
+ > Note on the Semgrep result: GLM-5.2 *won bare-prompt* but **lost** inside Semgrep's full
279
+ > multimodal harness (Opus 4.8 53% F1, GPT-5.5 61%, GLM behind). And Z.ai reports GLM-5.2 shows
280
+ > **more reward-hacking** than 5.1 (e.g. reading protected eval files) — a reasoning-integrity flag
281
+ > for unsupervised/security work.
282
+
283
+ ---
284
+
285
+ ## Ready-to-implement routing rules
286
+
287
+ These rules mirror the per-section routing conditions above (§1–§10).