glm-mcp-claude 1.0.0 → 1.1.1

package/README.md CHANGED
@@ -19,9 +19,9 @@ files directly. One command to install.
19
19
  > **no separate pay-per-token Anthropic API key required**. Only GLM needs a (cheap) Z.ai key.
20
20
  > Opus orchestrates on your subscription; GLM does the heavy lifting for a fraction of the cost.
21
21
 
22
- ![The glm subagent (orchestrated by Haiku 4.5, the cheap layer) reading this repo and offloading generation to GLM](assets/demo-glm-subagent-summary.png)
22
+ ![Directly calling the GLM agent to write a file end-to-end on disk](assets/demo-glm-agent-umbrella.png)
23
23
 
24
- <sub>↑ The `glm` subagent (orchestrated by Haiku 4.5, the cheap layer) reading the repo and offloading the heavy work to GLM via the MCP tools: the Opus → Haiku → GLM hybrid in action.</sub>
24
+ <sub>↑ **Directly calling the GLM agent (`glm_agent`).** Prompt: *"write a 2000-word Shakespearean essay about the usefulness of an umbrella into my Desktop."* GLM created the file itself (**18 iterations, ~$0.064**) — Opus never touched the keys. GLM reads, writes, edits, and runs your files directly.</sub>
25
25
 
26
26
  ```bash
27
27
  # from npm:
@@ -122,21 +122,16 @@ tool-heavy dependent loops, huge context, vision, or anything you mark sensitive
122
122
  | `glm_delegate` | GLM tokens | Text in → text out. GLM drafts; you place it. |
123
123
  | `glm_agent` | GLM tokens | GLM works your repo directly (read/write/edit/bash). Returns a diff + action log + git revert; supports `dry_run` (propose, don't write). |
124
124
 
125
- ### Example: directly calling the GLM agent
125
+ ### Example: delegating a read-and-summarize task
126
126
 
127
- A real run, asking GLM (via `glm_agent`) to write a file end-to-end on disk:
127
+ The `glm` subagent (its cheap Haiku layer driving, GLM doing the heavy lifting) reading this
128
+ whole repo and summarizing it:
128
129
 
129
- ![GLM agent writing a 2000-word Shakespearean essay to disk in 18 iterations for about 6 cents](assets/demo-glm-agent-umbrella.png)
130
+ ![The glm subagent (Haiku 4.5) reading the repo and offloading generation to GLM](assets/demo-glm-subagent-summary.png)
130
131
 
131
- > **Prompt:** *"Using the GLM agent `glm_agent`, write a 2000-word essay in Shakespearean format about the usefulness of an umbrella, into my Desktop."*
132
-
133
- GLM did it itself — created the file directly, no round-tripping the content through the main agent:
134
-
135
- - **Output:** `Umbrella-Essay-Shakespeare.md` — ~2,260 words of Early Modern English (*thee/thou/thy*, *doth/hath*) with two blank-verse interludes
136
- - **Work:** 18 tool-loop iterations; **one** file created, nothing existing touched
137
- - **Cost:** ~**$0.064** — a fraction of running the same task on Opus
138
-
139
- That's the point: the orchestrator stays on Opus while `glm_agent` does the heavy, file-touching work for cents.
132
+ <sub>Orchestrated by Haiku 4.5 (the cheap layer), offloading the token-heavy work to GLM via the
133
+ MCP tools — the Opus → Haiku → GLM hybrid in action. The orchestrator stays on Opus while GLM does
134
+ the file-touching / heavy work for cents.</sub>
140
135
 
141
136
  ---
142
137
 
@@ -159,7 +154,8 @@ fine letting it modify.
159
154
  |---|---|---|
160
155
  | `GLM_API_KEY` | — | Your Z.ai key. **Required.** |
161
156
  | `GLM_BASE_URL` | `https://api.z.ai/api/anthropic` | Anthropic-compatible endpoint. |
162
- | `GLM_COST_BIAS` | `1.5` | How hard to favor GLM (it's ~10× cheaper). Higher = more GLM; `0` = decide on capability only. |
157
+ | `GLM_USE_HAIKU` | `off` | **Off (default) skips the Haiku `glm` subagent and calls GLM directly** (`glm_agent`), so *all* tokens stay on GLM. Set `on` to allow the Haiku-orchestrated subagent (it spends some Claude tokens to orchestrate). |
158
+ | `GLM_COST_BIAS` | `7` | How hard to favor GLM. **Default `7` → GLM carries ~98–100% of tasks** (Opus only for vision / parallel / >128K context / sensitive / heavy tool-loops). Lower it (e.g. `1.5`) to send more hard tasks (debugging, architecture, security) to Opus; `0` = decide on capability alone. |
163
159
  | `GLM_CAP` | `off` | Output-token cap. **Off by default** = generous (up to 131072 per call). Set `on` to enforce `GLM_MAX_TOKENS` and rein in spend. |
164
160
  | `GLM_MAX_TOKENS` | `32768` | The hard per-call limit applied **only when `GLM_CAP=on`**. (`max_tokens` is a ceiling, not a target — you pay for actual output.) |
165
161
  | `GLM_MAX_TOKENS_CEILING` | `131072` | The generous default used when the cap is **off**. |
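
The three token-cap rows above interact, so a small sketch may help. It is only an illustration of the behavior the table describes, not the package's `resolveMaxTokens` (whose body is not part of this diff). The function name `sketchResolveMaxTokens` and its options object are hypothetical.

```javascript
// Hypothetical sketch of the cap policy described in the table above.
// Assumption: a per-call request is clamped to whichever limit is active.
function sketchResolveMaxTokens(requested, { capEnabled, maxTokens = 32768, ceiling = 131072 }) {
  // GLM_CAP=on  -> the hard limit is GLM_MAX_TOKENS
  // GLM_CAP=off -> the generous GLM_MAX_TOKENS_CEILING applies
  const limit = capEnabled ? maxTokens : ceiling;
  const n = Number(requested);
  // No (or invalid) request: fall back to the active limit.
  return Number.isFinite(n) && n > 0 ? Math.min(n, limit) : limit;
}

console.log(sketchResolveMaxTokens(undefined, { capEnabled: false }));
console.log(sketchResolveMaxTokens(50000, { capEnabled: true }));
```

Either way the value is a ceiling rather than a target, so raising it does not by itself increase spend.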
package/agents/glm.md CHANGED
@@ -14,12 +14,18 @@ model: haiku
14
14
 
15
15
  You are the **GLM delegate** — a full subagent with the same tools as any subagent
16
16
  (Read, Grep, Glob, Write, Edit, Bash, …) PLUS the GLM tools. Your edge is COST: GLM
17
- (~10x cheaper than Opus) does the heavy lifting. You have two ways to use it — pick one:
17
+ (~10x cheaper than Opus) does the heavy lifting.
18
18
 
19
- ### Preferred for coding tasks: `glm_agent` (GLM works the files directly)
20
- For most "go do this in the repo" tasks, hand the whole thing to `glm_agent`. It runs GLM
19
+ > ⚠️ **Token rule (important):** *you* run on Haiku (a Claude model). Anything **you** write with
20
+ > your own Write/Edit/Bash spends **Claude tokens, zero GLM**. Only the `glm_agent` / `glm_delegate`
21
+ > tools spend **GLM tokens**. So **strongly prefer `glm_agent`** for real work: let GLM do the
22
+ > reading/writing/running. Use your own tools mainly to gather context and to verify the result —
23
+ > not to produce the output yourself. This keeps the burden (and the tokens) on GLM.
24
+
25
+ ### Default for coding tasks: `glm_agent` (GLM works the files directly)
26
+ For essentially every "go do this in the repo" task, hand the whole thing to `glm_agent`. It runs GLM
21
27
  as a real agent with its own read/write/edit/bash tools, so GLM inspects and edits the code
22
- itself and runs tests — end to end. Call it with:
28
+ itself and runs tests — end to end, on GLM tokens. Call it with:
23
29
  - `task`: the self-contained coding task.
24
30
  - `workdir`: the **absolute path of the project root** (pass it explicitly).
25
31
  - `model`: leave `auto` (peak-aware); `thinking: true` for harder work.
@@ -36,6 +42,10 @@ call `glm_delegate` (task + pasted context), then apply it with your own Write/E
36
42
  2. **Serialize GLM calls** — one at a time (GLM caps concurrency ~1).
37
43
  3. **Verify before returning.** Build/lint/test or re-read. If GLM's output is wrong or it loops,
38
44
  retry once with a sharper prompt; if still bad, do the critical part yourself or escalate to Opus.
45
+ 4. **Always end your report with the GLM stats.** `glm_agent` prints a `=== GLM STATS ===` block
46
+ (model, tokens delegated, iterations, cost); `glm_delegate` prints a `[GLM delegated … tokens to
47
+ <model>]` line. Surface these in your final message so every run clearly states **which GLM model
48
+ ran (e.g. glm-5.2) and how many tokens were delegated.**
39
49
 
40
50
  ## Operating rules
41
51
  - **Serialize GLM calls.** GLM caps concurrent requests (~1); one `glm_delegate` at a time.
package/docs/RULES.md CHANGED
@@ -19,7 +19,8 @@ GLM is **~10× cheaper** than Opus, and **still ~3–4× cheaper even at peak**
19
19
  | GLM-5.2 at peak | 1.8 / 6.6 | ~3–4× cheaper |
20
20
  | GLM-4.7 (no multiplier) | 0.4 / 1.75 | ~12× cheaper |
21
21
 
22
- So the router applies a standing **cost bias toward GLM** (`GLM_COST_BIAS`, default `1.5`):
22
+ So the router applies a standing **cost bias toward GLM** (`GLM_COST_BIAS`, default `7` →
23
+ GLM carries ~98–100% of tasks; lower it to hand more hard tasks to Opus):
23
24
  GLM is the default for safe-to-be-wrong work, and Opus is the exception you *pay up* for only
24
25
  when quality/risk justifies it. The catch cost can't override: on hard tasks, *cheaper-but-wrong*
25
26
  is **more** expensive (rework + Opus tokens to fix), so the capability penalties for
@@ -7,7 +7,12 @@ GLM_API_KEY=your-zai-key-here
7
7
  GLM_BASE_URL=https://api.z.ai/api/anthropic
8
8
 
9
9
  # --- optional tuning (sensible defaults baked in) ---
10
- # GLM_COST_BIAS=1.5 # how hard to favor GLM for cost (~10x cheaper). Higher = more GLM; 0 = ignore price
10
+ # GLM_USE_HAIKU=off # off (DEFAULT) = skip the Haiku `glm` subagent and call GLM directly
11
+ # # (mcp__glm__glm_agent) so ALL tokens stay on GLM. Set to `on` to allow
12
+ # # the Haiku-orchestrated subagent (it spends some Claude tokens).
13
+ # GLM_COST_BIAS=7 # how hard to favor GLM. Default 7 => GLM handles ~98-100% of tasks
14
+ # # (Opus only for vision/parallel/huge-context/sensitive/heavy tool-loops).
15
+ # # Lower (e.g. 1.5) to route more hard tasks to Opus; 0 = capability only.
11
16
  # GLM_MAX_CONCURRENT=1 # GLM caps in-flight requests ~1; keep at 1 unless your tier allows more
12
17
  # --- output token cap (OFF by default = generous) ---
13
18
  # By default the cap is OFF: every call may use up to GLM_MAX_TOKENS_CEILING (131072).
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "glm-mcp",
3
- "version": "1.0.0",
3
+ "version": "1.1.1",
4
4
  "description": "MCP server that delegates self-contained subtasks to the GLM (Zhipu/Z.ai) Anthropic-compatible API, so Claude Code can use GLM as a cheap, peak-aware subagent.",
5
5
  "type": "module",
6
6
  "bin": {
@@ -26,6 +26,7 @@ import {
26
26
  MODELS,
27
27
  resolveMaxTokens,
28
28
  MAXTOK,
29
+ USE_HAIKU,
29
30
  } from "./router.js";
30
31
  import { runGlmAgent } from "./glmAgent.js";
31
32
 
@@ -177,16 +178,26 @@ server.registerTool(
177
178
  const chosen = resolveModel(model, now);
178
179
  try {
179
180
  const r = await runGlmAgent({ model: chosen, task, context, workdir, maxTokens: resolveMaxTokens(max_tokens), thinking, dryRun: dry_run });
180
- const cost = estimateCost(chosen, r.usage.input_tokens, r.usage.output_tokens, now);
181
+ const inTok = r.usage.input_tokens || 0;
182
+ const outTok = r.usage.output_tokens || 0;
183
+ const totalTok = inTok + outTok;
184
+ const cost = estimateCost(chosen, inTok, outTok, now);
185
+ const opusCost = estimateCost("claude-opus", inTok, outTok, now);
186
+ const xCheaper = cost > 0 ? Math.round(opusCost / cost) : "?";
181
187
  const banner = r.dryRun ? "*** DRY RUN — nothing was written; this is GLM's PROPOSED change for you to approve ***\n" : "";
182
- const totalTok = (r.usage.input_tokens || 0) + (r.usage.output_tokens || 0);
183
188
  const header =
184
- `[GLM agent] delegated ${totalTok} tokens (${r.usage.input_tokens || 0} in / ${r.usage.output_tokens || 0} out) to ${chosen} est $${cost} | ` +
185
- `dir=${r.root} | iterations=${r.iters}${r.hitCap ? " (HIT CAP -- may be incomplete)" : ""} | actions=${r.actions.length} | files=${r.changedFiles.length}`;
189
+ `[GLM agent] ${chosen} | dir=${r.root} | ${r.iters} iterations${r.hitCap ? " (HIT CAP -- may be incomplete)" : ""} | ${r.actions.length} actions | ${r.changedFiles.length} files`;
186
190
  const actions = r.actions.length ? `\nActions:\n- ${r.actions.join("\n- ")}` : "";
187
191
  const diff = r.diff ? `\n\n=== DIFF (review this) ===\n${r.diff}` : "\n\n(no file changes)";
188
192
  const revert = !r.dryRun && r.git && r.git.revertHint ? `\n\nRevert: ${r.git.revertHint}` : "";
189
- return { content: [{ type: "text", text: clip(`${banner}${header}${actions}${diff}${revert}\n\n=== GLM SUMMARY ===\n${r.text}`) }] };
193
+ // Prominent stats footer, shown after every glm_agent run finishes.
194
+ const stats =
195
+ `\n\n=== GLM STATS (this subagent) ===\n` +
196
+ `model: ${chosen}\n` +
197
+ `tokens: ${totalTok} delegated to GLM (${inTok} in / ${outTok} out)\n` +
198
+ `iterations: ${r.iters}${r.hitCap ? " (hit cap)" : ""} files changed: ${r.changedFiles.length}\n` +
199
+ `est. cost: $${cost} (~${xCheaper}x cheaper than Opus)`;
200
+ return { content: [{ type: "text", text: clip(`${banner}${header}${actions}${diff}${revert}\n\n=== GLM SUMMARY ===\n${r.text}${stats}`) }] };
190
201
  } catch (e) {
191
202
  return {
192
203
  isError: true,
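
To make the new stats footer from this hunk easier to review, here is a standalone sketch of the same string assembly. The function name `formatGlmStats` and the sample numbers are hypothetical. The two cost figures are passed in directly because `estimateCost` and its price table are not shown in this diff.

```javascript
// Sketch of the footer built in the hunk above, isolated from the MCP server.
// Assumption: cost and opusCost are plain numbers computed elsewhere.
function formatGlmStats({ model, inTok, outTok, iters, hitCap, filesChanged, cost, opusCost }) {
  const totalTok = inTok + outTok;
  // Same guard as the diff: avoid dividing by zero when the cost estimate is 0.
  const xCheaper = cost > 0 ? Math.round(opusCost / cost) : "?";
  return (
    `=== GLM STATS (this subagent) ===\n` +
    `model: ${model}\n` +
    `tokens: ${totalTok} delegated to GLM (${inTok} in / ${outTok} out)\n` +
    `iterations: ${iters}${hitCap ? " (hit cap)" : ""}  files changed: ${filesChanged}\n` +
    `est. cost: $${cost} (~${xCheaper}x cheaper than Opus)`
  );
}

// Made-up inputs, for illustration only.
console.log(formatGlmStats({
  model: "glm-5.2", inTok: 12000, outTok: 3000,
  iters: 18, hitCap: false, filesChanged: 1,
  cost: 0.05, opusCost: 0.5,
}));
```

The `"?"` fallback matters when the estimate comes back as `0`: the ratio would otherwise print as `Infinity`.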
@@ -284,6 +295,10 @@ server.registerTool(
284
295
  base_url: config.BASE_URL,
285
296
  api_key_loaded: config.hasKey,
286
297
  max_concurrent: config.MAX_CONCURRENT,
298
+ use_haiku_subagent: USE_HAIKU,
299
+ orchestration: USE_HAIKU
300
+ ? "Haiku `glm` subagent allowed (spends some Claude tokens to orchestrate)."
301
+ : "Direct GLM only (GLM_USE_HAIKU=off) -> call glm_agent directly; keeps all tokens on GLM.",
287
302
  max_tokens: {
288
303
  cap_enabled: MAXTOK.capEnabled,
289
304
  default_per_call: resolveMaxTokens(undefined),
@@ -48,11 +48,25 @@ function numEnv(name, fallback) {
48
48
  return Number.isFinite(v) ? v : fallback;
49
49
  }
50
50
 
51
- // GLM is ~10x cheaper than Opus (and still ~3-4x cheaper even at peak), so the
52
- // correct default is GLM unless quality/risk justifies paying up for Opus.
53
- // This is a standing thumb on the scale toward GLM. Raise GLM_COST_BIAS to be more
54
- // aggressive about cost; set 0 to ignore price and decide on capability alone.
55
- const COST_BIAS = numEnv("GLM_COST_BIAS", 1.5);
51
+ function boolEnv(name, fallback) {
52
+ const v = (process.env[name] || "").trim().toLowerCase();
53
+ if (/^(1|on|true|yes)$/.test(v)) return true;
54
+ if (/^(0|off|false|no)$/.test(v)) return false;
55
+ return fallback;
56
+ }
57
+
58
+ // Use the Haiku-orchestrated `glm` subagent? DEFAULT false -> skip Haiku and call GLM directly
59
+ // (mcp__glm__glm_agent), so the burden and the tokens stay on GLM (the Haiku subagent's own
60
+ // writing would spend Claude tokens). Set GLM_USE_HAIKU=on in .env to allow the subagent path.
61
+ export const USE_HAIKU = boolEnv("GLM_USE_HAIKU", false);
62
+
63
+ // GLM is ~10x cheaper than Opus, so by default GLM carries the overwhelming majority of the
64
+ // burden: with GLM_COST_BIAS=7, ~98-100% of tasks route to GLM (measured across all task types,
65
+ // peak and off-peak). Opus is used only for what GLM genuinely can't/shouldn't do -- vision,
66
+ // parallel fan-out, >128K context, sensitive code, and heavy dependent tool-loops (the hard
67
+ // overrides). LOWER GLM_COST_BIAS (e.g. 1.5) if you want Opus to handle more of the hard tasks
68
+ // (debugging, architecture, security, big refactors); set 0 to decide on capability alone.
69
+ const COST_BIAS = numEnv("GLM_COST_BIAS", 7);
56
70
 
57
71
  // --- Output token policy ---------------------------------------------------
58
72
  // max_tokens is a CEILING, not a target: GLM stops when done and you're billed for
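
The `boolEnv` helper added in this hunk accepts several spellings, and anything it does not recognize falls back to the default. A minimal sketch of those rules follows. The name `parseBoolSetting` is hypothetical, and it reads from a plain object rather than `process.env` so the example has no environment dependency.

```javascript
// Same matching rules as boolEnv in the hunk above, with the environment passed
// in as an argument (an assumption made for this sketch only).
function parseBoolSetting(env, name, fallback) {
  const v = (env[name] || "").trim().toLowerCase();
  if (/^(1|on|true|yes)$/.test(v)) return true;
  if (/^(0|off|false|no)$/.test(v)) return false;
  return fallback; // unset, empty, or unrecognized values keep the default
}

for (const raw of ["on", " TRUE ", "0", "no", "", "enabled"]) {
  console.log(JSON.stringify(raw), "->", parseBoolSetting({ GLM_USE_HAIKU: raw }, "GLM_USE_HAIKU", false));
}
```

One consequence worth knowing: a misspelled value such as `enabled` does not turn the Haiku subagent on. It silently keeps the default.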
@@ -17,6 +17,21 @@ import { readFileSync } from "node:fs";
17
17
  // Layout: <claude>/hooks/glm_subagent_router.mjs + <claude>/glm-mcp/src/router.js
18
18
  const HERE = dirname(fileURLToPath(import.meta.url));
19
19
  const ROUTER = resolve(HERE, "..", "glm-mcp", "src", "router.js");
20
+ const MCP_ENV = resolve(HERE, "..", "glm-mcp", ".env");
21
+
22
+ // Load the MCP server's .env so the hook honors the SAME settings the server uses
23
+ // (GLM_COST_BIAS, GLM_USE_HAIKU, peak window, ...). Best-effort; safe if the file is absent.
24
+ function loadMcpEnv() {
25
+ try {
26
+ for (const line of readFileSync(MCP_ENV, "utf8").split(/\r?\n/)) {
27
+ const m = line.match(/^\s*([A-Z0-9_]+)\s*=\s*(.*)\s*$/);
28
+ if (!m) continue;
29
+ let v = m[2];
30
+ if ((v.startsWith('"') && v.endsWith('"')) || (v.startsWith("'") && v.endsWith("'"))) v = v.slice(1, -1);
31
+ if (process.env[m[1]] === undefined) process.env[m[1]] = v;
32
+ }
33
+ } catch {}
34
+ }
20
35
 
21
36
  // Pull the most recent genuine human message from the transcript (skipping
22
37
  // tool_result turns), so we can detect when the user explicitly picked an agent.
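
The `loadMcpEnv` parser in this hunk has a few edge cases that are easier to see in isolation. The sketch below reuses the same regular expression and quote stripping, but returns a new object instead of mutating `process.env`. The name `parseEnvText` and that return-value design are assumptions made for illustration.

```javascript
// Sketch of the .env line parsing from the hunk above.
// Assumption: results go into a copy of `existing` rather than process.env.
function parseEnvText(text, existing = {}) {
  const out = { ...existing };
  for (const line of text.split(/\r?\n/)) {
    const m = line.match(/^\s*([A-Z0-9_]+)\s*=\s*(.*)\s*$/);
    if (!m) continue; // blank lines and "# ..." comment lines do not match
    let v = m[2];
    // Strip one pair of matching surrounding quotes.
    if ((v.startsWith('"') && v.endsWith('"')) || (v.startsWith("'") && v.endsWith("'"))) v = v.slice(1, -1);
    // Values already present win, mirroring the `=== undefined` check.
    if (out[m[1]] === undefined) out[m[1]] = v;
  }
  return out;
}

const parsed = parseEnvText(
  'GLM_USE_HAIKU=on\n# GLM_MAX_CONCURRENT=1\nGLM_BASE_URL="https://api.z.ai/api/anthropic"\nGLM_COST_BIAS=7 # note',
  { GLM_USE_HAIKU: "off" }
);
console.log(parsed);
```

Two behaviors follow from the regex as written. Variables already set in the environment are not overridden by the file. A trailing `# comment` on an uncommented line stays part of the value, so `GLM_COST_BIAS=7 # note` yields the string `7 # note`, which is not a finite number. Whether the server's own `.env` loading treats inline comments the same way is not visible in this diff.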
@@ -120,6 +135,7 @@ function inferProfile(text) {
120
135
 
121
136
  (async () => {
122
137
  try {
138
+ loadMcpEnv(); // honor .env settings before importing the router
123
139
  const raw = await readStdin();
124
140
  const payload = JSON.parse(raw || "{}");
125
141
  const ti = payload.tool_input || {};
@@ -142,7 +158,7 @@ function inferProfile(text) {
142
158
 
143
159
  let verdict, peakNote = "";
144
160
  try {
145
- const { recommend, isPeak } = await import(pathToFileURL(ROUTER).href);
161
+ const { recommend, isPeak, USE_HAIKU } = await import(pathToFileURL(ROUTER).href);
146
162
  const rec = recommend(profile);
147
163
  const peak = isPeak();
148
164
  peakNote = peak
@@ -151,19 +167,22 @@ function inferProfile(text) {
151
167
  if (rec.engine !== "glm") {
152
168
  verdict = `KEEP ON OPUS (inferred: ${profile.taskType}, confidence ${rec.confidence}). Why: ${rec.reasons[0] || ""}`;
153
169
  } else if (repoTask) {
154
- // Hands-on repo task -> point at glm_agent directly (skip the subagent middle layer).
170
+ // Hands-on repo task -> call glm_agent DIRECTLY (the only path that spends GLM tokens).
171
+ const haikuClause = USE_HAIKU
172
+ ? `The Haiku "glm" subagent is allowed (GLM_USE_HAIKU=on), but glm_agent direct is cheaper (no Claude orchestration tokens).`
173
+ : `Do NOT do this inline yourself, and do NOT use the Haiku "glm" subagent — both burn Claude/Opus tokens and spend ZERO GLM.`;
155
174
  verdict =
156
175
  `GLM-SUITABLE repo task (inferred: ${profile.taskType}, confidence ${rec.confidence}). ` +
157
- `Best path: instead of spawning ${ti.subagent_type || "a subagent"}, call mcp__glm__glm_agent directly ` +
158
- `with workdir="${cwd}" so GLM (model ${rec.model}, ~10x cheaper) reads/edits the files and runs tests itself. ` +
159
- `For oversight, pass dry_run:true first to review GLM's proposed diff, then call again to apply. ` +
160
- `Why: ${rec.reasons[rec.reasons.length - 1] || rec.reasons[0] || ""}`;
176
+ `➤ CALL mcp__glm__glm_agent DIRECTLY with workdir="${cwd}" (model ${rec.model}) -- the only path that ` +
177
+ `actually spends GLM tokens (GLM reads/edits the files and runs tests itself). ${haikuClause} ` +
178
+ `For oversight, pass dry_run:true first, then apply. Why: ${rec.reasons[rec.reasons.length - 1] || rec.reasons[0] || ""}`;
161
179
  } else {
162
- // Pure generation -> draft via glm_delegate / the glm subagent.
180
+ // Pure generation -> draft via glm_delegate (spends GLM tokens); not inline (Claude tokens).
163
181
  verdict =
164
182
  `GLM-SUITABLE generation (inferred: ${profile.taskType}, confidence ${rec.confidence}). ` +
165
- `Use mcp__glm__glm_delegate (or the "glm" subagent) to draft this (model ${rec.model}, ~10x cheaper); ` +
166
- `then place/verify the result yourself. Why: ${rec.reasons[rec.reasons.length - 1] || rec.reasons[0] || ""}`;
183
+ `➤ CALL mcp__glm__glm_delegate (model ${rec.model}) to generate this -- that spends GLM tokens. ` +
184
+ `Writing it yourself instead spends Claude tokens and zero GLM. Then place/verify the result. ` +
185
+ `Why: ${rec.reasons[rec.reasons.length - 1] || rec.reasons[0] || ""}`;
167
186
  }
168
187
  } catch {
169
188
  verdict = `Could not load router; GLM delegate is still available (mcp__glm__glm_agent for repo tasks, mcp__glm__glm_delegate for drafts, mcp__glm__glm_recommend to decide).`;
package/install.mjs CHANGED
@@ -102,7 +102,7 @@ writeFileSync(settingsPath, JSON.stringify(settings, null, 2) + "\n");
102
102
  // 6. Append the global delegation policy (idempotent).
103
103
  step("Adding global delegation policy to CLAUDE.md");
104
104
  const claudeMd = join(CLAUDE, "CLAUDE.md");
105
- const policy = `\n# Delegation (GLM hybrid)\n\nPrefer offloading well-specified, self-contained subtasks. When you spawn a subagent (the\n\`Task\` tool), a PreToolUse hook injects a \`[GLM router]\` line advising whether to route it to\nthe cheaper \`glm\` subagent / \`mcp__glm__glm_agent\` or keep it on Opus -- follow that advice.\nIf the user explicitly named an agent, honor it. Full rules: \`~/.claude/glm-mcp/README.md\`.\n`;
105
+ const policy = `\n# Delegation (GLM hybrid)\n\nPrefer offloading well-specified, self-contained subtasks to GLM. When you spawn a subagent (the\n\`Task\` tool), a PreToolUse hook injects a \`[GLM router]\` line advising GLM vs Opus -- follow it.\nIf the user explicitly named an agent, honor it. Full rules: \`~/.claude/glm-mcp/README.md\`.\n\n**To actually spend GLM tokens (not Claude tokens): call \`mcp__glm__glm_agent\` directly** for repo\nwork, or \`mcp__glm__glm_delegate\` for pure generation. Prefer calling \`glm_agent\` directly over the\n\`glm\` subagent (which runs on Haiku -- its own writing spends Claude tokens; only the glm_* tools\nspend GLM). If a task is GLM-suitable, don't do it inline yourself (Opus/Claude tokens, zero GLM).\n`;
106
106
  const existing = existsSync(claudeMd) ? readFileSync(claudeMd, "utf8") : "";
107
107
  if (!existing.includes("[GLM router]")) {
108
108
  writeFileSync(claudeMd, existing + policy);
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "glm-mcp-claude",
3
- "version": "1.0.0",
3
+ "version": "1.1.1",
4
4
  "description": "GLM (Zhipu/Z.ai) as a cheap, full-capability subagent for Claude Code — auto-routing between Opus and GLM, a file-editing agent with diff/dry-run/git-revert oversight, and a one-command installer.",
5
5
  "type": "module",
6
6
  "bin": {