npm - glm-mcp-claude - Versions diffs - 1.0.0 → 1.1.1 - Mend

glm-mcp-claude 1.0.0 → 1.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10) hide show

package/README.md +11 -15
package/agents/glm.md +14 -4
package/docs/RULES.md +2 -1
package/glm-mcp/.env.example +6 -1
package/glm-mcp/package.json +1 -1
package/glm-mcp/src/index.js +20 -5
package/glm-mcp/src/router.js +19 -5
package/hooks/glm_subagent_router.mjs +28 -9
package/install.mjs +1 -1
package/package.json +1 -1

package/README.md CHANGED Viewed

@@ -19,9 +19,9 @@ files directly. One command to install.
 > **no separate pay-per-token Anthropic API key required**. Only GLM needs a (cheap) Z.ai key.
 > Opus orchestrates on your subscription; GLM does the heavy lifting for a fraction of the cost.
-![The glm subagent (orchestrated by Haiku 4.5, the cheap layer) reading this repo and offloading generation to GLM](assets/demo-glm-subagent-summary.png)
+![Directly calling the GLM agent to write a file end-to-end on disk](assets/demo-glm-agent-umbrella.png)
-<sub>↑ The `glm` subagent (orchestrated by Haiku 4.5, the cheap layer) reading the repo and offloading the heavy work to GLM via the MCP tools — the Opus → Haiku → GLM hybrid in action.</sub>
+<sub>↑ **Directly calling the GLM agent (`glm_agent`).** Prompt: *"write a 2000-word Shakespearean essay about the usefulness of an umbrella into my Desktop."* GLM created the file itself — **18 iterations, ~$0.064** — Opus never touched the keys. GLM reads, writes, edits, and runs your files directly.</sub>
 ```bash
 # from npm:
@@ -122,21 +122,16 @@ tool-heavy dependent loops, huge context, vision, or anything you mark sensitive
 | `glm_delegate` | GLM tokens | Text in → text out. GLM drafts; you place it. |
 | `glm_agent` | GLM tokens | GLM works your repo directly (read/write/edit/bash). Returns a diff + action log + git revert; supports `dry_run` (propose, don't write). |
-### Example: directly calling the GLM agent
+### Example: delegating a read-and-summarize task
-A real run — asking GLM (via `glm_agent`) to write a file end-to-end on disk:
+The `glm` subagent — its cheap Haiku layer driving, GLM doing the heavy lifting — reading this
+whole repo and summarizing it:
-![GLM agent writing a 2000-word Shakespearean essay to disk in 18 iterations for about 6 cents](assets/demo-glm-agent-umbrella.png)
+![The glm subagent (Haiku 4.5) reading the repo and offloading generation to GLM](assets/demo-glm-subagent-summary.png)
-> **Prompt:** *"Using the GLM agent `glm_agent`, write a 2000-word essay in Shakespearean format about the usefulness of an umbrella, into my Desktop."*
-GLM did it itself — created the file directly, no round-tripping the content through the main agent:
-- **Output:** `Umbrella-Essay-Shakespeare.md` — ~2,260 words of Early Modern English (*thee/thou/thy*, *doth/hath*) with two blank-verse interludes
-- **Work:** 18 tool-loop iterations; **one** file created, nothing existing touched
-- **Cost:** ~**$0.064** — a fraction of running the same task on Opus
-That's the point: the orchestrator stays on Opus while `glm_agent` does the heavy, file-touching work for cents.
+<sub>Orchestrated by Haiku 4.5 (the cheap layer), offloading the token-heavy work to GLM via the
+MCP tools — the Opus → Haiku → GLM hybrid in action. The orchestrator stays on Opus while GLM does
+the file-touching / heavy work for cents.</sub>
 ---
@@ -159,7 +154,8 @@ fine letting it modify.
 |---|---|---|
 | `GLM_API_KEY` | — | Your Z.ai key. **Required.** |
 | `GLM_BASE_URL` | `https://api.z.ai/api/anthropic` | Anthropic-compatible endpoint. |
-| `GLM_COST_BIAS` | `1.5` | How hard to favor GLM (it's ~10× cheaper). Higher = more GLM; `0` = decide on capability only. |
+| `GLM_USE_HAIKU` | `off` | **Off (default) skips the Haiku `glm` subagent and calls GLM directly** (`glm_agent`), so *all* tokens stay on GLM. Set `on` to allow the Haiku-orchestrated subagent (it spends some Claude tokens to orchestrate). |
+| `GLM_COST_BIAS` | `7` | How hard to favor GLM. **Default `7` → GLM carries ~98–100% of tasks** (Opus only for vision / parallel / >128K context / sensitive / heavy tool-loops). Lower it (e.g. `1.5`) to send more hard tasks (debugging, architecture, security) to Opus; `0` = decide on capability alone. |
 | `GLM_CAP` | `off` | Output-token cap. **Off by default** = generous (up to 131072 per call). Set `on` to enforce `GLM_MAX_TOKENS` and rein in spend. |
 | `GLM_MAX_TOKENS` | `32768` | The hard per-call limit applied **only when `GLM_CAP=on`**. (`max_tokens` is a ceiling, not a target — you pay for actual output.) |
 | `GLM_MAX_TOKENS_CEILING` | `131072` | The generous default used when the cap is **off**. |

package/agents/glm.md CHANGED Viewed

@@ -14,12 +14,18 @@ model: haiku
 You are the **GLM delegate** — a full subagent with the same tools as any subagent
 (Read, Grep, Glob, Write, Edit, Bash, …) PLUS the GLM tools. Your edge is COST: GLM
-(~10x cheaper than Opus) does the heavy lifting. You have two ways to use it — pick one:
+(~10x cheaper than Opus) does the heavy lifting.
-### Preferred for coding tasks: `glm_agent` (GLM works the files directly)
-For most "go do this in the repo" tasks, hand the whole thing to `glm_agent`. It runs GLM
+> ⚠️ **Token rule (important):** *you* run on Haiku (a Claude model). Anything **you** write with
+> your own Write/Edit/Bash spends **Claude tokens, zero GLM**. Only the `glm_agent` / `glm_delegate`
+> tools spend **GLM tokens**. So **strongly prefer `glm_agent`** for real work: let GLM do the
+> reading/writing/running. Use your own tools mainly to gather context and to verify the result —
+> not to produce the output yourself. This keeps the burden (and the tokens) on GLM.
+### Default for coding tasks: `glm_agent` (GLM works the files directly)
+For essentially every "go do this in the repo" task, hand the whole thing to `glm_agent`. It runs GLM
 as a real agent with its own read/write/edit/bash tools, so GLM inspects and edits the code
-itself and runs tests — end to end. Call it with:
+itself and runs tests — end to end, on GLM tokens. Call it with:
 - `task`: the self-contained coding task.
 - `workdir`: the **absolute path of the project root** (pass it explicitly).
 - `model`: leave `auto` (peak-aware); `thinking: true` for harder work.
@@ -36,6 +42,10 @@ call `glm_delegate` (task + pasted context), then apply it with your own Write/E
 2. **Serialize GLM calls** — one at a time (GLM caps concurrency ~1).
 3. **Verify before returning.** Build/lint/test or re-read. If GLM's output is wrong or it loops,
    retry once with a sharper prompt; if still bad, do the critical part yourself or escalate to Opus.
+4. **Always end your report with the GLM stats.** `glm_agent` prints a `=== GLM STATS ===` block
+   (model, tokens delegated, iterations, cost); `glm_delegate` prints a `[GLM delegated … tokens to
+   <model>]` line. Surface these in your final message so every run clearly states **which GLM model
+   ran (e.g. glm-5.2) and how many tokens were delegated.**
 ## Operating rules
 - **Serialize GLM calls.** GLM caps concurrent requests (~1); one `glm_delegate` at a time.

package/docs/RULES.md CHANGED Viewed

@@ -19,7 +19,8 @@ GLM is **~10× cheaper** than Opus, and **still ~3–4× cheaper even at peak**
 | GLM-5.2 at peak | 1.8 / 6.6 | ~3–4× cheaper |
 | GLM-4.7 (no multiplier) | 0.4 / 1.75 | ~12× cheaper |
-So the router applies a standing **cost bias toward GLM** (`GLM_COST_BIAS`, default `1.5`):
+So the router applies a standing **cost bias toward GLM** (`GLM_COST_BIAS`, default `7` →
+GLM carries ~98–100% of tasks; lower it to hand more hard tasks to Opus):
 GLM is the default for safe-to-be-wrong work, and Opus is the exception you *pay up* for only
 when quality/risk justifies it. The catch cost can't override: on hard tasks, *cheaper-but-wrong*
 is **more** expensive (rework + Opus tokens to fix), so the capability penalties for

package/glm-mcp/.env.example CHANGED Viewed

@@ -7,7 +7,12 @@ GLM_API_KEY=your-zai-key-here
 GLM_BASE_URL=https://api.z.ai/api/anthropic
 # --- optional tuning (sensible defaults baked in) ---
-# GLM_COST_BIAS=1.5           # how hard to favor GLM for cost (~10x cheaper). Higher = more GLM; 0 = ignore price
+# GLM_USE_HAIKU=off           # off (DEFAULT) = skip the Haiku `glm` subagent and call GLM directly
+#                             # (mcp__glm__glm_agent) so ALL tokens stay on GLM. Set to `on` to allow
+#                             # the Haiku-orchestrated subagent (it spends some Claude tokens).
+# GLM_COST_BIAS=7             # how hard to favor GLM. Default 7 => GLM handles ~98-100% of tasks
+#                             # (Opus only for vision/parallel/huge-context/sensitive/heavy tool-loops).
+#                             # Lower (e.g. 1.5) to route more hard tasks to Opus; 0 = capability only.
 # GLM_MAX_CONCURRENT=1        # GLM caps in-flight requests ~1; keep at 1 unless your tier allows more
 # --- output token cap (OFF by default = generous) ---
 # By default the cap is OFF: every call may use up to GLM_MAX_TOKENS_CEILING (131072).

package/glm-mcp/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "glm-mcp",
-  "version": "1.0.0",
+  "version": "1.1.1",
   "description": "MCP server that delegates self-contained subtasks to the GLM (Zhipu/Z.ai) Anthropic-compatible API, so Claude Code can use GLM as a cheap, peak-aware subagent.",
   "type": "module",
   "bin": {

package/glm-mcp/src/index.js CHANGED Viewed

@@ -26,6 +26,7 @@ import {
   MODELS,
   resolveMaxTokens,
   MAXTOK,
+  USE_HAIKU,
 } from "./router.js";
 import { runGlmAgent } from "./glmAgent.js";
@@ -177,16 +178,26 @@ server.registerTool(
     const chosen = resolveModel(model, now);
     try {
       const r = await runGlmAgent({ model: chosen, task, context, workdir, maxTokens: resolveMaxTokens(max_tokens), thinking, dryRun: dry_run });
-      const cost = estimateCost(chosen, r.usage.input_tokens, r.usage.output_tokens, now);
+      const inTok = r.usage.input_tokens || 0;
+      const outTok = r.usage.output_tokens || 0;
+      const totalTok = inTok + outTok;
+      const cost = estimateCost(chosen, inTok, outTok, now);
+      const opusCost = estimateCost("claude-opus", inTok, outTok, now);
+      const xCheaper = cost > 0 ? Math.round(opusCost / cost) : "?";
       const banner = r.dryRun ? "*** DRY RUN — nothing was written; this is GLM's PROPOSED change for you to approve ***\n" : "";
-      const totalTok = (r.usage.input_tokens || 0) + (r.usage.output_tokens || 0);
       const header =
-        `[GLM agent] delegated ${totalTok} tokens (${r.usage.input_tokens || 0} in / ${r.usage.output_tokens || 0} out) to ${chosen} — est $${cost} | ` +
-        `dir=${r.root} | iterations=${r.iters}${r.hitCap ? " (HIT CAP -- may be incomplete)" : ""} | actions=${r.actions.length} | files=${r.changedFiles.length}`;
+        `[GLM agent] ${chosen} | dir=${r.root} | ${r.iters} iterations${r.hitCap ? " (HIT CAP -- may be incomplete)" : ""} | ${r.actions.length} actions | ${r.changedFiles.length} files`;
       const actions = r.actions.length ? `\nActions:\n- ${r.actions.join("\n- ")}` : "";
       const diff = r.diff ? `\n\n=== DIFF (review this) ===\n${r.diff}` : "\n\n(no file changes)";
       const revert = !r.dryRun && r.git && r.git.revertHint ? `\n\nRevert: ${r.git.revertHint}` : "";
-      return { content: [{ type: "text", text: clip(`${banner}${header}${actions}${diff}${revert}\n\n=== GLM SUMMARY ===\n${r.text}`) }] };
+      // Prominent stats footer, shown after every glm_agent run finishes.
+      const stats =
+        `\n\n=== GLM STATS (this subagent) ===\n` +
+        `model:      ${chosen}\n` +
+        `tokens:     ${totalTok} delegated to GLM (${inTok} in / ${outTok} out)\n` +
+        `iterations: ${r.iters}${r.hitCap ? " (hit cap)" : ""}   files changed: ${r.changedFiles.length}\n` +
+        `est. cost:  $${cost}  (~${xCheaper}x cheaper than Opus)`;
+      return { content: [{ type: "text", text: clip(`${banner}${header}${actions}${diff}${revert}\n\n=== GLM SUMMARY ===\n${r.text}${stats}`) }] };
     } catch (e) {
       return {
         isError: true,
@@ -284,6 +295,10 @@ server.registerTool(
       base_url: config.BASE_URL,
       api_key_loaded: config.hasKey,
       max_concurrent: config.MAX_CONCURRENT,
+      use_haiku_subagent: USE_HAIKU,
+      orchestration: USE_HAIKU
+        ? "Haiku `glm` subagent allowed (spends some Claude tokens to orchestrate)."
+        : "Direct GLM only (GLM_USE_HAIKU=off) -> call glm_agent directly; keeps all tokens on GLM.",
       max_tokens: {
         cap_enabled: MAXTOK.capEnabled,
         default_per_call: resolveMaxTokens(undefined),

package/glm-mcp/src/router.js CHANGED Viewed

@@ -48,11 +48,25 @@ function numEnv(name, fallback) {
   return Number.isFinite(v) ? v : fallback;
 }
-// GLM is ~10x cheaper than Opus (and still ~3-4x cheaper even at peak), so the
-// correct default is GLM unless quality/risk justifies paying up for Opus.
-// This is a standing thumb on the scale toward GLM. Raise GLM_COST_BIAS to be more
-// aggressive about cost; set 0 to ignore price and decide on capability alone.
-const COST_BIAS = numEnv("GLM_COST_BIAS", 1.5);
+function boolEnv(name, fallback) {
+  const v = (process.env[name] || "").trim().toLowerCase();
+  if (/^(1|on|true|yes)$/.test(v)) return true;
+  if (/^(0|off|false|no)$/.test(v)) return false;
+  return fallback;
+}
+// Use the Haiku-orchestrated `glm` subagent? DEFAULT false -> skip Haiku and call GLM directly
+// (mcp__glm__glm_agent), so the burden and the tokens stay on GLM (the Haiku subagent's own
+// writing would spend Claude tokens). Set GLM_USE_HAIKU=on in .env to allow the subagent path.
+export const USE_HAIKU = boolEnv("GLM_USE_HAIKU", false);
+// GLM is ~10x cheaper than Opus, so by default GLM carries the overwhelming majority of the
+// burden: with GLM_COST_BIAS=7, ~98-100% of tasks route to GLM (measured across all task types,
+// peak and off-peak). Opus is used only for what GLM genuinely can't/shouldn't do -- vision,
+// parallel fan-out, >128K context, sensitive code, and heavy dependent tool-loops (the hard
+// overrides). LOWER GLM_COST_BIAS (e.g. 1.5) if you want Opus to handle more of the hard tasks
+// (debugging, architecture, security, big refactors); set 0 to decide on capability alone.
+const COST_BIAS = numEnv("GLM_COST_BIAS", 7);
 // --- Output token policy ---------------------------------------------------
 // max_tokens is a CEILING, not a target: GLM stops when done and you're billed for

package/hooks/glm_subagent_router.mjs CHANGED Viewed

@@ -17,6 +17,21 @@ import { readFileSync } from "node:fs";
 // Layout: <claude>/hooks/glm_subagent_router.mjs  +  <claude>/glm-mcp/src/router.js
 const HERE = dirname(fileURLToPath(import.meta.url));
 const ROUTER = resolve(HERE, "..", "glm-mcp", "src", "router.js");
+const MCP_ENV = resolve(HERE, "..", "glm-mcp", ".env");
+// Load the MCP server's .env so the hook honors the SAME settings the server uses
+// (GLM_COST_BIAS, GLM_USE_HAIKU, peak window, ...). Best-effort; safe if the file is absent.
+function loadMcpEnv() {
+  try {
+    for (const line of readFileSync(MCP_ENV, "utf8").split(/\r?\n/)) {
+      const m = line.match(/^\s*([A-Z0-9_]+)\s*=\s*(.*)\s*$/);
+      if (!m) continue;
+      let v = m[2];
+      if ((v.startsWith('"') && v.endsWith('"')) || (v.startsWith("'") && v.endsWith("'"))) v = v.slice(1, -1);
+      if (process.env[m[1]] === undefined) process.env[m[1]] = v;
+    }
+  } catch {}
+}
 // Pull the most recent genuine human message from the transcript (skipping
 // tool_result turns), so we can detect when the user explicitly picked an agent.
@@ -120,6 +135,7 @@ function inferProfile(text) {
 (async () => {
   try {
+    loadMcpEnv(); // honor .env settings before importing the router
     const raw = await readStdin();
     const payload = JSON.parse(raw || "{}");
     const ti = payload.tool_input || {};
@@ -142,7 +158,7 @@ function inferProfile(text) {
     let verdict, peakNote = "";
     try {
-      const { recommend, isPeak } = await import(pathToFileURL(ROUTER).href);
+      const { recommend, isPeak, USE_HAIKU } = await import(pathToFileURL(ROUTER).href);
       const rec = recommend(profile);
       const peak = isPeak();
       peakNote = peak
@@ -151,19 +167,22 @@ function inferProfile(text) {
       if (rec.engine !== "glm") {
         verdict = `KEEP ON OPUS (inferred: ${profile.taskType}, confidence ${rec.confidence}). Why: ${rec.reasons[0] || ""}`;
       } else if (repoTask) {
-        // Hands-on repo task -> point at glm_agent directly (skip the subagent middle layer).
+        // Hands-on repo task -> call glm_agent DIRECTLY (the only path that spends GLM tokens).
+        const haikuClause = USE_HAIKU
+          ? `The Haiku "glm" subagent is allowed (GLM_USE_HAIKU=on), but glm_agent direct is cheaper (no Claude orchestration tokens).`
+          : `Do NOT do this inline yourself, and do NOT use the Haiku "glm" subagent — both burn Claude/Opus tokens and spend ZERO GLM.`;
         verdict =
           `GLM-SUITABLE repo task (inferred: ${profile.taskType}, confidence ${rec.confidence}). ` +
-          `Best path: instead of spawning ${ti.subagent_type || "a subagent"}, call mcp__glm__glm_agent directly ` +
-          `with workdir="${cwd}" so GLM (model ${rec.model}, ~10x cheaper) reads/edits the files and runs tests itself. ` +
-          `For oversight, pass dry_run:true first to review GLM's proposed diff, then call again to apply. ` +
-          `Why: ${rec.reasons[rec.reasons.length - 1] || rec.reasons[0] || ""}`;
+          `➤ CALL mcp__glm__glm_agent DIRECTLY with workdir="${cwd}" (model ${rec.model}) — the only path that ` +
+          `actually spends GLM tokens (GLM reads/edits the files and runs tests itself). ${haikuClause} ` +
+          `For oversight, pass dry_run:true first, then apply. Why: ${rec.reasons[rec.reasons.length - 1] || rec.reasons[0] || ""}`;
       } else {
-        // Pure generation -> draft via glm_delegate / the glm subagent.
+        // Pure generation -> draft via glm_delegate (spends GLM tokens); not inline (Claude tokens).
         verdict =
           `GLM-SUITABLE generation (inferred: ${profile.taskType}, confidence ${rec.confidence}). ` +
-          `Use mcp__glm__glm_delegate (or the "glm" subagent) to draft this (model ${rec.model}, ~10x cheaper); ` +
-          `then place/verify the result yourself. Why: ${rec.reasons[rec.reasons.length - 1] || rec.reasons[0] || ""}`;
+          `➤ CALL mcp__glm__glm_delegate (model ${rec.model}) to generate this — that spends GLM tokens. ` +
+          `Writing it yourself instead spends Claude tokens and zero GLM. Then place/verify the result. ` +
+          `Why: ${rec.reasons[rec.reasons.length - 1] || rec.reasons[0] || ""}`;
       }
     } catch {
       verdict = `Could not load router; GLM delegate is still available (mcp__glm__glm_agent for repo tasks, mcp__glm__glm_delegate for drafts, mcp__glm__glm_recommend to decide).`;

package/install.mjs CHANGED Viewed

@@ -102,7 +102,7 @@ writeFileSync(settingsPath, JSON.stringify(settings, null, 2) + "\n");
 // 6. Append the global delegation policy (idempotent).
 step("Adding global delegation policy to CLAUDE.md");
 const claudeMd = join(CLAUDE, "CLAUDE.md");
-const policy = `\n# Delegation (GLM hybrid)\n\nPrefer offloading well-specified, self-contained subtasks. When you spawn a subagent (the\n\`Task\` tool), a PreToolUse hook injects a \`[GLM router]\` line advising whether to route it to\nthe cheaper \`glm\` subagent / \`mcp__glm__glm_agent\` or keep it on Opus -- follow that advice.\nIf the user explicitly named an agent, honor it. Full rules: \`~/.claude/glm-mcp/README.md\`.\n`;
+const policy = `\n# Delegation (GLM hybrid)\n\nPrefer offloading well-specified, self-contained subtasks to GLM. When you spawn a subagent (the\n\`Task\` tool), a PreToolUse hook injects a \`[GLM router]\` line advising GLM vs Opus -- follow it.\nIf the user explicitly named an agent, honor it. Full rules: \`~/.claude/glm-mcp/README.md\`.\n\n**To actually spend GLM tokens (not Claude tokens): call \`mcp__glm__glm_agent\` directly** for repo\nwork, or \`mcp__glm__glm_delegate\` for pure generation. Prefer calling \`glm_agent\` directly over the\n\`glm\` subagent (which runs on Haiku -- its own writing spends Claude tokens; only the glm_* tools\nspend GLM). If a task is GLM-suitable, don't do it inline yourself (Opus/Claude tokens, zero GLM).\n`;
 const existing = existsSync(claudeMd) ? readFileSync(claudeMd, "utf8") : "";
 if (!existing.includes("[GLM router]")) {
   writeFileSync(claudeMd, existing + policy);

package/package.json CHANGED Viewed

@@ -1,6 +1,6 @@
 {
   "name": "glm-mcp-claude",
-  "version": "1.0.0",
+  "version": "1.1.1",
   "description": "GLM (Zhipu/Z.ai) as a cheap, full-capability subagent for Claude Code — auto-routing between Opus and GLM, a file-editing agent with diff/dry-run/git-revert oversight, and a one-command installer.",
   "type": "module",
   "bin": {