npm - @estebanforge/pi-glm-tweaks - Versions diffs - 1.0.0 - Mend

@estebanforge/pi-glm-tweaks 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (4)

package/LICENSE ADDED

@@ -0,0 +1,21 @@
+MIT License
+Copyright (c) 2026 EstebanForge
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

package/README.md ADDED

@@ -0,0 +1,73 @@
+# @estebanforge/pi-glm-tweaks
+Pi-native tweaks for Z.AI's **GLM-5.2**. Restricts the Pi thinking-level UI to the three modes GLM-5.2 actually supports (**off**, **high**, **max**), wires the native `thinkingFormat:"zai"` translation, and auto-clamps any stale level when the model is selected.
+## Install
+```
+pi install npm:@estebanforge/pi-glm-tweaks
+```
+Works with Pi's built-in `zai/glm-5.2` model out of the box, or a custom entry in `~/.pi/agent/models.json`. The extension re-registers it with the OpenAI-compat endpoint and the proper thinking map. Other Z.AI models (`zai/glm-4.7`, `zai/glm-5-turbo`, `zai/glm-5.1`, plus any custom entries) are preserved across the re-registration.
+## What it does
+GLM-5.2 ships three thinking modes (per [docs.z.ai](https://docs.z.ai/guides/capabilities/thinking)):
+| Pi thinking level | GLM-5.2 wire |
+| --- | --- |
+| `off` | `thinking: { type: "disabled" }` |
+| `high` | `thinking: { type: "enabled" }` + `reasoning_effort: "high"` |
+| `max` (Pi `xhigh`) | `thinking: { type: "enabled" }` + `reasoning_effort: "max"` |
+Pi natively exposes six thinking levels (`off`, `minimal`, `low`, `medium`, `high`, `xhigh`). GLM-5.2 does not map cleanly onto three of them: `low` and `medium` get mapped to `high` server-side, and `minimal` skips thinking. `xhigh` is the only way to reach `reasoning_effort: "max"`.
+This extension collapses that mismatch:
+1. **Re-registers `zai/glm-5.2`** on `session_start` with `api: "openai-completions"`, `baseUrl: https://api.z.ai/api/coding/paas/v4`, `compat.thinkingFormat: "zai"`, and a tight `thinkingLevelMap`:
+   ```ts
+   {
+     minimal: null,  // hidden
+     low: null,      // hidden
+     medium: null,   // hidden
+     high:   "high", // → reasoning_effort: "high"
+     xhigh:  "max",  // → reasoning_effort: "max"
+     // off omitted → supported, sends thinking.type = "disabled"
+   }
+   ```
+2. **Auto-clamps on `model_select`**: if the current level is one we hid (e.g. you switched from a model that allowed `medium`), the extension bumps it to `high` and shows a notification.
+3. **Footer hint** — sets `ctx.ui.setStatus("glm-thinking", "thinking: off | high | max")` while GLM-5.2 is the active model.
+`Shift+Tab`, `/thinking`, and the level picker all see only the three GLM-5.2 modes.
+## Token-efficiency tweaks
+GLM-5.2 overthinks on long agent loops: it can spend an entire turn on `reasoning_content` without taking a tool call. The Z.AI API does not expose a `max_thinking_tokens` parameter, so the post that popularised this observation caps thinking at the provider layer instead (mid-stream injection). This extension cannot intercept the stream, but it can approximate the win with three cheap, opt-out tweaks:
+| Flag | Default | What it does |
+| --- | --- | --- |
+| `glm-budget-nudge` | `true` | (a) Appends a soft thinking-budget fragment to the system prompt on every zai/glm-5.2 turn. (b) Per LLM call, sums the length of `reasoning_content` across prior assistant messages in the current agent loop (the one started by the most recent user prompt); once the cumulative total reaches ~2000 characters (roughly 500 English tokens), injects a one-shot hint to push the model back toward tool calls. Fires at most once per loop. The hint appears in the conversation panel as a user message prefixed `[system reminder: ...]`; that is intentional, so you can see when the ratchet fired. |
+| `glm-clear-thinking` | `true` | Forces `clear_thinking: true` on every request. The coding endpoint (`api.z.ai/api/coding/paas/v4`) defaults to preserved thinking, which silently compounds `reasoning_content` across turns. At $4.4/MTok output, this is real money. |
+| `glm-quick-disable` | `true` | For user prompts under 80 chars, forces `thinking.type: "disabled"` for that turn. Trivial questions ("what time is it") don't need deep thinking. |
+All three flags surface in `pi config` and Pi's flag editor — `pi config set glm-budget-nudge false` to disable.
+### What the tweaks cannot do
+- Cap thinking tokens at a wire level. Z.AI does not expose a thinking budget param.
+- Inject text mid-stream. No Pi hook for streaming chunk mutation.
+- Force the model to call a tool. The system prompt can ask; nothing forces it.
+- Lower `reasoning_effort` per-request. Per [KiwiGaze/glm-for-copilot #7](https://github.com/KiwiGaze/glm-for-copilot/issues/7) it's a no-op on `/chat/completions`.
+## Why this exists
+Pi's built-in `thinkingFormat: "zai"` (in `openai-completions.js`) already knows the wire translation. The catch is that a user-defined GLM-5.2 entry in `models.json` typically lacks a `thinkingLevelMap`, so the UI shows all six levels and sends invalid combinations for the ones GLM-5.2 does not support. This extension fills that gap automatically, with no manual `models.json` editing.
+## Compatibility
+- Pi (`@earendil-works/pi-coding-agent`) — any version with `registerProvider` taking effect post-bind and `thinkingFormat: "zai"` support, plus the `before_agent_start` / `context` / `before_provider_request` / `registerFlag` hooks.
+- Z.AI API key — resolved through Pi's standard auth storage (env var `ZAI_API_KEY`, `/login`, or `models.json` provider `apiKey`). The extension does not configure auth.
+## License
+MIT
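The level table in the README above can be read as a small pure function. The following is a minimal TypeScript sketch of that mapping and of the clamp described in step 2, not code from the package: `toWire`, `clampLevel`, `LEVEL_MAP`, and the type names are hypothetical, and only the level names and wire fields are taken from the README.

```typescript
// Hypothetical names throughout; only the level names and wire fields
// come from the README's table.
type PiLevel = "off" | "minimal" | "low" | "medium" | "high" | "xhigh";

interface ZaiThinkingFields {
  thinking: { type: "enabled" | "disabled" };
  reasoning_effort?: "high" | "max";
}

// null marks a level that is hidden for GLM-5.2; `off` is deliberately absent.
const LEVEL_MAP: Partial<Record<PiLevel, "high" | "max" | null>> = {
  minimal: null,
  low: null,
  medium: null,
  high: "high",
  xhigh: "max",
};

// Translate a Pi level into the request fields, or null for a hidden level.
function toWire(level: PiLevel): ZaiThinkingFields | null {
  if (level === "off") return { thinking: { type: "disabled" } };
  const effort = LEVEL_MAP[level];
  if (effort == null) return null;
  return { thinking: { type: "enabled" }, reasoning_effort: effort };
}

// Replace a hidden level with "high"; leave supported levels untouched.
function clampLevel(level: PiLevel): PiLevel {
  return LEVEL_MAP[level] === null ? "high" : level;
}

console.log(toWire("xhigh"), clampLevel("medium"));
```

Keeping `off` out of the map mirrors the README's note that an omitted key stays supported, so only an explicit `null` hides a level.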

package/extensions/index.ts ADDED

@@ -0,0 +1,289 @@
+/**
+ * pi-glm-tweaks — Pi-native tweaks for Z.AI's GLM-5.2.
+ *
+ * Restricts the Pi thinking-level UI to the three modes GLM-5.2 actually
+ * supports (off, high, max), wires the native `thinkingFormat: "zai"`
+ * translation, auto-clamps hidden levels, and applies token-efficiency
+ * hygiene (per-turn system-prompt nudge, intra-loop ratchet, wire-level
+ * clear_thinking and short-prompt quick-disable).
+ *
+ * Wire map (see https://docs.z.ai/guides/capabilities/thinking and
+ * providers/openai-completions.js in pi-ai):
+ *
+ *   Pi level  | thinking.type | reasoning_effort
+ *   ----------|---------------|------------------
+ *   off       | "disabled"    | (omitted)
+ *   high      | "enabled"     | "high"
+ *   xhigh     | "enabled"     | "max"
+ *
+ * Hidden levels (minimal, low, medium) are Pi-side concepts that don't map
+ * cleanly: low/medium get server-side-mapped to "high", minimal is a no-op
+ * for Pi's reasoning transport. Showing them invites accidental footguns.
+ *
+ * Behavior:
+ *   - On session_start, re-register the `zai` provider with GLM-5.2 redefined
+ *     against the OpenAI-compat endpoint and the tight thinkingLevelMap.
+ *     registerProvider takes effect immediately after bindCore (no /reload).
+ *   - On model_select to zai/glm-5.2, clamp a stale hidden level to "high"
+ *     and notify. Set the footer status hint.
+ *   - On model_select to any other model, clear the footer status.
+ *   - On every user turn, inject a soft system-prompt budget fragment
+ *     (`glm-budget-nudge`, default on).
+ *   - Per LLM call, count cumulative reasoning_content; if over a
+ *     threshold, inject a one-shot user-side hint to push the model back
+ *     toward tool calls (`glm-budget-nudge`).
+ *   - On every outgoing request, force `clear_thinking: true` (the coding
+ *     endpoint defaults to preserved thinking, which silently compounds
+ *     `reasoning_content` across turns). `glm-clear-thinking`, default on.
+ *   - On short user prompts (<80 chars), force `thinking.type: "disabled"`
+ *     to save tokens on trivial turns. `glm-quick-disable`, default on.
+ *
+ * Auth is untouched. The provider's existing key (ZAI_API_KEY env, /login,
+ * or models.json apiKey) continues to resolve against the new baseUrl.
+ */
+import type { ExtensionAPI } from "@earendil-works/pi-coding-agent";
+const PROVIDER = "zai";
+const MODEL_ID = "glm-5.2";
+const ZAI_CODING_BASE_URL = "https://api.z.ai/api/coding/paas/v4";
+// Pi thinking-level keys we hide for GLM-5.2. Listed explicitly so the map
+// stays grep-friendly; any level not present (notably `off`) is supported
+// with the provider's default mapping (here: thinking.type="disabled").
+const HIDDEN_LEVELS = new Set(["minimal", "low", "medium"]);
+// Token-efficiency tuning constants. Hardcoded for v1: exposing them as
+// flags would be over-engineering for a single-model extension. Bump these
+// in a future minor if users report the ratchet firing too eagerly or not
+// eagerly enough.
+const SHORT_PROMPT_THRESHOLD = 80;
+const RATCHET_THRESHOLD_CHARS = 2_000;
+// Soft system-prompt fragment appended to every zai/glm-5.2 turn when
+// the budget-nudge flag is on. No "I'm overthinking" ack string — that's
+// unenforceable (model may or may not emit it, may emit it in Chinese,
+// and we'd have to detect it).
+const BUDGET_FRAGMENT = `
+<glm-thinking-budget>
+You are operating under a per-turn thinking budget. Behave accordingly:
+- Cap each thinking block at ~500 tokens. Don't ruminate; commit to a tool call or response.
+- Take a tool call every 200-300 thinking tokens. Don't sit and speculate without acting.
+- Prefer a concrete tool call over further internal deliberation.
+</glm-thinking-budget>`;
+// Redefined glm-5.2 model entry. `cost` mirrors the built-in (Z.AI does
+// not publish per-token rates; zeros is conservative). thinkingLevelMap
+// doubles as UI-hide (`null`) and wire-level safety net: Pi's zai branch
+// in openai-completions.js reads this map for reasoning_effort, and a
+// null entry produces no reasoning_effort field on the wire. baseUrl is
+// set per-model here, in addition to the provider-level value passed to
+// registerProvider, so other `zai/*` models keep any custom baseUrl the
+// user may have set.
+const GLM52_MODEL = {
+	id: MODEL_ID,
+	name: "GLM-5.2",
+	api: "openai-completions",
+	baseUrl: ZAI_CODING_BASE_URL,
+	reasoning: true,
+	input: ["text"] as ("text" | "image")[],
+	contextWindow: 1_000_000,
+	maxTokens: 131_072,
+	cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 },
+	thinkingLevelMap: {
+		minimal: null,
+		low: null,
+		medium: null,
+		high: "high",
+		xhigh: "max",
+	},
+	compat: {
+		supportsDeveloperRole: false,
+		supportsReasoningEffort: true,
+		thinkingFormat: "zai" as const,
+		zaiToolStream: true,
+	},
+};
+function isZaiGlm52(model: { provider: string; id: string } | undefined | null): boolean {
+	return !!model && model.provider === PROVIDER && model.id === MODEL_ID;
+}
+export default function (pi: ExtensionAPI) {
+	// Register Pi-idiomatic flags at factory load time, NOT inside
+	// session_start. registerFlag is static setup; calling it per session
+	// would clobber user preferences on every /new or /reload.
+	pi.registerFlag("glm-budget-nudge", {
+		description: "Inject a soft thinking-budget system-prompt fragment and intra-loop ratchet for zai/glm-5.2.",
+		type: "boolean",
+		default: true,
+	});
+	pi.registerFlag("glm-clear-thinking", {
+		description: "Force clear_thinking=true on zai/glm-5.2 requests to prevent cross-turn reasoning_content carryover on the coding endpoint.",
+		type: "boolean",
+		default: true,
+	});
+	pi.registerFlag("glm-quick-disable", {
+		description: "Disable thinking on short user prompts (<80 chars) to save tokens on trivial turns.",
+		type: "boolean",
+		default: true,
+	});
+	// Per-loop mutable state. Node.js runs the extension hooks single-
+	// threaded, so a closure-scoped object is safe and avoids re-reading
+	// flags + recomputing in every hook. Reset on every before_agent_start.
+	const loop: {
+		shortPrompt: boolean;
+		ratchetFired: boolean;
+	} = { shortPrompt: false, ratchetFired: false };
+	pi.on("session_start", async (_event, ctx) => {
+		// Build the full `zai` provider model list, patching only glm-5.2.
+		// registerProvider replaces ALL models for the provider when models
+		// are provided, so a single-entry list would silently drop
+		// glm-4.7, glm-5-turbo, glm-5.1, and any user-added zai entries.
+		const existing = ctx.modelRegistry.getAll().filter((m) => m.provider === PROVIDER);
+		if (existing.length === 0) return;
+		if (!existing.some((m) => m.id === MODEL_ID)) return;
+		// registerProvider requires apiKey (or oauth) when defining models,
+		// even for a provider that already has auth resolved. Pull the
+		// resolved key from the existing provider so we keep working
+		// whether the user used ZAI_API_KEY env, /login, or models.json
+		// apiKey.
+		const apiKey = await ctx.modelRegistry.getApiKeyForProvider(PROVIDER);
+		if (!apiKey) {
+			ctx.ui.notify(
+				"pi-glm-tweaks: ZAI auth not configured. Run `/login` or set ZAI_API_KEY to enable GLM-5.2 thinking tweaks.",
+				"warning",
+			);
+			return;
+		}
+		// Per-model spread preserves every original field (api, baseUrl,
+		// headers, compat extras) for non-target models. Only glm-5.2 gets
+		// the new thinkingLevelMap, baseUrl, and OpenAI-compat compat block.
+		// baseUrl is set at BOTH provider level (required by validation;
+		// satisfies the model-registry check) and per-model in GLM52_MODEL
+		// (per-model takes precedence at request time, so any custom
+		// baseUrl the user has on other `zai/*` models is preserved by
+		// the spread).
+		const models = existing.map((m) => (isZaiGlm52(m) ? GLM52_MODEL : { ...m }));
+		pi.registerProvider(PROVIDER, {
+			baseUrl: ZAI_CODING_BASE_URL,
+			apiKey,
+			models,
+		});
+	});
+	pi.on("before_agent_start", (event, ctx) => {
+		// Reset per-loop state at the start of each user turn. The other
+		// hooks read these to drive their per-turn behavior.
+		loop.shortPrompt = event.prompt.length < SHORT_PROMPT_THRESHOLD;
+		loop.ratchetFired = false;
+		if (!isZaiGlm52(ctx.model)) return {};
+		if (pi.getFlag("glm-budget-nudge") !== true) return {};
+		// Return the assembled prompt with our fragment appended. We must
+		// concat (not replace) — Pi's before_agent_start chaining means
+		// our systemPrompt replaces the upstream value, and other
+		// extensions downstream only see what we return.
+		return { systemPrompt: (event.systemPrompt ?? "") + BUDGET_FRAGMENT };
+	});
+	pi.on("context", (event, ctx) => {
+		if (!isZaiGlm52(ctx.model)) return {};
+		if (pi.getFlag("glm-budget-nudge") !== true) return {};
+		if (loop.ratchetFired) return {};
+		// Sum reasoning_content from assistant messages in the CURRENT
+		// agent loop only. Find the boundary by walking back to the last
+		// `role: "user"` message (the prompt that started this loop).
+		// toolResult / assistant / custom / etc. are not user role, so
+		// they don't reset the boundary. Without this scoping, a long
+		// session would fire the ratchet on the first LLM call of every
+		// new turn regardless of current-loop thinking.
+		let loopStart = event.messages.length - 1;
+		while (loopStart >= 0) {
+			const m = event.messages[loopStart] as { role?: string } | undefined;
+			if (m?.role === "user") break;
+			loopStart--;
+		}
+		let totalReasoning = 0;
+		for (let i = loopStart + 1; i < event.messages.length; i++) {
+			const m = event.messages[i];
+			if (typeof m !== "object" || m === null) continue;
+			const msg = m as { role?: string; reasoning_content?: unknown };
+			if (msg.role !== "assistant") continue;
+			if (typeof msg.reasoning_content === "string") {
+				totalReasoning += msg.reasoning_content.length;
+			}
+		}
+		if (totalReasoning < RATCHET_THRESHOLD_CHARS) return {};
+		loop.ratchetFired = true;
+		const hint = {
+			role: "user",
+			content:
+				"[system reminder: you've been thinking extensively without taking a tool call. Take a tool call now or wrap up your response.]",
+		};
+		return { messages: [...event.messages, hint as never] };
+	});
+	pi.on("before_provider_request", (event, ctx) => {
+		if (!isZaiGlm52(ctx.model)) return;
+		if (!event.payload || typeof event.payload !== "object") return;
+		const obj = event.payload as Record<string, unknown>;
+		const current = obj.thinking;
+		const thinking =
+			current && typeof current === "object" && !Array.isArray(current)
+				? { ...(current as Record<string, unknown>) }
+				: ({} as Record<string, unknown>);
+		let mutated = false;
+		// Force clear_thinking on every request. The coding endpoint
+		// defaults to preserved thinking (clear_thinking: false), which
+		// silently compounds reasoning_content across turns. Cost at
+		// $4.4/MTok output makes this materially expensive.
+		if (pi.getFlag("glm-clear-thinking") === true) {
+			thinking.clear_thinking = true;
+			mutated = true;
+		}
+		// Short-prompt quick-disable: trivial turns ("what time is it")
+		// don't need deep thinking. Force the kill switch by setting
+		// thinking.type="disabled" on the outgoing payload.
+		if (pi.getFlag("glm-quick-disable") === true && loop.shortPrompt) {
+			thinking.type = "disabled";
+			mutated = true;
+		}
+		if (mutated) {
+			obj.thinking = thinking;
+		}
+		return obj;
+	});
+	pi.on("model_select", (event, ctx) => {
+		if (!isZaiGlm52(event.model)) {
+			ctx.ui.setStatus("glm-thinking", undefined);
+			return;
+		}
+		// Auto-clamp if Pi's current level is one we hid for GLM-5.2.
+		// setThinkingLevel is a no-op if already at the requested level.
+		const current = pi.getThinkingLevel();
+		if (HIDDEN_LEVELS.has(current)) {
+			pi.setThinkingLevel("high");
+			ctx.ui.notify(
+				`GLM-5.2 thinking: "${current}" not supported. Switched to high (off | high | max).`,
+				"info",
+			);
+		}
+		ctx.ui.setStatus("glm-thinking", "thinking: off | high | max");
+	});
+}
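The loop scoping used by the `context` hook above can be isolated as a pure function, which makes the boundary rule easier to check by hand. This is a minimal sketch rather than the extension's code: `Msg` and `reasoningCharsInCurrentLoop` are hypothetical names, the message shape is assumed to carry `role` and an optional `reasoning_content`, and when no user message exists it counts from the first message.

```typescript
// Hypothetical message shape; the real Pi message type is richer.
interface Msg {
  role: string;
  reasoning_content?: unknown;
}

// Value taken from RATCHET_THRESHOLD_CHARS in the extension source.
const RATCHET_THRESHOLD_CHARS = 2_000;

// Sum reasoning_content length over assistant messages that come after
// the most recent user message, i.e. within the current agent loop.
function reasoningCharsInCurrentLoop(messages: readonly Msg[]): number {
  // lastIndexOf returns -1 when there is no user message, so slice(0)
  // then covers the whole array.
  const lastUser = messages.map((m) => m.role).lastIndexOf("user");
  return messages
    .slice(lastUser + 1)
    .filter((m) => m.role === "assistant")
    .reduce(
      (sum, m) =>
        sum + (typeof m.reasoning_content === "string" ? m.reasoning_content.length : 0),
      0,
    );
}

const history: Msg[] = [
  { role: "user" },
  { role: "assistant", reasoning_content: "x".repeat(1500) }, // earlier loop, ignored
  { role: "user" },
  { role: "assistant", reasoning_content: "y".repeat(1200) },
  { role: "toolResult" },
  { role: "assistant", reasoning_content: "z".repeat(900) },
];

const total = reasoningCharsInCurrentLoop(history);
console.log(total, total >= RATCHET_THRESHOLD_CHARS); // 2100 true
```

The 1500 characters from the earlier loop are excluded, which is the behavior the hook's comment describes: without the boundary, a long session would trip the ratchet on the first call of every new turn.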

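The payload handling in the `before_provider_request` hook of `extensions/index.ts` can likewise be sketched as a function over a plain object. This is an illustrative sketch under stated assumptions: `patchThinking` and `PatchOptions` are hypothetical names, and unlike the hook, which writes into `event.payload` directly, this version returns a copy so the input stays untouched.

```typescript
type Payload = Record<string, unknown>;

// Hypothetical options bag standing in for the two flags and per-loop state.
interface PatchOptions {
  clearThinking: boolean; // glm-clear-thinking
  quickDisable: boolean;  // glm-quick-disable
  shortPrompt: boolean;   // prompt shorter than 80 chars
}

// Return a copy of the payload with the `thinking` object patched.
function patchThinking(payload: Payload, opts: PatchOptions): Payload {
  const current = payload["thinking"];
  // Start from the existing thinking object when it is a plain object.
  const thinking: Record<string, unknown> =
    current !== null && typeof current === "object" && !Array.isArray(current)
      ? { ...(current as Record<string, unknown>) }
      : {};

  const disable = opts.quickDisable && opts.shortPrompt;
  if (opts.clearThinking) thinking["clear_thinking"] = true;
  if (disable) thinking["type"] = "disabled";

  // Leave the payload as-is when neither tweak applies.
  return opts.clearThinking || disable ? { ...payload, thinking } : payload;
}

const request: Payload = { model: "glm-5.2", thinking: { type: "enabled" } };
const patched = patchThinking(request, {
  clearThinking: true,
  quickDisable: true,
  shortPrompt: true,
});
console.log(JSON.stringify(patched["thinking"]));
```

Returning a copy is a design choice for the sketch only; it keeps the function easy to test without a live Pi session.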
package/package.json ADDED

@@ -0,0 +1,49 @@
+{
+  "name": "@estebanforge/pi-glm-tweaks",
+  "version": "1.0.0",
+  "description": "Pi-native tweaks for Z.AI's GLM-5.2. Restricts the Pi thinking-level UI to the three modes GLM-5.2 actually supports (off, high, max), wires the native thinkingFormat:\"zai\" wire translation, and auto-clamps hidden levels when the model is selected.",
+  "keywords": [
+    "pi-package",
+    "pi-extension",
+    "zai",
+    "z-ai",
+    "glm",
+    "glm-5.2",
+    "thinking",
+    "reasoning_effort",
+    "thinking-mode"
+  ],
+  "license": "MIT",
+  "author": {
+    "name": "EstebanForge",
+    "email": "esteban@attitude.cl",
+    "url": "https://actitud.xyz"
+  },
+  "repository": {
+    "type": "git",
+    "url": "git+https://github.com/EstebanForge/pi-glm-tweaks.git"
+  },
+  "type": "module",
+  "files": [
+    "extensions",
+    "README.md"
+  ],
+  "pi": {
+    "extensions": [
+      "./extensions"
+    ]
+  },
+  "peerDependencies": {
+    "@earendil-works/pi-coding-agent": "*"
+  },
+  "peerDependenciesMeta": {
+    "@earendil-works/pi-coding-agent": {
+      "optional": true
+    }
+  },
+  "devDependencies": {
+    "@earendil-works/pi-coding-agent": "^0.80.2",
+    "@types/node": "^22.0.0",
+    "typescript": "^5.8.0"
+  }
+}