@possumtech/rummy 2.1.0 → 2.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (140)
  1. package/.env.example +40 -15
  2. package/.xai.key +1 -0
  3. package/PLUGINS.md +169 -53
  4. package/README.md +38 -32
  5. package/SPEC.md +366 -179
  6. package/bin/digest.js +1097 -0
  7. package/biome/no-fallbacks.grit +2 -2
  8. package/gemini.key +1 -0
  9. package/lang/en.json +10 -1
  10. package/migrations/001_initial_schema.sql +9 -2
  11. package/package.json +19 -8
  12. package/service.js +1 -0
  13. package/src/agent/AgentLoop.js +76 -26
  14. package/src/agent/ContextAssembler.js +2 -0
  15. package/src/agent/Entries.js +238 -60
  16. package/src/agent/ProjectAgent.js +44 -0
  17. package/src/agent/TurnExecutor.js +99 -30
  18. package/src/agent/XmlParser.js +206 -111
  19. package/src/agent/errors.js +35 -0
  20. package/src/agent/known_queries.sql +1 -1
  21. package/src/agent/known_store.sql +3 -42
  22. package/src/agent/materializeContext.js +30 -1
  23. package/src/agent/runs.sql +8 -18
  24. package/src/agent/tokens.js +0 -1
  25. package/src/agent/turns.sql +1 -0
  26. package/src/hooks/Hooks.js +26 -0
  27. package/src/hooks/RummyContext.js +12 -1
  28. package/src/lib/hedberg/README.md +60 -0
  29. package/src/lib/hedberg/hedberg.js +60 -0
  30. package/src/lib/hedberg/marker.js +158 -0
  31. package/src/{plugins → lib}/hedberg/matcher.js +1 -2
  32. package/src/llm/LlmProvider.js +41 -3
  33. package/src/llm/openaiStream.js +17 -0
  34. package/src/plugins/ask_user/ask_user.js +12 -2
  35. package/src/plugins/ask_user/ask_userDoc.md +1 -5
  36. package/src/plugins/budget/README.md +29 -24
  37. package/src/plugins/budget/budget.js +166 -110
  38. package/src/plugins/cli/README.md +3 -4
  39. package/src/plugins/cli/cli.js +31 -5
  40. package/src/plugins/cloudflare/cloudflare.js +136 -0
  41. package/src/plugins/cp/cp.js +41 -4
  42. package/src/plugins/cp/cpDoc.md +5 -6
  43. package/src/plugins/engine/engine.sql +1 -1
  44. package/src/plugins/env/README.md +5 -4
  45. package/src/plugins/env/env.js +7 -4
  46. package/src/plugins/env/envDoc.md +7 -8
  47. package/src/plugins/error/error.js +56 -15
  48. package/src/plugins/file/README.md +12 -3
  49. package/src/plugins/file/file.js +2 -2
  50. package/src/plugins/get/get.js +59 -36
  51. package/src/plugins/get/getDoc.md +10 -34
  52. package/src/plugins/google/google.js +115 -0
  53. package/src/plugins/hedberg/hedberg.js +13 -56
  54. package/src/plugins/helpers.js +66 -12
  55. package/src/plugins/index.js +1 -2
  56. package/src/plugins/instructions/README.md +44 -47
  57. package/src/plugins/instructions/instructions-system.md +44 -0
  58. package/src/plugins/instructions/instructions-user.md +53 -0
  59. package/src/plugins/instructions/instructions.js +58 -189
  60. package/src/plugins/known/README.md +6 -7
  61. package/src/plugins/known/known.js +24 -30
  62. package/src/plugins/log/log.js +41 -32
  63. package/src/plugins/mv/mv.js +40 -1
  64. package/src/plugins/mv/mvDoc.md +1 -8
  65. package/src/plugins/ollama/ollama.js +4 -3
  66. package/src/plugins/openai/openai.js +4 -3
  67. package/src/plugins/openrouter/openrouter.js +14 -4
  68. package/src/plugins/persona/README.md +11 -13
  69. package/src/plugins/persona/default.md +29 -0
  70. package/src/plugins/persona/persona.js +10 -66
  71. package/src/plugins/policy/policy.js +23 -22
  72. package/src/plugins/prompt/README.md +37 -27
  73. package/src/plugins/prompt/prompt.js +13 -19
  74. package/src/plugins/rm/rm.js +18 -0
  75. package/src/plugins/rm/rmDoc.md +5 -6
  76. package/src/plugins/rpc/rpc.js +3 -3
  77. package/src/plugins/set/set.js +205 -323
  78. package/src/plugins/set/setDoc.md +47 -17
  79. package/src/plugins/sh/README.md +6 -5
  80. package/src/plugins/sh/sh.js +8 -5
  81. package/src/plugins/sh/shDoc.md +7 -8
  82. package/src/plugins/skill/README.md +37 -14
  83. package/src/plugins/skill/skill.js +200 -101
  84. package/src/plugins/skill/skillDoc.js +3 -0
  85. package/src/plugins/skill/skillDoc.md +9 -0
  86. package/src/plugins/stream/README.md +7 -6
  87. package/src/plugins/stream/finalize.js +100 -0
  88. package/src/plugins/stream/stream.js +13 -45
  89. package/src/plugins/telemetry/telemetry.js +27 -4
  90. package/src/plugins/think/think.js +2 -3
  91. package/src/plugins/think/thinkDoc.md +2 -4
  92. package/src/plugins/unknown/README.md +1 -1
  93. package/src/plugins/unknown/unknown.js +17 -19
  94. package/src/plugins/update/update.js +4 -51
  95. package/src/plugins/update/updateDoc.md +21 -6
  96. package/src/plugins/xai/xai.js +68 -102
  97. package/src/plugins/yolo/yolo.js +102 -75
  98. package/src/sql/functions/hedmatch.js +1 -1
  99. package/src/sql/functions/hedreplace.js +1 -1
  100. package/src/sql/functions/hedsearch.js +1 -1
  101. package/src/sql/functions/slugify.js +16 -2
  102. package/BENCH_ENVIRONMENT.md +0 -230
  103. package/CLIENT_INTERFACE.md +0 -396
  104. package/last_run.txt +0 -5617
  105. package/scriptify/ask_run.js +0 -77
  106. package/scriptify/cache_probe.js +0 -66
  107. package/scriptify/cache_probe_grok.js +0 -74
  108. package/src/agent/budget.js +0 -33
  109. package/src/agent/config.js +0 -38
  110. package/src/plugins/hedberg/README.md +0 -71
  111. package/src/plugins/hedberg/docs.md +0 -0
  112. package/src/plugins/hedberg/edits.js +0 -55
  113. package/src/plugins/hedberg/normalize.js +0 -17
  114. package/src/plugins/hedberg/sed.js +0 -49
  115. package/src/plugins/instructions/instructions.md +0 -34
  116. package/src/plugins/instructions/instructions_104.md +0 -8
  117. package/src/plugins/instructions/instructions_105.md +0 -39
  118. package/src/plugins/instructions/instructions_106.md +0 -22
  119. package/src/plugins/instructions/instructions_107.md +0 -17
  120. package/src/plugins/instructions/instructions_108.md +0 -0
  121. package/src/plugins/known/knownDoc.js +0 -3
  122. package/src/plugins/known/knownDoc.md +0 -8
  123. package/src/plugins/unknown/unknownDoc.js +0 -3
  124. package/src/plugins/unknown/unknownDoc.md +0 -11
  125. package/turns/cli_1777462658211/turn_001.txt +0 -772
  126. package/turns/cli_1777462658211/turn_002.txt +0 -606
  127. package/turns/cli_1777462658211/turn_003.txt +0 -667
  128. package/turns/cli_1777462658211/turn_004.txt +0 -297
  129. package/turns/cli_1777462658211/turn_005.txt +0 -301
  130. package/turns/cli_1777462658211/turn_006.txt +0 -262
  131. package/turns/cli_1777465095132/turn_001.txt +0 -715
  132. package/turns/cli_1777465095132/turn_002.txt +0 -236
  133. package/turns/cli_1777465095132/turn_003.txt +0 -287
  134. package/turns/cli_1777465095132/turn_004.txt +0 -694
  135. package/turns/cli_1777465095132/turn_005.txt +0 -422
  136. package/turns/cli_1777465095132/turn_006.txt +0 -365
  137. package/turns/cli_1777465095132/turn_007.txt +0 -885
  138. package/turns/cli_1777465095132/turn_008.txt +0 -1277
  139. package/turns/cli_1777465095132/turn_009.txt +0 -736
  140. package/src/{plugins → lib}/hedberg/patterns.js +0 -0
package/src/plugins/yolo/yolo.js
@@ -1,5 +1,6 @@
  import { spawn } from "node:child_process";
  import { logPathToDataBase } from "../helpers.js";
+ import finalizeStream from "../stream/finalize.js";

  const SH_PATH_RE = /^log:\/\/turn_\d+\/(sh|env)\//;

@@ -16,7 +17,11 @@ export default class Yolo {
  // Resolve first so sh/env's post-accept seeds channels before we stream into them.
  await this.#serverResolve(rummy, p.path);
  if (SH_PATH_RE.test(p.path)) {
- await this.#executeShellProposal(rummy, p.path);
+ // Fire-and-forget: spawn returns and yolo never blocks. If
+ // the child outlives the loop, finalizeStream wakes the run
+ // with a fresh prompt so the agent gets a turn to react.
+ // SPEC #streaming_entries.
+ this.#executeShellProposal(rummy, p.path);
  }
  }
  }
@@ -66,94 +71,116 @@ export default class Yolo {
  await this.core.hooks.proposal.accepted.emit({ ...ctx, resolvedBody });
  }

- // Spawn locally and stream into {dataBase}_{1,2}; mirrors stream/stream-completed RPC.
- async #executeShellProposal(rummy, logPath) {
+ // Spawn locally and stream into {dataBase}_{1,2}; finalization (channel
+ // terminal states, log-body rewrite, dormant-run wake) is delegated to
+ // stream/finalize so yolo and external producers share one termination
+ // site. Fire-and-forget: spawn returns synchronously; the close handler
+ // runs whenever the child exits, regardless of whether the loop that
+ // proposed the <sh> is still alive.
+ #executeShellProposal(rummy, logPath) {
  const runId = rummy.runId;
  const entries = rummy.entries;
  const db = rummy.db;
- const runRow = await db.get_run_by_id.get({ id: runId });
- const project = await db.get_project_by_id.get({ id: runRow.project_id });
- const projectRoot = project?.project_root;
- if (!projectRoot) return;
+ const hooks = this.core.hooks;

- const attrs = await entries.getAttributes(runId, logPath);
- const command = attrs?.command || attrs?.summary;
- if (!command) return;
+ (async () => {
+ const runRow = await db.get_run_by_id.get({ id: runId });
+ const project = await db.get_project_by_id.get({
+ id: runRow.project_id,
+ });
+ const projectRoot = project?.project_root;
+ if (!projectRoot) return;

- const dataBase = logPathToDataBase(logPath);
- if (!dataBase) return;
- const stdoutPath = `${dataBase}_1`;
- const stderrPath = `${dataBase}_2`;
+ const attrs = await entries.getAttributes(runId, logPath);
+ const command = attrs?.command || attrs?.summary;
+ if (!command) return;

- const start = Date.now();
- const child = spawn("bash", ["-lc", command], {
- cwd: projectRoot,
- env: process.env,
- });
- // Buffer + write-once-on-exit; async appends would race the terminal-state transition.
- const stdoutChunks = [];
- const stderrChunks = [];
- child.stdout.on("data", (data) => stdoutChunks.push(data.toString()));
- child.stderr.on("data", (data) => stderrChunks.push(data.toString()));
+ const dataBase = logPathToDataBase(logPath);
+ if (!dataBase) return;
+ const stdoutPath = `${dataBase}_1`;
+ const stderrPath = `${dataBase}_2`;

- await new Promise((resolve) => {
- child.on("close", async (code) => {
- const stdoutBody = stdoutChunks.join("");
- const stderrBody = stderrChunks.join("");
- if (stdoutBody) {
- try {
- await entries.set({
- runId,
- path: stdoutPath,
- body: stdoutBody,
- append: true,
- });
- } catch {}
- }
- if (stderrBody) {
- try {
- await entries.set({
- runId,
- path: stderrPath,
- body: stderrBody,
- append: true,
- });
- } catch {}
+ const start = Date.now();
+ // Shell argv defaults to ["bash", "-lc"], a host-shell exec.
+ // `RUMMY_SHELL_ARGV` (JSON array) routes commands elsewhere:
+ // benchmark integrations set it to docker-exec into a per-task
+ // isolated container so the agent's `<sh>` runs without host
+ // filesystem access or network reach. Example:
+ // ["docker","exec","--workdir","/workspace","<cid>","bash","-lc"]
+ const argvJson = process.env.RUMMY_SHELL_ARGV;
+ const shellArgv = argvJson ? JSON.parse(argvJson) : ["bash", "-lc"];
+ // signal: AbortController kills the child if the user aborts
+ // the run. Children that simply outlive their loop are not
+ // killed — finalizeStream wakes the run when they close.
+ const child = spawn(shellArgv[0], [...shellArgv.slice(1), command], {
+ cwd: projectRoot,
+ env: process.env,
+ signal: rummy.signal ?? undefined,
+ });
+
+ // Append chunks via per-channel promise chains so concurrent
+ // appends don't race for body order in SQLite.
+ const stdoutRef = { value: Promise.resolve() };
+ const stderrRef = { value: Promise.resolve() };
+ const appendChunk = (path, body, queueRef) => {
+ queueRef.value = queueRef.value
+ .then(() => entries.set({ runId, path, body, append: true }))
+ .catch((err) => {
+ console.error(`[yolo] append to ${path} failed: ${err.message}`);
+ });
+ };
+ child.stdout.on("data", (data) => {
+ appendChunk(stdoutPath, data.toString(), stdoutRef);
+ });
+ child.stderr.on("data", (data) => {
+ appendChunk(stderrPath, data.toString(), stderrRef);
+ });
+
+ let launched = false;
+ child.on("spawn", () => {
+ launched = true;
+ });
+ // Launch failure (binary missing, cwd invalid): no "close"
+ // arrives, so finalize directly with a non-zero exit.
+ child.on("error", async (err) => {
+ if (launched) return;
+ try {
+ await finalizeStream({
+ db,
+ entries,
+ hooks,
+ runRow,
+ path: logPath,
+ exitCode: 127,
+ duration: `${Math.round((Date.now() - start) / 1000)}s`,
+ });
+ } catch (e) {
+ console.error(`[yolo] finalize on launch error failed: ${e.message}`);
  }
+ console.error(`[yolo] spawn failed: ${err.message}`);
+ });
+ child.on("close", async (code) => {
+ // Drain per-channel append queues before finalizing so the
+ // terminal-state write can't land before the last chunk.
+ await Promise.allSettled([stdoutRef.value, stderrRef.value]);
  const exitCode = code === null ? 130 : code;
  const duration = `${Math.round((Date.now() - start) / 1000)}s`;
- const terminalState = exitCode === 0 ? "resolved" : "failed";
- const outcome = exitCode === 0 ? null : `exit:${exitCode}`;
- // body=undefined preserves streamed content; body="" would wipe it.
- for (const path of [stdoutPath, stderrPath]) {
- try {
- await entries.set({
- runId,
- path,
- state: terminalState,
- outcome,
- });
- } catch {}
- }
  try {
- const channels = await entries.getEntriesByPattern(
- runId,
- `${dataBase}_*`,
- null,
- );
- const summary = channels
- .map((c) => `${c.path} (${c.tokens} tokens)`)
- .join(", ");
- const exitLabel = exitCode === 0 ? "exit=0" : `exit=${exitCode}`;
- await entries.set({
- runId,
+ await finalizeStream({
+ db,
+ entries,
+ hooks,
+ runRow,
  path: logPath,
- state: "resolved",
- body: `ran '${command}', ${exitLabel} (${duration}). Output: ${summary}`,
+ exitCode,
+ duration,
  });
- } catch {}
- resolve();
+ } catch (err) {
+ console.error(`[yolo] finalize failed: ${err.message}`);
+ }
  });
+ })().catch((err) => {
+ console.error(`[yolo] child lifecycle errored: ${err.message}`);
  });
  }
  }
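
The `RUMMY_SHELL_ARGV` override described in the new comments can be set from a harness before the run starts; a minimal sketch, using only what the comment above states (the container id and workdir are placeholders, not values shipped with the package):

```js
// Route every accepted <sh> proposal into a disposable container instead of the
// host shell by overriding the argv prefix that yolo prepends to the command.
// Placeholder values throughout; with RUMMY_SHELL_ARGV unset, yolo falls back to
// ["bash", "-lc"] on the host.
const containerId = "rummy-task-01"; // hypothetical container id

process.env.RUMMY_SHELL_ARGV = JSON.stringify([
  "docker", "exec",
  "--workdir", "/workspace",
  containerId,
  "bash", "-lc",
]);

// Per the diff above, yolo then spawns:
//   docker exec --workdir /workspace rummy-task-01 bash -lc "<command>"
```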
package/src/sql/functions/hedmatch.js
@@ -1,4 +1,4 @@
- import { hedmatch } from "../../plugins/hedberg/patterns.js";
+ import { hedmatch } from "../../lib/hedberg/patterns.js";

  export const deterministic = true;

package/src/sql/functions/hedreplace.js
@@ -1,4 +1,4 @@
- import { hedreplace } from "../../plugins/hedberg/patterns.js";
+ import { hedreplace } from "../../lib/hedberg/patterns.js";

  export const deterministic = true;

package/src/sql/functions/hedsearch.js
@@ -1,4 +1,4 @@
- import { hedsearch } from "../../plugins/hedberg/patterns.js";
+ import { hedsearch } from "../../lib/hedberg/patterns.js";

  export const deterministic = true;

package/src/sql/functions/slugify.js
@@ -2,15 +2,29 @@ import encodeSegment from "../../agent/pathEncode.js";

  export const deterministic = true;

- // commas→/, then encode-per-segment so / survives as separator.
+ // scheme separator `://` → `___` (three chars replaced, three underscores —
+ // visually distinctive, nobody writes triple-underscore in real paths or
+ // identifiers, so a `___` in a slug unambiguously signals "this was a
+ // scheme separator at write-time"). Round-trippable if a consumer ever
+ // wants to decode. Done BEFORE comma→/ and split so a path like
+ // `unknown://geography/x` slugs as `unknown___geography/x` instead of
+ // dropping a slash via `filter(Boolean)`.
+ //
+ // commas→/, then encode-per-segment so / survives as separator. Drop `.`
+ // and `..` segments — they're shell path-navigation noise that has no
+ // addressing value AND breaks picomatch globs (literal `.` is treated
+ // as a directory marker that `**` won't match across), so a command
+ // like `./executable --help` previously slugged to `./executable_--help`
+ // and made `sh://turn_N/**` queries miss it.
  // encodeSegment handles spaces→_ + URL-encode (single rule, used everywhere).
  export default function slugify(text) {
  if (!text) return "";
  return text
  .slice(0, 80)
+ .replace(/:\/\//g, "___")
  .replace(/,/g, "/")
  .split("/")
- .filter(Boolean)
+ .filter((seg) => seg && seg !== "." && seg !== "..")
  .map(encodeSegment)
  .join("/");
  }
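
The new slugify rules are easiest to sanity-check against concrete inputs. The expected outputs below are read directly off the diff, assuming `encodeSegment` does spaces→_ plus URL-encoding and nothing else (as the surviving comment says); the import specifier is illustrative and may not match the package's export map:

```js
import assert from "node:assert/strict";
// Illustrative path; adjust to wherever the package exposes this module.
import slugify from "@possumtech/rummy/src/sql/functions/slugify.js";

// Scheme separator becomes ___ instead of losing a slash to filter(Boolean).
assert.equal(slugify("unknown://geography/x"), "unknown___geography/x");

// "." segments are dropped, so sh://turn_N/** globs can still match the command.
assert.equal(slugify("./executable --help"), "executable_--help");

// Commas still become path separators; spaces become underscores per segment.
assert.equal(slugify("alpha,beta gamma"), "alpha/beta_gamma");
```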
package/BENCH_ENVIRONMENT.md
@@ -1,230 +0,0 @@
- # Bench Environment
-
- Hardware and software inventory for local-model rummy runs. Captured
- from live system probes; values to cite verbatim in any benchmark
- writeup. **Do not paraphrase.** Re-probe before publishing if the
- machine has been touched.
-
- Last verified: 2026-04-30.
-
- ---
-
- ## Hardware
-
- | | |
- |---|---|
- | GPU | **NVIDIA GeForce RTX 5070 Ti** (16 GB VRAM, GB203 Blackwell die) |
- | GPU driver | 595.71.05 (kernel module + userspace, matched as of 2026-04-29 module reload) |
- | Integrated GPU | Intel Arrow Lake-S iGPU (not used for inference) |
- | CPU | Intel Core Ultra 9 285 |
- | Cores | 24 (logical) |
- | RAM | 32 GB |
-
- **Source of truth:** `lspci | grep -E "vga\|3d"`, `cat /proc/driver/nvidia/version`,
- `grep model /proc/cpuinfo`, `nproc`, `grep MemTotal /proc/meminfo`.
-
- ---
-
- ## OS / kernel
-
- | | |
- |---|---|
- | Distro | Debian 13 (trixie) |
- | Kernel | 6.12.74+deb13+1-amd64 |
- | GCC | 14.2.0 |
-
- ---
-
- ## Inference engine
-
- | | |
- |---|---|
- | Server | `llama-server` (llama.cpp) |
- | Build | `b199-82209ef` (per `/props.build_info`) |
- | Local endpoint | `http://127.0.0.1:11435` (OpenAI-compatible) |
- | Public endpoint | `https://gemma.possumtech.com` (OpenAI-compatible; Cloudflare-fronted, SSL terminated on hawkbit AWS box, proxied via SSH reverse tunnel `hawkbit:5172 → hyzen:11435`. Toggleable via `systemctl --user disable --now gemma.service` on hyzen.) |
- | n_ctx | **32768** (runtime; model supports up to 262144) |
- | Binary | `/home/hyzen/repo/llama-mainline/build-fast/bin/llama-server` (custom rebuild — see "Build flags" below) |
- | Slots | 1 |
- | Default sampler | temperature 0.0, top_k 64, top_p 1.0, min_p 0.05 |
- | `n_predict` default | -1 (unbounded — fills remaining context) |
- | `reasoning_format` | none (model treated as content-only) |
-
- ---
-
- ## Loaded model
-
- | | |
- |---|---|
- | Filename | `macher.gguf` (local rename) |
- | Path | `/home/hyzen/repo/turbo/models/gemma/macher.gguf` |
- | File size | 13917726528 bytes (12.95 GiB / 13.92 GB) |
- | `general.name` | **Gemma 4 26B A4B It** |
- | `general.architecture` | gemma4 |
- | `general.basename` | gemma-4 |
- | `general.size_label` | 26B-A4B |
- | `general.finetune` | it (instruction-tuned) |
- | `general.license` | apache-2.0 |
- | `general.file_type` | 30 → **IQ4_XS** at 4.41 BPW (confirmed by `llama-server` load output: `print_info: file type = IQ4_XS - 4.25 bpw`; not Q3_K_XL despite filename hints in older `.env` templates) |
- | `general.quantization_version` | 2 |
- | Architecture: `expert_count` | 128 |
- | Architecture: `expert_used_count` | 8 (MoE — ~4B params active per token despite 26B total) |
- | Architecture: `block_count` | 30 |
- | Architecture: `embedding_length` | 2816 |
- | Architecture: `attention.head_count` | 16 |
- | Architecture: `feed_forward_length` | 2112 |
- | Native `context_length` | 262144 (256K) |
-
- **Note on quantization:** `general.file_type=30` maps to **IQ4_XS** in
- current llama.cpp (the file is mradermacher's imatrix quant of
- `google/gemma-4-26B-A4B-it`). The load output confirms this directly:
- `print_info: file type = IQ4_XS - 4.25 bpw`, `file size = 12.95 GiB
- (4.41 BPW)`. The earlier `Q3_K_XL` reference in `.env` templates was
- for a different file (`gemma-4-26B-A4B-it-UD-Q3_K_XL.gguf`) that has
- since been deleted from disk. Tensor breakdown from load: 392× F32,
- 1× Q6_K, 60× IQ4_NL, 205× IQ4_XS.
-
- **Note on chat template:** the GGUF was rewritten in-place on
- 2026-04-29 via `gguf_new_metadata.py` to embed the Apr-28 upstream
- official Google chat template (commit `4c55b528` of
- `google/gemma-4-26B-A4B-it`), which fixes SI / tool-call handling.
- Tensor data is byte-identical to the original mradermacher download;
- only `tokenizer.chat_template` changed (12045 → 16934 bytes), so
- file size grew by 4864 bytes. The `--chat-template-file` runtime
- flag is no longer needed and has been removed from ExecStart.
-
- ---
-
- ## Sampling parameters used by rummy
-
- Rummy's `openai` plugin (`src/plugins/openai/openai.js`) constructs
- its request body as `{ model, messages, think: true }`, optionally
- adding `temperature` if the caller passed one. **No `max_tokens`,
- no `stop`** — server defaults apply.
-
- The plugin then sends the request through the shared streaming
- client at `src/llm/openaiStream.js`, which spreads that body and
- adds `stream: true` and `stream_options: { include_usage: true }`.
- So the actual wire body is:
- `{ model, messages, think: true, [temperature], stream: true, stream_options: {include_usage:true} }`.
-
- Streaming is required, not optional: a non-streaming hold can
- exceed the Cloudflare-fronted edge's idle-timeout when the model
- spends seconds on extended reasoning before emitting visible
- content. The streaming wrapper exists specifically to keep bytes
- flowing through the proxy.
-
- `n_predict: -1` is in force (server default), so output can still
- grow until it hits the context limit and gets truncated. Under
- streaming, that truncation now manifests as a stalled / late-EOS
- stream rather than the all-at-once mid-token cutoff observed in
- the regex-log gemma run on 2026-04-29.
-
- ---
-
- ## Build flags (custom llama.cpp rebuild)
-
- The `llama-server` binary is a local rebuild with non-default flags
- that materially affect performance on Blackwell sm_120. Stock builds
- will produce slower numbers — readers reproducing should match these
- flags or note their stock-build numbers as such.
-
- | Flag | Setting | Why it matters |
- |---|---|---|
- | `CMAKE_CUDA_ARCHITECTURES` | `120` | Blackwell-targeted kernels |
- | `GGML_CUDA_FORCE_MMQ` | `ON` | Forces MMQ kernels for low-bit quants (default OFF) |
- | `GGML_CUDA_FA_ALL_QUANTS` | `ON` | Enables Flash Attention path for q8_0 KV cache (default OFF; without it, q8 KV falls back to a slow generic path) |
- | `GGML_CUDA_F16` | `ON` | fp16 intermediates |
- | `GGML_NATIVE` | `ON` | Native CPU arch tuning |
- | `CMAKE_BUILD_TYPE` | `Release` | |
-
- Source tree: `/home/hyzen/repo/llama-mainline` at commit `82209ef`.
- Build dir: `build-fast/`. Binary RUNPATH is baked to that absolute
- path; do not rename the directory.
-
- ---
-
- ## Service-level operational settings
-
- | | |
- |---|---|
- | systemd unit | `/etc/systemd/system/llama.service` |
- | `MemoryHigh` | 12 GB (host RAM soft cap) |
- | `MemoryMax` | 16 GB (host RAM hard cap; OOM-kill if exceeded) |
- | `MemorySwapMax` | 0 (process is forbidden from touching swap) |
- | `Restart` | `always`, `RestartSec=3` |
- | Daily restart timer | `llama-restart.timer` at 04:00 EDT ±30 min via `systemctl try-restart` |
-
- KV cache quantization is q8_0 (both K and V). Flash Attention is
- enabled. Gemma 4 sliding-window attention keeps KV at ~500 MiB
- even at 32k context (5 of 30 layers full-context, 25 SWA-capped;
- SWA window = 1024 tokens; pattern is full-context every 6th layer
- per `gemma4.attention.sliding_window_pattern`).
-
- Full ExecStart (for cite-verbatim purposes):
-
- ```
- /home/hyzen/repo/llama-mainline/build-fast/bin/llama-server \
- --model /home/hyzen/repo/turbo/models/gemma/macher.gguf \
- --ctx-size 32768 --parallel 1 \
- -fa on -ctk q8_0 -ctv q8_0 \
- -ngl 999 -b 1024 -ub 512 \
- -t 12 -tb 24 \
- --host 127.0.0.1 --port 11435 \
- --jinja --reasoning-budget 4096 \
- --cache-ram 4096 --cache-reuse 256 \
- --temp 0 --top-p 1.0 --repeat-penalty 1.0
- ```
-
- `--cache-ram 4096 --cache-reuse 256` enables the 4 GiB host-RAM
- prompt cache; first-token latency on warm cache is ~10× faster
- than cold. `--reasoning-budget 4096` caps the thinking phase at
- 4096 tokens before forcing the model into the answer phase.
-
- ---
-
- ## Measured single-stream baseline (this config)
-
- Captured from steady-state probes on 2026-04-29 (after warmup).
- Carried forward as of 2026-04-30: the subsequent changes
- (`--reasoning-budget`, `--cache-ram`, `--cache-reuse` added; chat
- template metadata rewritten in-place) do not touch tensor data,
- attention path, or sampler chain, so generation throughput is
- unaffected. Re-probe before publishing.
-
- | Metric | Value |
- |---|---|
- | Generation throughput | **~168 tokens/sec** at 32k ctx (~187 t/s at 16k ctx with FP16 KV) |
- | Per-token latency | ~5.95 ms/token |
- | Prompt eval, small prompts (warm cache) | ~900 t/s |
- | Prompt eval, large prompts (10k+ tokens) | ~5,600 t/s |
- | Time-to-first-token, 10k-token prompt | ~1.9 s |
- | VRAM at idle after model load | ~14.6 / 15.84 GB |
- | Theoretical bandwidth ceiling | 421 t/s (4B active × 4.25 bpw / 896 GB/s) |
- | Observed efficiency vs ceiling | ~40% |
-
- Sampling is deterministic at temp 0; numbers above are reproducible
- to <0.5% across trials within the same llama-server lifetime.
-
- ---
-
- ## How to re-probe
-
- ```bash
- # llama-server runtime
- curl -s http://127.0.0.1:11435/props | python3 -m json.tool | head -60
-
- # GGUF metadata
- # (parser script in this repo; or use `gguf-dump` if installed)
-
- # GPU
- lspci | grep -iE "vga|3d|display"
- cat /proc/driver/nvidia/version
-
- # CPU / RAM / OS
- grep -m1 "model name" /proc/cpuinfo
- nproc
- grep MemTotal /proc/meminfo
- grep PRETTY_NAME /etc/os-release
- uname -a
- ```
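
For reference, the request-body composition that the deleted "Sampling parameters used by rummy" section describes reduces to a small shape. A sketch only: the function name is invented for illustration, and the real logic lives in `src/plugins/openai/openai.js` plus `src/llm/openaiStream.js`:

```js
// Sketch of the wire body described in the deleted doc; not the package's API.
function buildWireBody({ model, messages, temperature }) {
  // Plugin body: no max_tokens, no stop; temperature only if the caller passed one.
  const body = { model, messages, think: true };
  if (temperature !== undefined) body.temperature = temperature;

  // Streaming wrapper: spread the plugin body and force streaming so bytes keep
  // flowing through the Cloudflare-fronted proxy during long reasoning holds.
  return {
    ...body,
    stream: true,
    stream_options: { include_usage: true },
  };
}
```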