@possumtech/rummy 2.1.0 → 2.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (140)
  1. package/.env.example +40 -15
  2. package/.xai.key +1 -0
  3. package/PLUGINS.md +169 -53
  4. package/README.md +38 -32
  5. package/SPEC.md +366 -179
  6. package/bin/digest.js +1097 -0
  7. package/biome/no-fallbacks.grit +2 -2
  8. package/gemini.key +1 -0
  9. package/lang/en.json +10 -1
  10. package/migrations/001_initial_schema.sql +9 -2
  11. package/package.json +19 -8
  12. package/service.js +1 -0
  13. package/src/agent/AgentLoop.js +76 -26
  14. package/src/agent/ContextAssembler.js +2 -0
  15. package/src/agent/Entries.js +238 -60
  16. package/src/agent/ProjectAgent.js +44 -0
  17. package/src/agent/TurnExecutor.js +99 -30
  18. package/src/agent/XmlParser.js +206 -111
  19. package/src/agent/errors.js +35 -0
  20. package/src/agent/known_queries.sql +1 -1
  21. package/src/agent/known_store.sql +3 -42
  22. package/src/agent/materializeContext.js +30 -1
  23. package/src/agent/runs.sql +8 -18
  24. package/src/agent/tokens.js +0 -1
  25. package/src/agent/turns.sql +1 -0
  26. package/src/hooks/Hooks.js +26 -0
  27. package/src/hooks/RummyContext.js +12 -1
  28. package/src/lib/hedberg/README.md +60 -0
  29. package/src/lib/hedberg/hedberg.js +60 -0
  30. package/src/lib/hedberg/marker.js +158 -0
  31. package/src/{plugins → lib}/hedberg/matcher.js +1 -2
  32. package/src/llm/LlmProvider.js +41 -3
  33. package/src/llm/openaiStream.js +17 -0
  34. package/src/plugins/ask_user/ask_user.js +12 -2
  35. package/src/plugins/ask_user/ask_userDoc.md +1 -5
  36. package/src/plugins/budget/README.md +29 -24
  37. package/src/plugins/budget/budget.js +166 -110
  38. package/src/plugins/cli/README.md +3 -4
  39. package/src/plugins/cli/cli.js +31 -5
  40. package/src/plugins/cloudflare/cloudflare.js +136 -0
  41. package/src/plugins/cp/cp.js +41 -4
  42. package/src/plugins/cp/cpDoc.md +5 -6
  43. package/src/plugins/engine/engine.sql +1 -1
  44. package/src/plugins/env/README.md +5 -4
  45. package/src/plugins/env/env.js +7 -4
  46. package/src/plugins/env/envDoc.md +7 -8
  47. package/src/plugins/error/error.js +56 -15
  48. package/src/plugins/file/README.md +12 -3
  49. package/src/plugins/file/file.js +2 -2
  50. package/src/plugins/get/get.js +59 -36
  51. package/src/plugins/get/getDoc.md +10 -34
  52. package/src/plugins/google/google.js +115 -0
  53. package/src/plugins/hedberg/hedberg.js +13 -56
  54. package/src/plugins/helpers.js +66 -12
  55. package/src/plugins/index.js +1 -2
  56. package/src/plugins/instructions/README.md +44 -47
  57. package/src/plugins/instructions/instructions-system.md +44 -0
  58. package/src/plugins/instructions/instructions-user.md +53 -0
  59. package/src/plugins/instructions/instructions.js +58 -189
  60. package/src/plugins/known/README.md +6 -7
  61. package/src/plugins/known/known.js +24 -30
  62. package/src/plugins/log/log.js +41 -32
  63. package/src/plugins/mv/mv.js +40 -1
  64. package/src/plugins/mv/mvDoc.md +1 -8
  65. package/src/plugins/ollama/ollama.js +4 -3
  66. package/src/plugins/openai/openai.js +4 -3
  67. package/src/plugins/openrouter/openrouter.js +14 -4
  68. package/src/plugins/persona/README.md +11 -13
  69. package/src/plugins/persona/default.md +29 -0
  70. package/src/plugins/persona/persona.js +10 -66
  71. package/src/plugins/policy/policy.js +23 -22
  72. package/src/plugins/prompt/README.md +37 -27
  73. package/src/plugins/prompt/prompt.js +13 -19
  74. package/src/plugins/rm/rm.js +18 -0
  75. package/src/plugins/rm/rmDoc.md +5 -6
  76. package/src/plugins/rpc/rpc.js +3 -3
  77. package/src/plugins/set/set.js +205 -323
  78. package/src/plugins/set/setDoc.md +47 -17
  79. package/src/plugins/sh/README.md +6 -5
  80. package/src/plugins/sh/sh.js +8 -5
  81. package/src/plugins/sh/shDoc.md +7 -8
  82. package/src/plugins/skill/README.md +37 -14
  83. package/src/plugins/skill/skill.js +200 -101
  84. package/src/plugins/skill/skillDoc.js +3 -0
  85. package/src/plugins/skill/skillDoc.md +9 -0
  86. package/src/plugins/stream/README.md +7 -6
  87. package/src/plugins/stream/finalize.js +100 -0
  88. package/src/plugins/stream/stream.js +13 -45
  89. package/src/plugins/telemetry/telemetry.js +27 -4
  90. package/src/plugins/think/think.js +2 -3
  91. package/src/plugins/think/thinkDoc.md +2 -4
  92. package/src/plugins/unknown/README.md +1 -1
  93. package/src/plugins/unknown/unknown.js +17 -19
  94. package/src/plugins/update/update.js +4 -51
  95. package/src/plugins/update/updateDoc.md +21 -6
  96. package/src/plugins/xai/xai.js +68 -102
  97. package/src/plugins/yolo/yolo.js +102 -75
  98. package/src/sql/functions/hedmatch.js +1 -1
  99. package/src/sql/functions/hedreplace.js +1 -1
  100. package/src/sql/functions/hedsearch.js +1 -1
  101. package/src/sql/functions/slugify.js +16 -2
  102. package/BENCH_ENVIRONMENT.md +0 -230
  103. package/CLIENT_INTERFACE.md +0 -396
  104. package/last_run.txt +0 -5617
  105. package/scriptify/ask_run.js +0 -77
  106. package/scriptify/cache_probe.js +0 -66
  107. package/scriptify/cache_probe_grok.js +0 -74
  108. package/src/agent/budget.js +0 -33
  109. package/src/agent/config.js +0 -38
  110. package/src/plugins/hedberg/README.md +0 -71
  111. package/src/plugins/hedberg/docs.md +0 -0
  112. package/src/plugins/hedberg/edits.js +0 -55
  113. package/src/plugins/hedberg/normalize.js +0 -17
  114. package/src/plugins/hedberg/sed.js +0 -49
  115. package/src/plugins/instructions/instructions.md +0 -34
  116. package/src/plugins/instructions/instructions_104.md +0 -8
  117. package/src/plugins/instructions/instructions_105.md +0 -39
  118. package/src/plugins/instructions/instructions_106.md +0 -22
  119. package/src/plugins/instructions/instructions_107.md +0 -17
  120. package/src/plugins/instructions/instructions_108.md +0 -0
  121. package/src/plugins/known/knownDoc.js +0 -3
  122. package/src/plugins/known/knownDoc.md +0 -8
  123. package/src/plugins/unknown/unknownDoc.js +0 -3
  124. package/src/plugins/unknown/unknownDoc.md +0 -11
  125. package/turns/cli_1777462658211/turn_001.txt +0 -772
  126. package/turns/cli_1777462658211/turn_002.txt +0 -606
  127. package/turns/cli_1777462658211/turn_003.txt +0 -667
  128. package/turns/cli_1777462658211/turn_004.txt +0 -297
  129. package/turns/cli_1777462658211/turn_005.txt +0 -301
  130. package/turns/cli_1777462658211/turn_006.txt +0 -262
  131. package/turns/cli_1777465095132/turn_001.txt +0 -715
  132. package/turns/cli_1777465095132/turn_002.txt +0 -236
  133. package/turns/cli_1777465095132/turn_003.txt +0 -287
  134. package/turns/cli_1777465095132/turn_004.txt +0 -694
  135. package/turns/cli_1777465095132/turn_005.txt +0 -422
  136. package/turns/cli_1777465095132/turn_006.txt +0 -365
  137. package/turns/cli_1777465095132/turn_007.txt +0 -885
  138. package/turns/cli_1777465095132/turn_008.txt +0 -1277
  139. package/turns/cli_1777465095132/turn_009.txt +0 -736
  140. package/src/{plugins → lib}/hedberg/patterns.js +0 -0
package/src/plugins/yolo/yolo.js
@@ -1,5 +1,6 @@
  import { spawn } from "node:child_process";
  import { logPathToDataBase } from "../helpers.js";
+ import finalizeStream from "../stream/finalize.js";

  const SH_PATH_RE = /^log:\/\/turn_\d+\/(sh|env)\//;

@@ -16,7 +17,11 @@ export default class Yolo {
  // Resolve first so sh/env's post-accept seeds channels before we stream into them.
  await this.#serverResolve(rummy, p.path);
  if (SH_PATH_RE.test(p.path)) {
- await this.#executeShellProposal(rummy, p.path);
+ // Fire-and-forget: spawn returns and yolo never blocks. If
+ // the child outlives the loop, finalizeStream wakes the run
+ // with a fresh prompt so the agent gets a turn to react.
+ // SPEC #streaming_entries.
+ this.#executeShellProposal(rummy, p.path);
  }
  }
  }
@@ -66,94 +71,116 @@ export default class Yolo {
  await this.core.hooks.proposal.accepted.emit({ ...ctx, resolvedBody });
  }

- // Spawn locally and stream into {dataBase}_{1,2}; mirrors stream/stream-completed RPC.
- async #executeShellProposal(rummy, logPath) {
+ // Spawn locally and stream into {dataBase}_{1,2}; finalization (channel
+ // terminal states, log-body rewrite, dormant-run wake) is delegated to
+ // stream/finalize so yolo and external producers share one termination
+ // site. Fire-and-forget: spawn returns synchronously; the close handler
+ // runs whenever the child exits, regardless of whether the loop that
+ // proposed the <sh> is still alive.
+ #executeShellProposal(rummy, logPath) {
  const runId = rummy.runId;
  const entries = rummy.entries;
  const db = rummy.db;
- const runRow = await db.get_run_by_id.get({ id: runId });
- const project = await db.get_project_by_id.get({ id: runRow.project_id });
- const projectRoot = project?.project_root;
- if (!projectRoot) return;
+ const hooks = this.core.hooks;

- const attrs = await entries.getAttributes(runId, logPath);
- const command = attrs?.command || attrs?.summary;
- if (!command) return;
+ (async () => {
+ const runRow = await db.get_run_by_id.get({ id: runId });
+ const project = await db.get_project_by_id.get({
+ id: runRow.project_id,
+ });
+ const projectRoot = project?.project_root;
+ if (!projectRoot) return;

- const dataBase = logPathToDataBase(logPath);
- if (!dataBase) return;
- const stdoutPath = `${dataBase}_1`;
- const stderrPath = `${dataBase}_2`;
+ const attrs = await entries.getAttributes(runId, logPath);
+ const command = attrs?.command || attrs?.summary;
+ if (!command) return;

- const start = Date.now();
- const child = spawn("bash", ["-lc", command], {
- cwd: projectRoot,
- env: process.env,
- });
- // Buffer + write-once-on-exit; async appends would race the terminal-state transition.
- const stdoutChunks = [];
- const stderrChunks = [];
- child.stdout.on("data", (data) => stdoutChunks.push(data.toString()));
- child.stderr.on("data", (data) => stderrChunks.push(data.toString()));
+ const dataBase = logPathToDataBase(logPath);
+ if (!dataBase) return;
+ const stdoutPath = `${dataBase}_1`;
+ const stderrPath = `${dataBase}_2`;

- await new Promise((resolve) => {
- child.on("close", async (code) => {
- const stdoutBody = stdoutChunks.join("");
- const stderrBody = stderrChunks.join("");
- if (stdoutBody) {
- try {
- await entries.set({
- runId,
- path: stdoutPath,
- body: stdoutBody,
- append: true,
- });
- } catch {}
- }
- if (stderrBody) {
- try {
- await entries.set({
- runId,
- path: stderrPath,
- body: stderrBody,
- append: true,
- });
- } catch {}
+ const start = Date.now();
+ // Shell argv defaults to ["bash", "-lc"], a host-shell exec.
+ // `RUMMY_SHELL_ARGV` (JSON array) routes commands elsewhere:
+ // benchmark integrations set it to docker-exec into a per-task
+ // isolated container so the agent's `<sh>` runs without host
+ // filesystem access or network reach. Example:
+ // ["docker","exec","--workdir","/workspace","<cid>","bash","-lc"]
+ const argvJson = process.env.RUMMY_SHELL_ARGV;
+ const shellArgv = argvJson ? JSON.parse(argvJson) : ["bash", "-lc"];
+ // signal: AbortController kills the child if the user aborts
+ // the run. Children that simply outlive their loop are not
+ // killed — finalizeStream wakes the run when they close.
+ const child = spawn(shellArgv[0], [...shellArgv.slice(1), command], {
+ cwd: projectRoot,
+ env: process.env,
+ signal: rummy.signal ?? undefined,
+ });
+
+ // Append chunks via per-channel promise chains so concurrent
+ // appends don't race for body order in SQLite.
+ const stdoutRef = { value: Promise.resolve() };
+ const stderrRef = { value: Promise.resolve() };
+ const appendChunk = (path, body, queueRef) => {
+ queueRef.value = queueRef.value
+ .then(() => entries.set({ runId, path, body, append: true }))
+ .catch((err) => {
+ console.error(`[yolo] append to ${path} failed: ${err.message}`);
+ });
+ };
+ child.stdout.on("data", (data) => {
+ appendChunk(stdoutPath, data.toString(), stdoutRef);
+ });
+ child.stderr.on("data", (data) => {
+ appendChunk(stderrPath, data.toString(), stderrRef);
+ });
+
+ let launched = false;
+ child.on("spawn", () => {
+ launched = true;
+ });
+ // Launch failure (binary missing, cwd invalid): no "close"
+ // arrives, so finalize directly with a non-zero exit.
+ child.on("error", async (err) => {
+ if (launched) return;
+ try {
+ await finalizeStream({
+ db,
+ entries,
+ hooks,
+ runRow,
+ path: logPath,
+ exitCode: 127,
+ duration: `${Math.round((Date.now() - start) / 1000)}s`,
+ });
+ } catch (e) {
+ console.error(`[yolo] finalize on launch error failed: ${e.message}`);
  }
+ console.error(`[yolo] spawn failed: ${err.message}`);
+ });
+ child.on("close", async (code) => {
+ // Drain per-channel append queues before finalizing so the
+ // terminal-state write can't land before the last chunk.
+ await Promise.allSettled([stdoutRef.value, stderrRef.value]);
  const exitCode = code === null ? 130 : code;
  const duration = `${Math.round((Date.now() - start) / 1000)}s`;
- const terminalState = exitCode === 0 ? "resolved" : "failed";
- const outcome = exitCode === 0 ? null : `exit:${exitCode}`;
- // body=undefined preserves streamed content; body="" would wipe it.
- for (const path of [stdoutPath, stderrPath]) {
- try {
- await entries.set({
- runId,
- path,
- state: terminalState,
- outcome,
- });
- } catch {}
- }
  try {
- const channels = await entries.getEntriesByPattern(
- runId,
- `${dataBase}_*`,
- null,
- );
- const summary = channels
- .map((c) => `${c.path} (${c.tokens} tokens)`)
- .join(", ");
- const exitLabel = exitCode === 0 ? "exit=0" : `exit=${exitCode}`;
- await entries.set({
- runId,
+ await finalizeStream({
+ db,
+ entries,
+ hooks,
+ runRow,
  path: logPath,
- state: "resolved",
- body: `ran '${command}', ${exitLabel} (${duration}). Output: ${summary}`,
+ exitCode,
+ duration,
  });
- } catch {}
- resolve();
+ } catch (err) {
+ console.error(`[yolo] finalize failed: ${err.message}`);
+ }
  });
+ })().catch((err) => {
+ console.error(`[yolo] child lifecycle errored: ${err.message}`);
  });
  }
  }
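
The `RUMMY_SHELL_ARGV` override described in the new comments can be set from a harness before the run starts; a minimal sketch, using only what the comment above states (the container id and workdir are placeholders, not values shipped with the package):

```js
// Route every accepted <sh> proposal into a disposable container instead of the
// host shell by overriding the argv prefix that yolo prepends to the command.
// Placeholder values throughout; with RUMMY_SHELL_ARGV unset, yolo falls back to
// ["bash", "-lc"] on the host.
const containerId = "rummy-task-01"; // hypothetical container id

process.env.RUMMY_SHELL_ARGV = JSON.stringify([
  "docker", "exec",
  "--workdir", "/workspace",
  containerId,
  "bash", "-lc",
]);

// Per the diff above, yolo then spawns:
//   docker exec --workdir /workspace rummy-task-01 bash -lc "<command>"
```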
package/src/sql/functions/hedmatch.js
@@ -1,4 +1,4 @@
- import { hedmatch } from "../../plugins/hedberg/patterns.js";
+ import { hedmatch } from "../../lib/hedberg/patterns.js";

  export const deterministic = true;

package/src/sql/functions/hedreplace.js
@@ -1,4 +1,4 @@
- import { hedreplace } from "../../plugins/hedberg/patterns.js";
+ import { hedreplace } from "../../lib/hedberg/patterns.js";

  export const deterministic = true;

package/src/sql/functions/hedsearch.js
@@ -1,4 +1,4 @@
- import { hedsearch } from "../../plugins/hedberg/patterns.js";
+ import { hedsearch } from "../../lib/hedberg/patterns.js";

  export const deterministic = true;

package/src/sql/functions/slugify.js
@@ -2,15 +2,29 @@ import encodeSegment from "../../agent/pathEncode.js";

  export const deterministic = true;

- // commas→/, then encode-per-segment so / survives as separator.
+ // scheme separator `://` → `___` (three chars replaced, three underscores —
+ // visually distinctive, nobody writes triple-underscore in real paths or
+ // identifiers, so a `___` in a slug unambiguously signals "this was a
+ // scheme separator at write-time"). Round-trippable if a consumer ever
+ // wants to decode. Done BEFORE comma→/ and split so a path like
+ // `unknown://geography/x` slugs as `unknown___geography/x` instead of
+ // dropping a slash via `filter(Boolean)`.
+ //
+ // commas→/, then encode-per-segment so / survives as separator. Drop `.`
+ // and `..` segments — they're shell path-navigation noise that has no
+ // addressing value AND breaks picomatch globs (literal `.` is treated
+ // as a directory marker that `**` won't match across), so a command
+ // like `./executable --help` previously slugged to `./executable_--help`
+ // and made `sh://turn_N/**` queries miss it.
  // encodeSegment handles spaces→_ + URL-encode (single rule, used everywhere).
  export default function slugify(text) {
  if (!text) return "";
  return text
  .slice(0, 80)
+ .replace(/:\/\//g, "___")
  .replace(/,/g, "/")
  .split("/")
- .filter(Boolean)
+ .filter((seg) => seg && seg !== "." && seg !== "..")
  .map(encodeSegment)
  .join("/");
  }
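
The new slugify rules are easiest to sanity-check against concrete inputs. The expected outputs below are read directly off the diff, assuming `encodeSegment` does spaces→_ plus URL-encoding and nothing else (as the surviving comment says); the import specifier is illustrative and may not match the package's export map:

```js
import assert from "node:assert/strict";
// Illustrative path; adjust to wherever the package exposes this module.
import slugify from "@possumtech/rummy/src/sql/functions/slugify.js";

// Scheme separator becomes ___ instead of losing a slash to filter(Boolean).
assert.equal(slugify("unknown://geography/x"), "unknown___geography/x");

// "." segments are dropped, so sh://turn_N/** globs can still match the command.
assert.equal(slugify("./executable --help"), "executable_--help");

// Commas still become path separators; spaces become underscores per segment.
assert.equal(slugify("alpha,beta gamma"), "alpha/beta_gamma");
```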
package/BENCH_ENVIRONMENT.md
@@ -1,230 +0,0 @@
- # Bench Environment
-
- Hardware and software inventory for local-model rummy runs. Captured
- from live system probes; values to cite verbatim in any benchmark
- writeup. **Do not paraphrase.** Re-probe before publishing if the
- machine has been touched.
-
- Last verified: 2026-04-30.
-
- ---
-
- ## Hardware
-
- | | |
- |---|---|
- | GPU | **NVIDIA GeForce RTX 5070 Ti** (16 GB VRAM, GB203 Blackwell die) |
- | GPU driver | 595.71.05 (kernel module + userspace, matched as of 2026-04-29 module reload) |
- | Integrated GPU | Intel Arrow Lake-S iGPU (not used for inference) |
- | CPU | Intel Core Ultra 9 285 |
- | Cores | 24 (logical) |
- | RAM | 32 GB |
-
- **Source of truth:** `lspci | grep -E "vga\|3d"`, `cat /proc/driver/nvidia/version`,
- `grep model /proc/cpuinfo`, `nproc`, `grep MemTotal /proc/meminfo`.
-
- ---
-
- ## OS / kernel
-
- | | |
- |---|---|
- | Distro | Debian 13 (trixie) |
- | Kernel | 6.12.74+deb13+1-amd64 |
- | GCC | 14.2.0 |
-
- ---
-
- ## Inference engine
-
- | | |
- |---|---|
- | Server | `llama-server` (llama.cpp) |
- | Build | `b199-82209ef` (per `/props.build_info`) |
- | Local endpoint | `http://127.0.0.1:11435` (OpenAI-compatible) |
- | Public endpoint | `https://gemma.possumtech.com` (OpenAI-compatible; Cloudflare-fronted, SSL terminated on hawkbit AWS box, proxied via SSH reverse tunnel `hawkbit:5172 → hyzen:11435`. Toggleable via `systemctl --user disable --now gemma.service` on hyzen.) |
- | n_ctx | **32768** (runtime; model supports up to 262144) |
- | Binary | `/home/hyzen/repo/llama-mainline/build-fast/bin/llama-server` (custom rebuild — see "Build flags" below) |
- | Slots | 1 |
- | Default sampler | temperature 0.0, top_k 64, top_p 1.0, min_p 0.05 |
- | `n_predict` default | -1 (unbounded — fills remaining context) |
- | `reasoning_format` | none (model treated as content-only) |
-
- ---
-
- ## Loaded model
-
- | | |
- |---|---|
- | Filename | `macher.gguf` (local rename) |
- | Path | `/home/hyzen/repo/turbo/models/gemma/macher.gguf` |
- | File size | 13917726528 bytes (12.95 GiB / 13.92 GB) |
- | `general.name` | **Gemma 4 26B A4B It** |
- | `general.architecture` | gemma4 |
- | `general.basename` | gemma-4 |
- | `general.size_label` | 26B-A4B |
- | `general.finetune` | it (instruction-tuned) |
- | `general.license` | apache-2.0 |
- | `general.file_type` | 30 → **IQ4_XS** at 4.41 BPW (confirmed by `llama-server` load output: `print_info: file type = IQ4_XS - 4.25 bpw`; not Q3_K_XL despite filename hints in older `.env` templates) |
- | `general.quantization_version` | 2 |
- | Architecture: `expert_count` | 128 |
- | Architecture: `expert_used_count` | 8 (MoE — ~4B params active per token despite 26B total) |
- | Architecture: `block_count` | 30 |
- | Architecture: `embedding_length` | 2816 |
- | Architecture: `attention.head_count` | 16 |
- | Architecture: `feed_forward_length` | 2112 |
- | Native `context_length` | 262144 (256K) |
-
- **Note on quantization:** `general.file_type=30` maps to **IQ4_XS** in
- current llama.cpp (the file is mradermacher's imatrix quant of
- `google/gemma-4-26B-A4B-it`). The load output confirms this directly:
- `print_info: file type = IQ4_XS - 4.25 bpw`, `file size = 12.95 GiB
- (4.41 BPW)`. The earlier `Q3_K_XL` reference in `.env` templates was
- for a different file (`gemma-4-26B-A4B-it-UD-Q3_K_XL.gguf`) that has
- since been deleted from disk. Tensor breakdown from load: 392× F32,
- 1× Q6_K, 60× IQ4_NL, 205× IQ4_XS.
-
- **Note on chat template:** the GGUF was rewritten in-place on
- 2026-04-29 via `gguf_new_metadata.py` to embed the Apr-28 upstream
- official Google chat template (commit `4c55b528` of
- `google/gemma-4-26B-A4B-it`), which fixes SI / tool-call handling.
- Tensor data is byte-identical to the original mradermacher download;
- only `tokenizer.chat_template` changed (12045 → 16934 bytes), so
- file size grew by 4864 bytes. The `--chat-template-file` runtime
- flag is no longer needed and has been removed from ExecStart.
-
- ---
-
- ## Sampling parameters used by rummy
-
- Rummy's `openai` plugin (`src/plugins/openai/openai.js`) constructs
- its request body as `{ model, messages, think: true }`, optionally
- adding `temperature` if the caller passed one. **No `max_tokens`,
- no `stop`** — server defaults apply.
-
- The plugin then sends the request through the shared streaming
- client at `src/llm/openaiStream.js`, which spreads that body and
- adds `stream: true` and `stream_options: { include_usage: true }`.
- So the actual wire body is:
- `{ model, messages, think: true, [temperature], stream: true, stream_options: {include_usage:true} }`.
-
- Streaming is required, not optional: a non-streaming hold can
- exceed the Cloudflare-fronted edge's idle-timeout when the model
- spends seconds on extended reasoning before emitting visible
- content. The streaming wrapper exists specifically to keep bytes
- flowing through the proxy.
-
- `n_predict: -1` is in force (server default), so output can still
- grow until it hits the context limit and gets truncated. Under
- streaming, that truncation now manifests as a stalled / late-EOS
- stream rather than the all-at-once mid-token cutoff observed in
- the regex-log gemma run on 2026-04-29.
-
- ---
-
- ## Build flags (custom llama.cpp rebuild)
-
- The `llama-server` binary is a local rebuild with non-default flags
- that materially affect performance on Blackwell sm_120. Stock builds
- will produce slower numbers — readers reproducing should match these
- flags or note their stock-build numbers as such.
-
- | Flag | Setting | Why it matters |
- |---|---|---|
- | `CMAKE_CUDA_ARCHITECTURES` | `120` | Blackwell-targeted kernels |
- | `GGML_CUDA_FORCE_MMQ` | `ON` | Forces MMQ kernels for low-bit quants (default OFF) |
- | `GGML_CUDA_FA_ALL_QUANTS` | `ON` | Enables Flash Attention path for q8_0 KV cache (default OFF; without it, q8 KV falls back to a slow generic path) |
- | `GGML_CUDA_F16` | `ON` | fp16 intermediates |
- | `GGML_NATIVE` | `ON` | Native CPU arch tuning |
- | `CMAKE_BUILD_TYPE` | `Release` | |
-
- Source tree: `/home/hyzen/repo/llama-mainline` at commit `82209ef`.
- Build dir: `build-fast/`. Binary RUNPATH is baked to that absolute
- path; do not rename the directory.
-
- ---
-
- ## Service-level operational settings
-
- | | |
- |---|---|
- | systemd unit | `/etc/systemd/system/llama.service` |
- | `MemoryHigh` | 12 GB (host RAM soft cap) |
- | `MemoryMax` | 16 GB (host RAM hard cap; OOM-kill if exceeded) |
- | `MemorySwapMax` | 0 (process is forbidden from touching swap) |
- | `Restart` | `always`, `RestartSec=3` |
- | Daily restart timer | `llama-restart.timer` at 04:00 EDT ±30 min via `systemctl try-restart` |
-
- KV cache quantization is q8_0 (both K and V). Flash Attention is
- enabled. Gemma 4 sliding-window attention keeps KV at ~500 MiB
- even at 32k context (5 of 30 layers full-context, 25 SWA-capped;
- SWA window = 1024 tokens; pattern is full-context every 6th layer
- per `gemma4.attention.sliding_window_pattern`).
-
- Full ExecStart (for cite-verbatim purposes):
-
- ```
- /home/hyzen/repo/llama-mainline/build-fast/bin/llama-server \
- --model /home/hyzen/repo/turbo/models/gemma/macher.gguf \
- --ctx-size 32768 --parallel 1 \
- -fa on -ctk q8_0 -ctv q8_0 \
- -ngl 999 -b 1024 -ub 512 \
- -t 12 -tb 24 \
- --host 127.0.0.1 --port 11435 \
- --jinja --reasoning-budget 4096 \
- --cache-ram 4096 --cache-reuse 256 \
- --temp 0 --top-p 1.0 --repeat-penalty 1.0
- ```
-
- `--cache-ram 4096 --cache-reuse 256` enables the 4 GiB host-RAM
- prompt cache; first-token latency on warm cache is ~10× faster
- than cold. `--reasoning-budget 4096` caps the thinking phase at
- 4096 tokens before forcing the model into the answer phase.
-
- ---
-
- ## Measured single-stream baseline (this config)
-
- Captured from steady-state probes on 2026-04-29 (after warmup).
- Carried forward as of 2026-04-30: the subsequent changes
- (`--reasoning-budget`, `--cache-ram`, `--cache-reuse` added; chat
- template metadata rewritten in-place) do not touch tensor data,
- attention path, or sampler chain, so generation throughput is
- unaffected. Re-probe before publishing.
-
- | Metric | Value |
- |---|---|
- | Generation throughput | **~168 tokens/sec** at 32k ctx (~187 t/s at 16k ctx with FP16 KV) |
- | Per-token latency | ~5.95 ms/token |
- | Prompt eval, small prompts (warm cache) | ~900 t/s |
- | Prompt eval, large prompts (10k+ tokens) | ~5,600 t/s |
- | Time-to-first-token, 10k-token prompt | ~1.9 s |
- | VRAM at idle after model load | ~14.6 / 15.84 GB |
- | Theoretical bandwidth ceiling | 421 t/s (4B active × 4.25 bpw / 896 GB/s) |
- | Observed efficiency vs ceiling | ~40% |
-
- Sampling is deterministic at temp 0; numbers above are reproducible
- to <0.5% across trials within the same llama-server lifetime.
-
- ---
-
- ## How to re-probe
-
- ```bash
- # llama-server runtime
- curl -s http://127.0.0.1:11435/props | python3 -m json.tool | head -60
-
- # GGUF metadata
- # (parser script in this repo; or use `gguf-dump` if installed)
-
- # GPU
- lspci | grep -iE "vga|3d|display"
- cat /proc/driver/nvidia/version
-
- # CPU / RAM / OS
- grep -m1 "model name" /proc/cpuinfo
- nproc
- grep MemTotal /proc/meminfo
- grep PRETTY_NAME /etc/os-release
- uname -a
- ```
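
For reference, the request-body composition that the deleted "Sampling parameters used by rummy" section describes reduces to a small shape. A sketch only: the function name is invented for illustration, and the real logic lives in `src/plugins/openai/openai.js` plus `src/llm/openaiStream.js`:

```js
// Sketch of the wire body described in the deleted doc; not the package's API.
function buildWireBody({ model, messages, temperature }) {
  // Plugin body: no max_tokens, no stop; temperature only if the caller passed one.
  const body = { model, messages, think: true };
  if (temperature !== undefined) body.temperature = temperature;

  // Streaming wrapper: spread the plugin body and force streaming so bytes keep
  // flowing through the Cloudflare-fronted proxy during long reasoning holds.
  return {
    ...body,
    stream: true,
    stream_options: { include_usage: true },
  };
}
```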