@semalt-ai/code 1.19.0 → 1.20.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude/settings.local.json +2 -1
- package/ARCHITECTURE.md +6 -95
- package/CLAUDE.md +196 -1874
- package/README.md +1 -1
- package/docs/ARCHITECTURE.md +1321 -0
- package/docs/CONFIG.md +340 -0
- package/docs/HISTORY.md +245 -0
- package/index.js +1 -1
- package/lib/agent.js +145 -16
- package/lib/api.js +28 -3
- package/lib/commands/chat-session.js +187 -4
- package/lib/commands/chat-slash.js +16 -0
- package/lib/commands/chat-turn.js +272 -49
- package/lib/commands/chat.js +12 -8
- package/lib/config.js +27 -0
- package/lib/constants.js +30 -1
- package/lib/headless.js +36 -1
- package/lib/images.js +8 -2
- package/lib/permissions.js +23 -16
- package/lib/prompts.js +15 -3
- package/lib/tool_registry.js +357 -53
- package/lib/tool_specs.js +42 -8
- package/lib/tools.js +80 -19
- package/lib/ui/anim.js +86 -0
- package/lib/ui/ansi.js +17 -27
- package/lib/ui/chat-history.js +253 -71
- package/lib/ui/create-ui.js +67 -24
- package/lib/ui/diff.js +90 -25
- package/lib/ui/file-activity.js +236 -0
- package/lib/ui/format.js +173 -28
- package/lib/ui/input-field.js +5 -4
- package/lib/ui/md-stream.js +234 -0
- package/lib/ui/render-operation.js +113 -0
- package/lib/ui/select.js +1 -4
- package/lib/ui/status-bar.js +99 -57
- package/lib/ui/stream.js +20 -13
- package/lib/ui/theme.js +190 -45
- package/lib/ui/tool-operation.js +190 -0
- package/lib/ui/utils.js +9 -5
- package/lib/ui/web-activity.js +58 -6
- package/lib/ui/writer.js +159 -45
- package/lib/ui.js +1 -1
- package/package.json +1 -1
- package/test/anim-driver.test.js +153 -0
- package/test/ask-user-display.test.js +226 -0
- package/test/ask-user-gate.test.js +231 -0
- package/test/chat-history-nocolor.test.js +155 -0
- package/test/chat-relogin.test.js +207 -0
- package/test/defer-detail-band.test.js +403 -0
- package/test/detail-band-tab-flatten.test.js +242 -0
- package/test/exec-diff.test.js +268 -0
- package/test/executors.test.js +250 -13
- package/test/extract-tool-calls.test.js +37 -3
- package/test/file-activity.test.js +522 -0
- package/test/grep-path-target.test.js +227 -0
- package/test/harness/chat-harness.js +2 -1
- package/test/headless.test.js +146 -1
- package/test/input-field-ctrl-o.test.js +37 -0
- package/test/live-height-physical.test.js +281 -0
- package/test/max-iterations.test.js +9 -7
- package/test/md-stream.test.js +183 -0
- package/test/native-dispatch.test.js +53 -0
- package/test/native-live-narration.test.js +254 -0
- package/test/output-heredoc-leak.test.js +195 -0
- package/test/output-preview.test.js +245 -0
- package/test/permissions.test.js +199 -0
- package/test/read-paginate.test.js +1 -1
- package/test/render-operation.test.js +317 -0
- package/test/replay-descriptor-xml.test.js +216 -0
- package/test/replay-descriptor.test.js +189 -0
- package/test/replay-web-aggregate.test.js +291 -0
- package/test/replay-web-persist.test.js +241 -0
- package/test/running-glyph-anim.test.js +111 -0
- package/test/status-bar-driver.test.js +93 -0
- package/test/status-bar-resync.test.js +188 -0
- package/test/stream-parser.test.js +24 -0
- package/test/theme-palette.test.js +166 -0
- package/test/truncate-visible.test.js +78 -0
- package/test/view-image.test.js +199 -0
- package/test/web-activity-ordering.test.js +12 -3
- package/path +0 -1
|
@@ -0,0 +1,1321 @@
|
|
|
1
|
+
# semalt-code — Architecture (per-subsystem deep detail)
|
|
2
|
+
|
|
3
|
+
> Deep technical detail per subsystem. **Not auto-loaded** as project memory.
|
|
4
|
+
> The lean `CLAUDE.md` is the runtime-essential entry point; this file is for humans
|
|
5
|
+
> who need the internals. See also `docs/HISTORY.md` (rationale/history) and
|
|
6
|
+
> `docs/CONFIG.md` (config + CLI reference).
|
|
7
|
+
|
|
8
|
+
---
|
|
9
|
+
|
|
10
|
+
## MCP Boundary (`lib/mcp/boundary.js`, Task 3.2)
|
|
11
|
+
|
|
12
|
+
The single bridge between the CommonJS codebase and the ESM-only MCP SDK. It loads the
|
|
13
|
+
SDK via dynamic `import()` (memoized — evaluated at most once per process, lazily on
|
|
14
|
+
first use) and re-exposes a small async surface:
|
|
15
|
+
|
|
16
|
+
- `loadSdk()` → `{ Client, StdioClientTransport }` (the named exports we consume).
|
|
17
|
+
- `createClient(clientInfo?, options?)` → instantiates an MCP `Client` (does **not**
|
|
18
|
+
connect; transport + handshake are Task 3.3). Defaults `clientInfo` to this CLI's
|
|
19
|
+
`{ name, version }` and declares no capabilities.
|
|
20
|
+
- `createStdioTransport(params)` → a `StdioClientTransport` for a local server subprocess.
|
|
21
|
+
- `isSdkAvailable()` → synchronous resolvability check, used by the smoke test to **skip
|
|
22
|
+
gracefully** (never fail) when the dependency isn't installed (e.g. an offline runner).
|
|
23
|
+
- `DEFAULT_CLIENT_INFO`, `_reset()` (test seam).
|
|
24
|
+
|
|
25
|
+
**Invariant:** the SDK is imported **only** here. Anywhere else in the codebase, reach
|
|
26
|
+
MCP through this module and keep using `require()`. Do not migrate the project to ESM.
|
|
27
|
+
Smoke-tested by `test/mcp-boundary.test.js`.
|
|
28
|
+
|
|
29
|
+
As of Task 3.3 the boundary also builds **HTTP/SSE** transports
|
|
30
|
+
(`createStreamableHttpTransport`, `createSseTransport`) and merges caller `env`
|
|
31
|
+
over `getDefaultEnvironment()` for stdio so a launched server keeps PATH/HOME.
|
|
32
|
+
|
|
33
|
+
---
|
|
34
|
+
|
|
35
|
+
## MCP Client (`lib/mcp/client.js`, Task 3.3)
|
|
36
|
+
|
|
37
|
+
Connects to the MCP servers under `config.mcp.servers`, discovers each server's
|
|
38
|
+
tools, and registers them into the runtime tool registry under the namespace
|
|
39
|
+
**`mcp__<server>__<tool>`** so they dispatch through the *same* agent loop as
|
|
40
|
+
built-ins. The manager (`createMcpManager`) owns connect/discover/register,
|
|
41
|
+
per-server status, and shutdown.
|
|
42
|
+
|
|
43
|
+
- **Transports:** `stdio` (local subprocess) and `http`/`sse` (remote). Inferred
|
|
44
|
+
as `http` when a `url` is set and no `transport` is given.
|
|
45
|
+
- **Dynamic registry:** discovered tools are registered via the new dynamic API
|
|
46
|
+
in `lib/tool_registry.js` (`registerDynamicTool` / `dynamicToolEntries` /
|
|
47
|
+
`dynamicToolSpecs`). This set is **kept separate** from the static
|
|
48
|
+
`TOOL_REGISTRY` so the load-time parity check in `lib/constants.js` (which runs
|
|
49
|
+
before any server connects) is never affected. `entryForAction`/`fromInvoke`
|
|
50
|
+
consult dynamic tools *after* the static set, so a dynamic tool can never
|
|
51
|
+
shadow a built-in. Dynamic specs are merged into the native function-calling
|
|
52
|
+
`tools` array in `api.js`, and into the XML `extractToolCalls` pass.
|
|
53
|
+
- **Security posture (load-bearing):**
|
|
54
|
+
- MCP tool **results are untrusted** — `lib/agent.js` wraps `mcp__*` results in
|
|
55
|
+
the same `<<<UNTRUSTED_EXTERNAL_CONTENT>>>` fence used for `http_get`.
|
|
56
|
+
- MCP tool **results are token-capped before entering context (Task W.8)** —
|
|
57
|
+
`formatMcpResult` (`lib/agent.js`) caps the result text with `capToTokens` at
|
|
58
|
+
the **stricter** `mcp.max_result_tokens` budget (default **10000**) **before**
|
|
59
|
+
wrapping it in the fence, so a server returning a huge payload can't blow
|
|
60
|
+
context. The result's size is third-party-controlled, hence the stricter
|
|
61
|
+
budget; the truncation notice sits **inside** the fence with the capped
|
|
62
|
+
content and the untrusted perimeter is unchanged (capping never weakens it).
|
|
63
|
+
- MCP tools **require approval by default** — their permission descriptor is
|
|
64
|
+
non-null, so they are NOT auto-allowed by the `--allow-*` tiers. Opt-in per
|
|
65
|
+
server via `allow: ["toolA", …]` or `allowAll: true` in the server spec
|
|
66
|
+
(a matching tool's descriptor then returns null, like a read-only tool).
|
|
67
|
+
- **OAuth (`lib/mcp/oauth.js`):** remote servers with `oauth: true` get a
|
|
68
|
+
keychain-backed `OAuthClientProvider`. Tokens, the dynamically-registered
|
|
69
|
+
client info, and the PKCE verifier are stored in the OS keychain
|
|
70
|
+
(service `semalt-code-mcp`, namespaced per server) — **never in plaintext
|
|
71
|
+
config**, reusing the generic keychain helpers added to `lib/secrets.js`.
|
|
72
|
+
- **Graceful degradation:** a server that fails to launch/connect is recorded as
|
|
73
|
+
`failed` in status with its error, a warning is logged, and the CLI continues —
|
|
74
|
+
one bad server never blocks the others or crashes startup. A `disabled: true`
|
|
75
|
+
server is skipped entirely.
|
|
76
|
+
- **Management:** `semalt-code mcp list|status|add|remove|auth` (`lib/commands/mcp.js`)
|
|
77
|
+
and the in-chat `/mcp` status view. `mcp add` writes a server spec to config;
|
|
78
|
+
`mcp remove` deletes it and clears any stored OAuth material; `mcp auth` runs
|
|
79
|
+
the OAuth flow for a remote server.
|
|
80
|
+
|
|
81
|
+
**Scope: interactive chat only (load-bearing limitation).** `connectAll()` is invoked
|
|
82
|
+
in exactly two places — `cmdChat` (`lib/commands/chat.js`, the interactive session) and
|
|
83
|
+
the `mcp list|status` management commands (`lib/commands/index.js`). The one-shot/headless
|
|
84
|
+
entry points (`code`/`edit`/`shell` in `lib/commands/oneshot.js` and `-p/--print` via
|
|
85
|
+
`lib/headless.js`) **never construct an MCP manager**, so MCP tools are **not available**
|
|
86
|
+
in those modes — only built-in tools dispatch there. "MCP in headless / one-shot" is a
|
|
87
|
+
**deferred** item (see **Deferred / Not Yet Implemented**), not a bug.
|
|
88
|
+
|
|
89
|
+
**Config (`config.mcp.servers[name]`):** `transport` (`stdio`|`http`|`sse`),
|
|
90
|
+
`command`/`args`/`env`/`cwd` (stdio), `url`/`headers`/`oauth` (remote),
|
|
91
|
+
`allow`/`allowAll` (approval opt-in), `disabled`.
|
|
92
|
+
|
|
93
|
+
Tested by `test/mcp-client.test.js` (real SDK client ↔ a local mock stdio server
|
|
94
|
+
in `test/harness/mock-mcp-server.js`: discovery, namespacing, registry dispatch,
|
|
95
|
+
untrusted wrapping, approval-by-default + allow opt-in, graceful degradation) and
|
|
96
|
+
`test/mcp-oauth.test.js` (keychain token round-trip via an injected store).
|
|
97
|
+
|
|
98
|
+
---
|
|
99
|
+
|
|
100
|
+
|
|
101
|
+
## Agent Loop (`lib/agent.js`)
|
|
102
|
+
|
|
103
|
+
Iterations per user turn are capped (default **125**). The cap is overridable via
|
|
104
|
+
`--max-iterations <n>` / `config.max_iterations`; **`--max-iterations 0`** (or
|
|
105
|
+
`"unlimited"`) opts into a deliberately unbounded loop (power-user choice).
|
|
106
|
+
`DEFAULT_MAX_ITERATIONS` (`lib/constants.js`, = 125) is the single source of truth:
|
|
107
|
+
it seeds `DEFAULT_CONFIG.max_iterations` and is the factory default of
|
|
108
|
+
`runAgentLoop(...)`, so a caller that omits the value still gets a real cap rather
|
|
109
|
+
than `Infinity`. Entry points (`oneshot.js`, `chat-turn.js`, headless) resolve the
|
|
110
|
+
config value through `resolveMaxIterations()` (the `0` sentinel → `Infinity`).
|
|
111
|
+
When the cap is reached, the loop **stops gracefully**: it surfaces a clear,
|
|
112
|
+
user-visible warning naming the limit and how to raise it, returns
|
|
113
|
+
`stopReason: "max_iterations"`, and headless `json`/`stream-json` carry that
|
|
114
|
+
`stopReason` in their envelope (`"end_turn"` on a normal finish, `"verify_failed"`
|
|
115
|
+
when enforcing self-verification exhausts its attempts — see **Self-Verification**).
|
|
116
|
+
Subagents keep their own separate cap of 12 (`lib/subagents.js`).
|
|
117
|
+
|
|
118
|
+
At the loop's **natural end** (final answer, no tool calls — the agent declares
|
|
119
|
+
done), optional **self-verification** (Task 4.2, `lib/verify.js`) may run a
|
|
120
|
+
configured command before the turn is accepted; in enforcing mode a failing verify
|
|
121
|
+
returns the agent to the loop (bounded by `verify.max_attempts`). See
|
|
122
|
+
**Self-Verification** for the full contract.
|
|
123
|
+
|
|
124
|
+
```
|
|
125
|
+
1. Send messages[] to LLM via chatStream()
|
|
126
|
+
2. Stream response tokens to terminal (StreamRenderer)
|
|
127
|
+
3. After full response: extract tool-call tags from text
|
|
128
|
+
4. If no tool tags → done
|
|
129
|
+
5. For each tag: request user permission (once / always / no)
|
|
130
|
+
6. Execute approved operations via ToolExecutor (wrapped in try/catch)
|
|
131
|
+
7. Append tool results to messages[]
|
|
132
|
+
8. Goto 1
|
|
133
|
+
```
|
|
134
|
+
|
|
135
|
+
Each tool dispatch is wrapped in try/catch; errors print a warning and continue to the next tag rather than aborting the loop.
|
|
136
|
+
|
|
137
|
+
|
|
138
|
+
## Lifecycle Hooks (`lib/hooks.js`, Task 3.4)
|
|
139
|
+
|
|
140
|
+
Users map agent-lifecycle events to **shell commands** (or static **prompt** text)
|
|
141
|
+
under `config.hooks` (user + project, merged via Task 2.2). Events:
|
|
142
|
+
`PreToolUse`, `PostToolUse`, `UserPromptSubmit`, `Stop`, `PreCompact`.
|
|
143
|
+
|
|
144
|
+
- **Dispatch points** (`lib/agent.js`): `UserPromptSubmit` fires once before the
|
|
145
|
+
loop for the latest user prompt; `PreToolUse`/`PostToolUse` fire per tool call
|
|
146
|
+
(honoring an optional `matcher` against the tool tag); `Stop` fires once when a
|
|
147
|
+
turn ends (not on user abort). `PreCompact` fires in the compaction sites
|
|
148
|
+
(`chat-slash.js` `/compact`, `chat-turn.js` auto-compact) before summarizing.
|
|
149
|
+
- **Exit-code semantics:** a **non-zero** exit from a `PreToolUse` hook **blocks**
|
|
150
|
+
the tool — it does not run and the hook's stdout/stderr is fed back to the agent
|
|
151
|
+
as the reason (the loop continues with the next call). Exit **zero allows**; any
|
|
152
|
+
non-empty stdout (from any event) is surfaced to the agent as feedback.
|
|
153
|
+
- **Security (load-bearing):** hook commands are shell, so each is checked against
|
|
154
|
+
the Phase 0 **deny-list** (`lib/deny.js`) before running — a hit is skipped,
|
|
155
|
+
never run, and does not block the tool. Command hooks then run through the **same
|
|
156
|
+
OS sandbox** as every other shell call (Pre-Task 5.0a, `resolveSandboxedSpawn` in
|
|
157
|
+
`lib/sandbox.js`) with the identical fail-safe fallback (failIfUnavailable hard
|
|
158
|
+
error / human approval / refuse); a sandbox refusal is contained like a timeout
|
|
159
|
+
(not run, logged, does not block the tool). **Prompt** hooks execute no shell, so
|
|
160
|
+
the sandbox does not apply to them. Hook output entering the agent is
|
|
161
|
+
**untrusted** — fenced in the same `<<<UNTRUSTED_EXTERNAL_CONTENT>>>` delimiter
|
|
162
|
+
as `http_get`/MCP results (`lib/prompts.js` governs both).
|
|
163
|
+
- **Project can only NARROW (Pre-Task 5.0a):** a project-layer
|
|
164
|
+
(`.semalt/config.json`, attacker-controllable in a cloned repo) **command** hook
|
|
165
|
+
is **quarantined** — dropped before any runner sees it (`loadHookLayers`,
|
|
166
|
+
consumed by `lib/config.js loadConfig`), with a one-time warning. A project may
|
|
167
|
+
add only **prompt** hooks (text injection, already untrusted). User-layer
|
|
168
|
+
(`~/.semalt-ai`) hooks are trusted as before. The layers are read **separately**
|
|
169
|
+
(raw configs, not the shallow-merged view), mirroring `loadRuleLayers`.
|
|
170
|
+
- **Containment:** hooks run via `spawnSync` with a timeout (`timeout_ms`, default
|
|
171
|
+
30 s). Timeouts and any failure are contained — a bad hook logs and the loop
|
|
172
|
+
continues, never crashing.
|
|
173
|
+
- **Payload to hooks:** env vars (`SEMALT_HOOK_EVENT`, `SEMALT_TOOL_NAME`,
|
|
174
|
+
`SEMALT_TOOL_INPUT`, `SEMALT_TOOL_RESULT`, `SEMALT_USER_PROMPT`) plus a JSON
|
|
175
|
+
payload on stdin.
|
|
176
|
+
|
|
177
|
+
**Hook definition:** `{ type: "command"|"prompt", command|prompt, matcher?, timeout_ms? }`.
|
|
178
|
+
`matcher` (PreToolUse/PostToolUse) is `*`/absent = all, else a `|`-separated list
|
|
179
|
+
of anchored regexes matched against the tool tag (e.g. `"shell|exec"`, `"mcp__.*"`).
|
|
180
|
+
`createHookRunner({ getConfig, spawn?, log?, onUnsandboxed?, sandbox? })` is the
|
|
181
|
+
injectable dispatcher; `normalizeHooks`/`hookMatches`/`loadHookLayers` are pure.
|
|
182
|
+
Tested by `test/hooks.test.js` (unit, injected spawn + pass-through sandbox),
|
|
183
|
+
`test/hooks-agent.test.js` (real loop + mock-LLM + real spawn, sandbox off:
|
|
184
|
+
PreToolUse block, PostToolUse observe, UserPromptSubmit inject, deny-list skip,
|
|
185
|
+
failure containment, Stop firing), `test/hooks-verify-sandbox.test.js` (sandbox
|
|
186
|
+
routing: fallback refuse/hard-error/approve + REAL bwrap out-of-CWD block,
|
|
187
|
+
deny-list-before-sandbox, prompt-hook-unaffected), and
|
|
188
|
+
`test/config-quarantine.test.js` (project command-hook quarantine, prompt kept).
|
|
189
|
+
|
|
190
|
+
## Self-Verification (`lib/verify.js`, Task 4.2)
|
|
191
|
+
|
|
192
|
+
When the agent declares a task done (the loop's natural end — a final answer with
|
|
193
|
+
no tool calls), an optional configured **verify command** is run and its result
|
|
194
|
+
fed back. Two modes, **default advisory**:
|
|
195
|
+
|
|
196
|
+
- **advisory** (default): run the command once when the agent finishes, append the
|
|
197
|
+
fenced result to context as information, and **end the turn regardless** of
|
|
198
|
+
pass/fail. Advisory **never blocks**.
|
|
199
|
+
- **enforcing**: a pass ends the turn; a **failing** verify returns the agent to
|
|
200
|
+
the loop with the fenced result so it can fix the problem, and it cannot finish
|
|
201
|
+
until verify passes — **bounded** (see below).
|
|
202
|
+
|
|
203
|
+
**Bounding (load-bearing).** Enforcing has its own **verify-attempt limit**
|
|
204
|
+
(`max_attempts`, default 3) — a *precise* bound distinct from the coarse iteration
|
|
205
|
+
cap. After N failed verifies the loop terminates with the dedicated stop reason
|
|
206
|
+
**`verify_failed`** (not by grinding to `max_iterations`). So enforcing always
|
|
207
|
+
terminates via one of: verify-pass, the verify-attempt limit, or the iteration cap
|
|
208
|
+
— never unbounded.
|
|
209
|
+
|
|
210
|
+
**Verify is shell — treated like a hook** (`lib/verify.js` mirrors `lib/hooks.js`):
|
|
211
|
+
- **Deny-list first** — the command passes through the Phase 0 deny-list
|
|
212
|
+
(`lib/deny.js`) before running; a hit is refused (never run) and reported as a
|
|
213
|
+
non-passing verify.
|
|
214
|
+
- **OS sandbox (Pre-Task 5.0a)** — after the deny-list the command is wrapped by
|
|
215
|
+
the **same** OS sandbox as every other shell call (`resolveSandboxedSpawn`),
|
|
216
|
+
with the identical fail-safe fallback (failIfUnavailable hard error / human
|
|
217
|
+
approval / refuse). A sandbox refusal is reported as a non-passing verify —
|
|
218
|
+
never a silent unsandboxed run.
|
|
219
|
+
- **Project can only NARROW (Pre-Task 5.0a)** — a project-layer
|
|
220
|
+
(`.semalt/config.json`) `verify.command` is **quarantined** (`loadVerifyLayers`,
|
|
221
|
+
consumed by `lib/config.js loadConfig`, with a one-time warning): the effective
|
|
222
|
+
verify is the **user layer's**, full stop. A cloned repo can never introduce or
|
|
223
|
+
alter the executable verify command.
|
|
224
|
+
- **Timeout** — runs via `spawnSync` with `timeout_ms` (default 120 s). A hung
|
|
225
|
+
verify (e.g. a stuck `npm test`) is killed and treated as a **failed** verify —
|
|
226
|
+
it never hangs the agent.
|
|
227
|
+
- **Untrusted output** — the command output (a failing test name could carry an
|
|
228
|
+
injection) is fenced in the same `<<<UNTRUSTED_EXTERNAL_CONTENT>>>` delimiter as
|
|
229
|
+
hook/MCP/`http_get` output before it enters context.
|
|
230
|
+
- **Success is exit-code based** — exit == `expected_exit_code` (default 0) is a
|
|
231
|
+
pass. **stdout is never parsed** for success patterns (avoids brittleness).
|
|
232
|
+
- **Contained** — a spawn failure is a non-passing verify, never a crash.
|
|
233
|
+
|
|
234
|
+
**Config (`config.verify`):** `{ mode: "advisory"|"enforcing", command, timeout_ms,
|
|
235
|
+
expected_exit_code, max_attempts }`. Empty `command` → the feature is a **no-op**.
|
|
236
|
+
**`--no-verify`** is a one-off skip honored in both modes (→ `verifyStatus: skipped`).
|
|
237
|
+
|
|
238
|
+
**Surfacing.** `runAgentLoop` returns `verifyStatus` (`"skipped"|"passed"|"failed"`)
|
|
239
|
+
alongside `stopReason`; headless `json`/`stream-json` carry both in the envelope.
|
|
240
|
+
**Subagents never trigger verify** — it is a top-level gate on the user's task, so
|
|
241
|
+
child loops run with `noVerify: true`.
|
|
242
|
+
|
|
243
|
+
`normalizeVerify`, `createVerifyRunner` (now also accepting `onUnsandboxed?`/
|
|
244
|
+
`sandbox?`), and `loadVerifyLayers` are injectable/unit-testable. Tested by
|
|
245
|
+
`test/verify.test.js` (normalizer + runner with a pass-through sandbox: exit-code
|
|
246
|
+
success, custom expected code, deny-list refusal, timeout, no-op/skip, untrusted
|
|
247
|
+
fencing), `test/verify-agent.test.js` (real loop + mock-LLM + real spawn, sandbox
|
|
248
|
+
off: advisory feeds result and ends, enforcing pass, fail-then-pass re-entry,
|
|
249
|
+
exhaust→`verify_failed`, timeout, deny-list, `--no-verify`, no-command no-op,
|
|
250
|
+
headless `verifyStatus`), `test/hooks-verify-sandbox.test.js` (sandbox routing:
|
|
251
|
+
fallback refuse/hard-error/approve + REAL bwrap out-of-CWD block, deny-list-first),
|
|
252
|
+
and `test/config-quarantine.test.js` (project `verify.command` quarantine).
|
|
253
|
+
|
|
254
|
+
## Checkpoints & Rewind (`lib/checkpoints.js`, Task 4.3 / 4.3b)
|
|
255
|
+
|
|
256
|
+
Before each **file-tool mutation** the affected file's prior state is snapshotted
|
|
257
|
+
so `/rewind` (and `semalt-code rewind`) can restore it. Restoration is a straight
|
|
258
|
+
**content-restore** (write the prior bytes back, or delete a file that did not
|
|
259
|
+
exist before) — never a fragile reverse-diff replay. Task 4.3b adds
|
|
260
|
+
**restore-path guard re-validation** and **three restore modes**
|
|
261
|
+
(code/conversation/both) — see the two subsections at the end of this section.
|
|
262
|
+
|
|
263
|
+
Rewind is **human-only — there is NO rewind tool in the registry** (static,
|
|
264
|
+
dynamic, `TOOL_SPECS`, or `TAG_REGISTRY`), asserted by a test. A tool-triggerable
|
|
265
|
+
rewind would be a low-value escalation surface (an agent could rewind past a
|
|
266
|
+
newly-added `deny` rule); `/rewind` and `semalt-code rewind` are the only entries.
|
|
267
|
+
|
|
268
|
+
**Scope limit (load-bearing, surfaced to the user).** Checkpoints cover
|
|
269
|
+
**file-tool mutations only**: `write`, `append`, `edit_file`, `replace_in_file`,
|
|
270
|
+
`delete_file`, `move_file`, `copy_file`, `upload` (`CHECKPOINTABLE_ACTIONS`).
|
|
271
|
+
**Shell side effects are NOT reversible** — a command that created a file, touched
|
|
272
|
+
a DB, or hit the network is out of scope. `/rewind` output and the docs say so
|
|
273
|
+
plainly (`SCOPE_NOTICE`); a false sense of "full undo" is worse than no undo.
|
|
274
|
+
Directory ops (`make_dir`/`remove_dir`) are not snapshotted either.
|
|
275
|
+
|
|
276
|
+
**Capture point.** Capture happens in the executor (`agentExecFile`, `lib/tools.js`)
|
|
277
|
+
**after** the permission gate approves and **before** the mutation runs:
|
|
278
|
+
`beginCapture(action, args)` reads prior state pre-mutation; `commit()` runs only
|
|
279
|
+
on a `status:'ok'` result. So a **denied/withheld** call (refused at the gate, in
|
|
280
|
+
plan mode, or by the executor's own `--readonly`/sandbox/dry-run guards) produces
|
|
281
|
+
**no checkpoint**. Capture is **fail-safe**: a snapshot failure (disk full,
|
|
282
|
+
EACCES) warns and returns null — the mutation **still proceeds**, never blocked.
|
|
283
|
+
|
|
284
|
+
**Subagents are checkpointed into the parent session.** A subagent reuses the
|
|
285
|
+
parent's `agentExecFile`, so its mutations flow through the **same** store and are
|
|
286
|
+
rewindable from the parent. The subagent's child runner is built **without** a
|
|
287
|
+
`checkpoints` binding, so it never resets the turn linkage — a child's mutations
|
|
288
|
+
stay linked to the **parent's** current turn (the 4.2 inheritance finding, here
|
|
289
|
+
*wanted*).
|
|
290
|
+
|
|
291
|
+
**On-disk layout.** `~/.semalt-ai/checkpoints/<session>/<seq>.json`, one record
|
|
292
|
+
per mutation:
|
|
293
|
+
|
|
294
|
+
```json
|
|
295
|
+
{
|
|
296
|
+
"version": 1, "seq": 1, "session": "abcd1234", "ts": "…", "action": "write",
|
|
297
|
+
"turn": { "turnId": "turn-1", "promptId": "…", "promptIndex": 3, "messageCountAtStart": 4 },
|
|
298
|
+
"targets": [
|
|
299
|
+
{ "path": "/abs/p", "role": "primary", "existedBefore": true, "isDir": false,
|
|
300
|
+
"oversize": false, "rewindable": true, "priorContentB64": "…", "priorMode": 420,
|
|
301
|
+
"afterExists": true, "afterHash": "<sha256 of what the agent left>" }
|
|
302
|
+
],
|
|
303
|
+
"rewindable": true
|
|
304
|
+
}
|
|
305
|
+
```
|
|
306
|
+
|
|
307
|
+
**Conversation linkage (load-bearing).** Every checkpoint records its `turn`
|
|
308
|
+
linkage (`turnId`/`promptId`/`promptIndex`/`messageCountAtStart`) set by the agent
|
|
309
|
+
loop at turn start (`lib/agent.js`). **Task 4.3b builds conversation-rewind on
|
|
310
|
+
exactly this schema — the on-disk format was NOT changed** (a record written by
|
|
311
|
+
the 4.3 code still rewinds under 4.3b, asserted by a test). Do not remove these
|
|
312
|
+
fields.
|
|
313
|
+
|
|
314
|
+
**Delete & move reversal.** Each target is restored to its prior state generically:
|
|
315
|
+
`existedBefore` → write the prior bytes back (so **delete** recreates the file);
|
|
316
|
+
`!existedBefore` → remove the file if it now exists (so a **created** file is
|
|
317
|
+
deleted). A **move** records two targets — `move_src` (existed → restored to
|
|
318
|
+
origin) and `move_dst` (its prior state, deleted if it didn't exist) — so rewind
|
|
319
|
+
returns the file to its origin. A **copy** records only `copy_dst` (src untouched).
|
|
320
|
+
|
|
321
|
+
**External-modification integrity.** Each target stores the **after-state** the
|
|
322
|
+
agent left (existence + content hash). Before overwriting, `/rewind` compares the
|
|
323
|
+
current file against the **latest** agent-left after-state for that path (across
|
|
324
|
+
the session, so the agent's own later writes aren't mistaken for an out-of-band
|
|
325
|
+
edit). A file changed externally is **reported and NOT clobbered** — the rewind is
|
|
326
|
+
blocked with a `force` hint; `{ force: true }` (CLI/in-chat `force`) overrides.
|
|
327
|
+
|
|
328
|
+
**Retention + size cap (mandatory).** A per-file cap (`max_file_bytes`, default
|
|
329
|
+
5 MB): an oversize file (or a directory) is **not** snapshotted — recorded
|
|
330
|
+
`rewindable:false` so rewind reports it as unavailable rather than exhausting disk
|
|
331
|
+
— and the mutation still proceeds. A per-session retention cap (`max_per_session`,
|
|
332
|
+
default 100) prunes the oldest checkpoints. (Session-scoped now; the schema's `ts`
|
|
333
|
+
leaves room to move to time-based pruning later.)
|
|
334
|
+
|
|
335
|
+
**Surfacing.** Each commit and rewind writes a `checkpoint` row to the audit log
|
|
336
|
+
(`logCheckpoint`, `lib/audit.js`) and emits a `checkpoint:<seq>` log line.
|
|
337
|
+
`semalt-code rewind` targets the **most-recently-active session**
|
|
338
|
+
(`latestSession`); in chat `/rewind` uses the current session (the store's id is
|
|
339
|
+
realigned to the chat `session.id` at startup).
|
|
340
|
+
|
|
341
|
+
**Config (`config.checkpoints`):** `{ enabled (true), max_file_bytes (5 MB),
|
|
342
|
+
max_per_session (100) }` — normalized in `lib/config.js`. The store
|
|
343
|
+
(`createCheckpointStore`) is injectable (`fs`/`now`/`log`/`audit`/`rootDir`/
|
|
344
|
+
`restoreGuard`) and exhaustively unit-tested by `test/checkpoints.test.js`
|
|
345
|
+
(normalizer, capture pre-mutation, no-commit→no-checkpoint, restore, rewind-to-seq,
|
|
346
|
+
delete/move reversal, external-mod block+force, size cap, retention prune,
|
|
347
|
+
fail-safe, turn-linkage, scope notice, **+ 4.3b: guard re-validation, the three
|
|
348
|
+
modes, turn-boundary cutting, orphan-free native map, human-only, on-disk
|
|
349
|
+
unchanged**) and `test/checkpoints-agent.test.js` (real loop + mock-LLM: top-level
|
|
350
|
+
write checkpointed + rewound, denied call → no checkpoint, **subagent mutation
|
|
351
|
+
checkpointed in the parent session and rewindable**).
|
|
352
|
+
|
|
353
|
+
### Restore-path guard re-validation (Task 4.3b, Part 1)
|
|
354
|
+
|
|
355
|
+
The restore path does NOT blindly re-write the prior bytes. Before each
|
|
356
|
+
write/delete, the target path is re-checked against the **current** guards via an
|
|
357
|
+
injected `restoreGuard` (wired in `index.js` from the same primitives the executors
|
|
358
|
+
use): **`isPathSafe`** (CWD confinement / `--allow-anywhere`), the **secret-file
|
|
359
|
+
guard** (`isProtectedSecretPath`), the **protected-config write guard**
|
|
360
|
+
(`isProtectedConfigPath`, 5.0b), and any active **`deny` permission rule**
|
|
361
|
+
(`permissionManager.resolveRule`, 4.1). A target the guards now forbid — e.g. a
|
|
362
|
+
path that was inside the CWD at capture but is now covered by a `deny` rule, or
|
|
363
|
+
`--allow-anywhere` is no longer set — is **refused and skipped** (surfaced in the
|
|
364
|
+
`refused[]` result and the audit line), **not** aborting the rest of the rewind.
|
|
365
|
+
This holds whether or not `force` is used: **`force` overrides only the
|
|
366
|
+
external-modification check, never the guards.** A `restoreGuard` that throws fails
|
|
367
|
+
**closed** (refused). The store's own unit tests default `restoreGuard` to allow
|
|
368
|
+
(a no-op), preserving the 4.3 behavior when none is injected.
|
|
369
|
+
|
|
370
|
+
### Conversation-rewind + restore modes (Task 4.3b, Part 2)
|
|
371
|
+
|
|
372
|
+
`/rewind <seq> [code|conversation|both]` (default **both**); same syntax for
|
|
373
|
+
`semalt-code rewind`. The `turnId`/`promptId`/`promptIndex` linkage maps a
|
|
374
|
+
checkpoint to its conversation point.
|
|
375
|
+
|
|
376
|
+
- **code** — files only (the original 4.3 behavior); conversation untouched.
|
|
377
|
+
- **conversation** — session history only; files untouched.
|
|
378
|
+
- **both** (default) — files restored to the checkpoint **and** history truncated
|
|
379
|
+
to the matching turn together — the coherent linked state (code-without-
|
|
380
|
+
conversation leaves the agent amnesiac about how it got there;
|
|
381
|
+
conversation-without-code leaves it reasoning over stale files).
|
|
382
|
+
|
|
383
|
+
**Turn-boundary cutting (load-bearing).** Conversation restore truncates history
|
|
384
|
+
back to the **start of the turn** that produced the checkpoint — the cut always
|
|
385
|
+
lands on a `user` message boundary (`planConversationRewind` →
|
|
386
|
+
`locateTurnStart`/`snapToTurnBoundary`), **never mid-`tool_call`/`tool-result`
|
|
387
|
+
pair**, so the restored history has **no orphaned `tool_call`** and the native
|
|
388
|
+
function-calling map stays consistent (the 4.0c invariant; `findOrphanedToolCalls`
|
|
389
|
+
asserts it in tests, including a native-path case). Locating the turn is robust to
|
|
390
|
+
index shifts (compaction): it prefers matching the recorded `promptId` (a hash of
|
|
391
|
+
the turn's user prompt) and falls back to `promptIndex`/`messageCountAtStart`,
|
|
392
|
+
snapping a stale mid-turn index back to the boundary.
|
|
393
|
+
|
|
394
|
+
**Post-rewind message policy: DISCARD.** The messages after the rewind point are
|
|
395
|
+
**removed from active history** (the truncated array replaces `ctx.messages` in
|
|
396
|
+
chat, or the saved session file for `semalt-code rewind`). They are returned as
|
|
397
|
+
`conversation.removed` for transparency / optional archival but are **not retained**
|
|
398
|
+
by the store or re-persisted. The store never owns the conversation — `rewind`
|
|
399
|
+
takes the live `messages` array and **returns** the truncated `conversation.messages`
|
|
400
|
+
for the caller to apply.
|
|
401
|
+
|
|
402
|
+
## OS Sandbox (`lib/sandbox.js`, Task 4.4 / 4.4b)
|
|
403
|
+
|
|
404
|
+
Wraps **every shell command (and its child processes)** in a kernel-enforced
|
|
405
|
+
filesystem **and (binary) network** jail so confinement is the OS's job, not trust
|
|
406
|
+
or pattern-matching. It is an **additional boundary UNDER** the deny-list
|
|
407
|
+
(`lib/deny.js`), per-pattern permissions (Task 4.1), `--readonly`, and `isPathSafe`
|
|
408
|
+
— defense in depth. All of those still run; the sandbox catches what they miss.
|
|
409
|
+
|
|
410
|
+
**Binary network isolation (Task 4.4b).** A sandboxed command has either **normal
|
|
411
|
+
network** (the default — otherwise `npm install`/`pip` are unusable) or **NONE**,
|
|
412
|
+
kernel-enforced: bwrap `--unshare-net` (a fresh network namespace with no real
|
|
413
|
+
interfaces) on Linux, a Seatbelt `(deny network*)` clause on macOS. There is
|
|
414
|
+
deliberately **no host proxy, no domain allowlist, and no TLS interception** (so
|
|
415
|
+
Go binaries like `gh`/`gcloud` are unaffected). This is **on/off per sandboxed
|
|
416
|
+
command, not "allow github, block the rest."**
|
|
417
|
+
|
|
418
|
+
> **Why binary, not a domain allowlist (the state-of-the-art lesson).** The
|
|
419
|
+
> reference implementation (Claude Code) shipped a domain-allowlist network
|
|
420
|
+
> sandbox via a host-side SOCKS/HTTP proxy. It was bypassed **completely, twice, by
|
|
421
|
+
> two independent researchers, over 5.5 months** — because OS enforcement correctly
|
|
422
|
+
> pins the agent to localhost, but the egress decision is delegated to a host-side
|
|
423
|
+
> proxy with full network privileges, and fooling the proxy makes the **host** dial
|
|
424
|
+
> out. Documented failures: (a) `allowedDomains: []` (most-restrictive intent) read
|
|
425
|
+
> as "allow all" via an `allowedDomains.length > 0` check — a **fail-open**
|
|
426
|
+
> (CVE-2025-66479); (b) a JS-vs-libc hostname-parser differential (`endsWith()`);
|
|
427
|
+
> (c) TLS MITM in the proxy broke Go binaries. The proxy also rode on an abandoned
|
|
428
|
+
> dependency in the security path. We choose **binary** isolation to remove that
|
|
429
|
+
> entire class of bypass *by construction*. Domain-granularity is **deferred**
|
|
430
|
+
> (see **Deferred / Not Yet Implemented**), with this rationale recorded.
|
|
431
|
+
|
|
432
|
+
**Anti-fail-open (constraint, the `allowedDomains:[]` lesson).** Network defaults
|
|
433
|
+
**on**, but the moment a human **touches** the network setting — `sandbox.network`
|
|
434
|
+
in config, or the `--no-network` flag — that is an "isolation-requested" context,
|
|
435
|
+
and there anything not **exactly** `"on"` (empty / missing-in-an-object /
|
|
436
|
+
malformed / a typo / `false` / `null`) resolves to the **safe isolated state
|
|
437
|
+
(no-network)**, never silently back to network (`normalizeSandbox`). The
|
|
438
|
+
intended-most-restrictive input is the most-restrictive outcome. *Limitation:*
|
|
439
|
+
no-network is enforced **by the jail**, so it only applies to **sandboxed**
|
|
440
|
+
commands — a `mode: "off"` or human-approved-unavailable run has the host network
|
|
441
|
+
(reported honestly as `net:on`).
|
|
442
|
+
|
|
443
|
+
**Chokepoint (unified Pre-Task 5.0a).** The sandbox decision lives in the shared
|
|
444
|
+
`resolveSandboxedSpawn` shim (`lib/sandbox.js`) — folding the config×detection
|
|
445
|
+
decision, the command wrapping, and the fail-safe fallback into one async
|
|
446
|
+
resolution the caller spawns. **Every** shell-executing path routes through it:
|
|
447
|
+
`agentExecShell` (`lib/tools.js`, the `exec`/`shell` tool — both XML and native
|
|
448
|
+
tags converge here), **self-verification** (`lib/verify.js`), and **command-type
|
|
449
|
+
lifecycle hooks** (`lib/hooks.js`). So the model has **no path that runs a command
|
|
450
|
+
outside the sandbox** — the previously-unsandboxed verify/hook `spawnSync` paths
|
|
451
|
+
are now covered (the gap the re-audit found). Prompt hooks execute no shell and so
|
|
452
|
+
are unaffected.
|
|
453
|
+
|
|
454
|
+
**Platforms.**
|
|
455
|
+
- **macOS** → Seatbelt via `sandbox-exec` (built-in; an SBPL policy is generated
|
|
456
|
+
per call).
|
|
457
|
+
- **Linux / WSL2** → `bwrap` (bubblewrap, unprivileged user namespaces).
|
|
458
|
+
- **Native Windows / WSL1** → no OS primitive (bwrap needs namespaces WSL1 lacks;
|
|
459
|
+
native Windows has none) → the sandbox is **unavailable**; the fallback applies.
|
|
460
|
+
|
|
461
|
+
**Policy model (what's allowed / denied).** Reads are allowed broadly (whole FS
|
|
462
|
+
readable). Writes are confined to the **working directory** (+ a writable temp
|
|
463
|
+
dir). With `--allow-anywhere` the whole FS becomes writable **except** the
|
|
464
|
+
protected paths, which stay read-only regardless. bwrap: `--ro-bind / /` (or
|
|
465
|
+
`--bind / /` for allow-anywhere) → fresh `--proc /proc` + `--dev /dev` → `--bind`
|
|
466
|
+
the writable roots → `--ro-bind` the protected paths **last** (so they win on
|
|
467
|
+
overlap, e.g. cwd == `$HOME`) → `--chdir`. Seatbelt mirrors this with
|
|
468
|
+
last-match-wins SBPL: `(allow default)` → deny all writes → re-allow writable
|
|
469
|
+
roots → re-deny protected. **Network:** when network is `off`, bwrap prepends
|
|
470
|
+
`--unshare-net` and Seatbelt adds `(deny network*)` right after `(allow default)`
|
|
471
|
+
(last-match-wins keeps it denied); when `on` (the default) neither is emitted.
|
|
472
|
+
|
|
473
|
+
**The three real-CVE constraints it enforces:**
|
|
474
|
+
1. **The agent can NEVER disable the sandbox — or widen network access.** No
|
|
475
|
+
tool/flag/config the *model* can reach turns the sandbox off **or flips
|
|
476
|
+
no-network back to network**. `sandbox.mode` / `sandbox.network` live in the
|
|
477
|
+
human-edited user/project config; the only runtime signals are human-typed CLI
|
|
478
|
+
flags (`--dangerously-skip-permissions`, `--no-network`) or config. Call-level
|
|
479
|
+
options the model might influence cannot flip the decision (proven by tests —
|
|
480
|
+
including one passing a `{ network: 'on' }` call option that is ignored under a
|
|
481
|
+
no-network jail).
|
|
482
|
+
2. **config / hooks / secrets are READ-ONLY inside the jail — including
|
|
483
|
+
not-yet-existing files** (CVE-2026-25725). The whole `~/.semalt-ai` dir, the
|
|
484
|
+
secret dirs (`~/.ssh`/`~/.aws`/`~/.gnupg`), `/etc`, **and every project
|
|
485
|
+
`.semalt` dir from the CWD up to the repo root** (Pre-Task 5.0b) are bound
|
|
486
|
+
read-only, so a sandboxed process cannot **create** a missing `config.json`
|
|
487
|
+
(or `agents`/hooks) — under `~/.semalt-ai` *or* the in-CWD `.semalt` — to
|
|
488
|
+
inject host-privileged execution. The protected-config dir set is
|
|
489
|
+
single-sourced as `protectedConfigDirs` (`lib/constants.js`) and shared by the
|
|
490
|
+
jail (`protectedPaths`) and the host write guard (see below).
|
|
491
|
+
3. **procfs / symlink / `..` rewrites are confined on the RESOLVED real path**
|
|
492
|
+
(the `/proc/self/root` bypass). bwrap mounts a fresh `/proc` and the kernel
|
|
493
|
+
enforces every bind on the resolved path; protected paths are
|
|
494
|
+
`realpath()`-canonicalized before binding. (The deny-list got a matching fix —
|
|
495
|
+
see below.)
|
|
496
|
+
|
|
497
|
+
**Fallback (fail-safe, defaults safe).** If the sandbox can't start (missing
|
|
498
|
+
bwrap, unsupported platform) the command is **never silently run unsandboxed**:
|
|
499
|
+
- default (`auto`) → fall back to a **human approval** (`onUnsandboxed`, injected
|
|
500
|
+
by `index.js`, never reachable by the model); with **no approver** (non-TTY /
|
|
501
|
+
headless / tests) the command is **REFUSED**.
|
|
502
|
+
- `sandbox.failIfUnavailable: true` → a **hard error** (strict gate) instead.
|
|
503
|
+
- `sandbox.mode: "off"` (a deliberate human opt-out) → run unsandboxed, status
|
|
504
|
+
`off`. `--dangerously-skip-permissions` (human-only) bypasses all safety,
|
|
505
|
+
sandbox included.
|
|
506
|
+
|
|
507
|
+
**Child-process confinement.** The bwrap/`sandbox-exec` process is the
|
|
508
|
+
process-group leader (`spawnWithGroup`), so the existing `lib/proc.js`
|
|
509
|
+
tree-kill/abort plumbing tears down the **whole jailed subtree**, and a spawned
|
|
510
|
+
subprocess (e.g. an `npm install` postinstall hook) is bound by the same jail.
|
|
511
|
+
|
|
512
|
+
**Surfacing.** Each shell result carries `sandbox: 'on' | 'off' | 'unavailable'`
|
|
513
|
+
**and `network: 'on' | 'off'`** fields; both appear in `--debug` (shell debug rows
|
|
514
|
+
— `sandbox:` and `net:`) and the audit log (the `exec` row's input + a
|
|
515
|
+
`sandbox-blocked`/`sandbox-refused` result status when the fallback blocks).
|
|
516
|
+
`/sandbox` and `semalt-code sandbox` print the full status report including the
|
|
517
|
+
effective network mode (`effective: ON (net:on|off)`).
|
|
518
|
+
|
|
519
|
+
**Config (`config.sandbox`):** `{ mode: "auto"|"off", failIfUnavailable: bool,
|
|
520
|
+
network: "on"|"off" }` — normalized by `normalizeSandbox` (`lib/sandbox.js`).
|
|
521
|
+
`auto`/`network:"on"` by default; `mode:"off"` and `network:"off"` are
|
|
522
|
+
**human-only** settings (plus the `--no-network` CLI flag, read once at module load
|
|
523
|
+
in `lib/tools.js` and from argv in the shared shim). Detection (`detectSandbox`) is
|
|
524
|
+
**cached** per process and fully injectable (`platform`/`which`/`probe`/`readFile`)
|
|
525
|
+
so every platform path is unit-testable. The shared `resolveSandboxedSpawn` shim
|
|
526
|
+
(Pre-Task 5.0a) is the universal entry both `agentExecShell` and the verify/hook
|
|
527
|
+
paths call; it threads the network mode through `decideSandbox` →
|
|
528
|
+
`wrapCommand` → the policy builders. Tested by `test/sandbox.test.js` (normalizer
|
|
529
|
+
incl. the **anti-fail-open** malformed-network case, detection per platform,
|
|
530
|
+
policy/argv generation incl. `--unshare-net` / `(deny network*)`, wrap, decision
|
|
531
|
+
network mode, status report), `test/sandbox-agent.test.js` (executor fallback:
|
|
532
|
+
refuse-on-unavailable, failIfUnavailable hard gate, approver yes/no, mode-off, no
|
|
533
|
+
model-reachable bypass, deny-list still fires under the layer, **a REAL no-network
|
|
534
|
+
jail surfaces `net:off` and a `{network:'on'}` call option cannot re-enable it**),
|
|
535
|
+
`test/sandbox-integration.test.js` (REAL bwrap/sandbox-exec jails — out-of-dir
|
|
536
|
+
write blocked, not-yet-existing config denied, nested-protected wins,
|
|
537
|
+
`/proc/self/root` confined, child confinement, broad reads, **no-network blocked +
|
|
538
|
+
paired network-on reachable + composes-with-fs + child inherits no-network** —
|
|
539
|
+
**skips gracefully** when the primitive is absent), and
|
|
540
|
+
`test/hooks-verify-sandbox.test.js` (the same shim applied to verify + command
|
|
541
|
+
hooks: fallback rules + REAL kernel out-of-CWD block + **REAL no-network jail for
|
|
542
|
+
verify and hook commands**).
|
|
543
|
+
|
|
544
|
+
## Project Memory (`lib/memory.js`, Task 2.3)
|
|
545
|
+
|
|
546
|
+
On session start, `getSystemPrompt()` appends project-local instruction files to the base prompt as a distinct, clearly-marked `<<<PROJECT_MEMORY>>>` section (trusted project context, not untrusted external content). Files are loaded in this hierarchy, all that exist, in order:
|
|
547
|
+
|
|
548
|
+
1. **global** — `~/.semalt-ai/AGENTS.md`
|
|
549
|
+
2. **project root** — `<repo root>/AGENTS.md` (repo root = nearest `.git` ancestor)
|
|
550
|
+
3. **cwd** — `<cwd>/AGENTS.md` (only when the CWD is nested below the repo root)
|
|
551
|
+
|
|
552
|
+
At each level **`CLAUDE.md` is an alias for `AGENTS.md`** — `AGENTS.md` wins when both exist, and the ignored `CLAUDE.md` is reported. Total size is bounded (`DEFAULT_MEMORY_MAX_BYTES` = 32 KB); oversized memory is truncated with a visible notice. With no memory files present, the system prompt is byte-for-byte the pre-2.3 prompt. `/memory` lists the loaded files and their resolved paths. A full system-prompt override (`--system-prompt <file>`) bypasses memory auto-loading.
|
|
553
|
+
|
|
554
|
+
---
|
|
555
|
+
|
|
556
|
+
## Multimodal Image Input (`lib/images.js`, Task 5.4)
|
|
557
|
+
|
|
558
|
+
Accept **image input** (screenshots, mockups, diagrams) so the agent can *see*.
|
|
559
|
+
**Input only** — formats **PNG, JPEG, WebP, GIF**. PDF is **deferred**; image
|
|
560
|
+
**generation** is out of scope entirely.
|
|
561
|
+
|
|
562
|
+
- **Entry points (all three):** `--image <path>` (repeatable) on the CLI/headless
|
|
563
|
+
(`lib/args.js` → `opts.image`, attached in `cmdCode`, `lib/commands/oneshot.js`),
|
|
564
|
+
the in-chat **`/image <path>`** command (stages into `ctx.pendingImages`,
|
|
565
|
+
consumed + cleared by the next user turn — `chat-slash.js`/`chat-turn.js`), and
|
|
566
|
+
the SDK facade `agent.run(prompt, { images: [...] })` (`lib/sdk.js`, accepts file
|
|
567
|
+
paths **or** pre-encoded `{ media_type, data }` records). Each image is read
|
|
568
|
+
through **`isPathSafe`** (same guard as every file read), **size-checked**,
|
|
569
|
+
**base64-encoded**, and its **media type detected from magic bytes** (extension
|
|
570
|
+
fallback). Images attach to the **latest user turn** as an internal `images`
|
|
571
|
+
field on the message — the rest of the loop (tools, permissions, headless
|
|
572
|
+
envelope) is unchanged.
|
|
573
|
+
|
|
574
|
+
- **Provider-specific content-part shape (constraint #1).** `lib/api.js` builds
|
|
575
|
+
the right encoding per endpoint at the wire, stripping the internal `images`
|
|
576
|
+
field:
|
|
577
|
+
- **OpenAI-style** (default): `{ type:"image_url", image_url:{ url:
|
|
578
|
+
"data:<media_type>;base64,<data>" } }`.
|
|
579
|
+
- **Anthropic-style**: `{ type:"image", source:{ type:"base64", media_type,
|
|
580
|
+
data } }`.
|
|
581
|
+
`selectImageFormat(config, model)` chooses by precedence: (1) the matching
|
|
582
|
+
`models[]` profile's `image_format`, (2) top-level `config.image_format`, (3)
|
|
583
|
+
heuristic — an Anthropic-native `api_base` → `anthropic`, else `openai` (the
|
|
584
|
+
project's OpenAI-compatible lingua franca). Same per-profile mechanism that
|
|
585
|
+
handles the MiniMax/Qwen quirks.
|
|
586
|
+
|
|
587
|
+
- **Vision capability — FAIL LOUD, never silently drop (constraint #2).**
|
|
588
|
+
`resolveVisionCapability(config, model)` returns `true` / `false` / `null`.
|
|
589
|
+
`false` (a profile/config marked `vision:false`, or a well-known text-only
|
|
590
|
+
family — embeddings/whisper/tts/moderation) → `chatStream` **throws a clear
|
|
591
|
+
error before any request is sent** ("Model X is not vision-capable…") and the
|
|
592
|
+
image is **never** stripped from the payload. `true` (a `vision:true` profile or
|
|
593
|
+
a known vision family) → proceed. `null` (unknown) → proceed and let the
|
|
594
|
+
endpoint reject cleanly. Capability comes from config/model metadata where
|
|
595
|
+
available; otherwise the endpoint error surfaces.
|
|
596
|
+
|
|
597
|
+
- **Size cap + path safety (constraint #3).** `image_max_bytes` (default **5 MB**)
|
|
598
|
+
caps the **raw** bytes before base64 (which inflates ~33%); over the cap is a
|
|
599
|
+
**clear error**, not an opaque endpoint failure. `isPathSafe` confines reads to
|
|
600
|
+
the CWD / refuses sensitive dirs exactly like other file reads.
|
|
601
|
+
|
|
602
|
+
- **Config:** `image_max_bytes` (int), `image_format` (`''`|`anthropic`|`openai`);
|
|
603
|
+
per-`models[]`-profile `vision` (bool) and `image_format`. Detection/format/
|
|
604
|
+
capability/shaping live in `lib/images.js` (pure, exhaustively unit-tested).
|
|
605
|
+
|
|
606
|
+
Tested by `test/images.test.js` (magic-byte detection per format incl.
|
|
607
|
+
header-beats-extension; read path — size cap, `isPathSafe` refusal, unsupported,
|
|
608
|
+
missing; both provider shapes; format-selection precedence; vision-capability
|
|
609
|
+
fail-loud; transform helpers) and `test/images-api.test.js` (REAL api client / SDK
|
|
610
|
+
↔ mock-LLM: OpenAI-style + Anthropic-style parts on the wire; **a text-only model
|
|
611
|
+
errors and sends NO request — image not silently dropped — paired with a vision
|
|
612
|
+
model accepting it**; a plain text turn still sends string content; the SDK
|
|
613
|
+
`images` option reads a real file and the out-of-CWD path is refused).
|
|
614
|
+
|
|
615
|
+
---
|
|
616
|
+
|
|
617
|
+
## Web Fetch Pipeline (`lib/web-extract.js` + `lib/web-summarize.js`, Task W.1 / W.1b)
|
|
618
|
+
|
|
619
|
+
`http_get` no longer dumps raw HTML into context **by default** (the old behavior
|
|
620
|
+
put up to 256 KB ≈ 60–80k tokens of verbatim page into the model). It runs a
|
|
621
|
+
pipeline whose depth is selected by a three-level **`mode` enum** (Task W.1b):
|
|
622
|
+
|
|
623
|
+
- **`summarized`** (default) — extract → Markdown → secondary-LLM summary; only
|
|
624
|
+
the compact summary enters context. For find/answer tasks.
|
|
625
|
+
- **`extracted`** — extract → Markdown, **no** summary. For reading an
|
|
626
|
+
article/doc verbatim or grabbing an exact snippet/quote.
|
|
627
|
+
- **`raw`** — **bypass extraction entirely**; return the **original** fetched
|
|
628
|
+
HTML/content, token-capped + fenced. For analyzing a page's HTML/CSS/JS/markup/
|
|
629
|
+
structure — the one task extraction destroys (W.1 had removed this access; W.1b
|
|
630
|
+
restores it as an explicit mode). The raw short-circuit lives at the top of
|
|
631
|
+
`processWebContent` (before `extractContent`); **`capToTokens` and the untrusted
|
|
632
|
+
fence still apply** (raw HTML is token-heavier, so the budget matters more).
|
|
633
|
+
|
|
634
|
+
**Mode resolution / precedence (no ambiguity).** An explicit `mode` wins over the
|
|
635
|
+
deprecated boolean aliases, which win over the `web.summarize` config default.
|
|
636
|
+
The aliases `summarize="false"` and `raw="true"` both map to **`extracted`**
|
|
637
|
+
(kept for back-compat — `raw="true"` still does **not** return raw HTML; use
|
|
638
|
+
`mode="raw"`). Resolved both at parse time (`_httpGetOpts`/`_httpGetOptsFromParams`)
|
|
639
|
+
and defensively in `http_get`'s execute (legacy booleans may arrive directly on
|
|
640
|
+
the call-opts). `WEB_FETCH_MODES` (`lib/tool_registry.js`) is the canonical enum.
|
|
641
|
+
|
|
642
|
+
For the `summarized`/`extracted` (non-raw) modes the stages are:
|
|
643
|
+
|
|
644
|
+
1. **Extract + convert (`lib/web-extract.js`).** Classify by content-type (with a
|
|
645
|
+
light sniff fallback). For **HTML**: **Mozilla Readability** extracts the main
|
|
646
|
+
article (dropping nav/sidebar/footer/ads/scripts), then **Turndown** converts
|
|
647
|
+
it to clean Markdown. **JSON / plain-text / Markdown pass through verbatim** —
|
|
648
|
+
they are never run through the HTML parser or summarizer (no mangling).
|
|
649
|
+
2. **Token budget (`capToTokens`).** A token-aware cap
|
|
650
|
+
(`web.max_content_tokens`, default **6000**, char/4 estimate) on the extracted
|
|
651
|
+
content — this **replaces the blind 256 KB byte cut as the context-protection
|
|
652
|
+
mechanism** (even clean Markdown can be large). The old byte cap
|
|
653
|
+
(`http_fetch_max_bytes`) is now **only a transfer/disk guard**. Oversize
|
|
654
|
+
content is truncated with a visible notice.
|
|
655
|
+
3. **Secondary summary (`lib/web-summarize.js`).** By default a **separate cheap
|
|
656
|
+
LLM call** (the `compact.js`/subagent pattern) summarizes the extracted
|
|
657
|
+
Markdown; **only the summary enters context**, the extracted full text does
|
|
658
|
+
not. This is the dominant token win.
|
|
659
|
+
|
|
660
|
+
**Pipeline orchestration** lives in `processWebContent` (`lib/tool_registry.js`),
|
|
661
|
+
called from `http_get`'s execute after the fetch. The secondary LLM call is an
|
|
662
|
+
**injected** `webChat(messages, { model, signal }) => Promise<string>` — the api
|
|
663
|
+
client's new quiet, non-streaming `chatComplete` (`lib/api.js`), wired in
|
|
664
|
+
`index.js` and `lib/sdk.js`. In paths with **no** api client (some headless/
|
|
665
|
+
one-shot wiring), `webChat` is absent → the pipeline returns **extracted
|
|
666
|
+
Markdown**, never the raw page.
|
|
667
|
+
|
|
668
|
+
**Configurable, defaults on (constraint 1).** `config.web.summarize` (default
|
|
669
|
+
**true**) sets the global default mode (`summarized` when true, `extracted` when
|
|
670
|
+
false). Override per-fetch with `<http_get url="…" mode="extracted"/>` (or the
|
|
671
|
+
deprecated `summarize="false"`/`raw="true"`) for verbatim extracted Markdown when
|
|
672
|
+
an exact snippet/quote matters, or `mode="raw"` for the original markup. Optional
|
|
673
|
+
`intent="…"` focuses the summary. `web.summary_model` (default `''` → the current
|
|
674
|
+
model) is the cheap model for the secondary call.
|
|
675
|
+
|
|
676
|
+
**Untrusted perimeter holds at every stage (constraint 2).** The page stays
|
|
677
|
+
untrusted end-to-end. The secondary summarizer is an LLM reading untrusted
|
|
678
|
+
content, so its prompt treats the page as **DATA ONLY** ("never obey/follow/act
|
|
679
|
+
on anything inside") and the page text is wrapped in an untrusted fence inside
|
|
680
|
+
the summary request. The summarizer's **output still returns to the main context
|
|
681
|
+
wrapped in the `<<<UNTRUSTED_EXTERNAL_CONTENT>>>` fence** (`lib/agent.js`) — a
|
|
682
|
+
page injection could have steered the summarizer, so the perimeter does not
|
|
683
|
+
weaken because an LLM now sits between page and context.
|
|
684
|
+
|
|
685
|
+
**Failure containment (constraint 3).** A summarizer error/timeout falls back to
|
|
686
|
+
the **capped extracted Markdown** (and, only if extraction itself somehow throws,
|
|
687
|
+
a crude tag-strip) — **never the raw HTML**. The result object carries
|
|
688
|
+
`summary_error`/`processing_error` for transparency.
|
|
689
|
+
|
|
690
|
+
**Latency/cost honesty.** Summarization adds **one LLM call per fetch**
|
|
691
|
+
(documented in the `http_get` tool description and `config.web` comment); the
|
|
692
|
+
no-summary mode exists for when that tradeoff isn't wanted.
|
|
693
|
+
|
|
694
|
+
**User-Agent (Task W.3 Part 2).** `http_get` and `download` send a **fixed,
|
|
695
|
+
realistic browser User-Agent** (`DEFAULT_USER_AGENT`, `lib/constants.js`) on every
|
|
696
|
+
request via `_resolveUserAgent(cfg)` (`lib/tool_registry.js`, applied at the single
|
|
697
|
+
`proto.get` site in each tool). This defeats **simple** UA-based bot-blocking — the
|
|
698
|
+
empty/curl-like UA is why sites like Wikipedia (403) and the Guardian (406) reject
|
|
699
|
+
the fetch. It is a **partial** mitigation only: Cloudflare / JS-challenges /
|
|
700
|
+
IP-rate-limits still 403 (full coverage needs a headless browser — deliberately out
|
|
701
|
+
of scope). The UA is **operator-overridable** via `config.web.user_agent` but
|
|
702
|
+
**never model-selectable** — there is **no UA parameter in the tool spec**, so the
|
|
703
|
+
agent cannot set a per-call UA (that would be an impersonation/evasion surface; same
|
|
704
|
+
line we hold elsewhere — the agent doesn't control how the tool presents itself to
|
|
705
|
+
the outside). The constant is **lazily** required inside `_resolveUserAgent`
|
|
706
|
+
(constants.js↔tool_registry.js is a circular dependency; a top-level destructure
|
|
707
|
+
would capture `undefined`). Tested by `test/http-get-user-agent.test.js`
|
|
708
|
+
(default + override on both tools via a header-capturing local server; the spec
|
|
709
|
+
exposes no UA knob; normalization defaults/trims).
|
|
710
|
+
|
|
711
|
+
**Result shape.** `http_get` returns `{ status_code, body, bytes, kind, mode,
|
|
712
|
+
extracted, summarized, content_tokens, content_truncated, transfer_capped,
|
|
713
|
+
title?, summary_error? }`. `body` is the summary, the extracted Markdown, or (in
|
|
714
|
+
`raw` mode) the original token-capped content. The `lib/agent.js` formatter notes
|
|
715
|
+
the mode (`summarized` / `extracted Markdown` / `raw <kind> (verbatim, capped)` /
|
|
716
|
+
`<kind> (verbatim)`) in the visible prefix, still inside the untrusted fence.
|
|
717
|
+
|
|
718
|
+
Tested by `test/web-extract.test.js` (classification, extraction drops
|
|
719
|
+
chrome/scripts/ads, ≥3× extraction-only token reduction, JSON/text pass-through,
|
|
720
|
+
token cap + notice, data-only summary-request framing),
|
|
721
|
+
`test/web-fetch-agent.test.js` (real local fixture server + real extraction +
|
|
722
|
+
mock summarizer: summarize-on → only the summary enters context, **≥10× token
|
|
723
|
+
reduction vs raw HTML**, summarize-off → capped extracted Markdown, **injection
|
|
724
|
+
in the page does not steer the summarizer and stays fenced as data**, summarizer
|
|
725
|
+
failure → fallback to extracted Markdown never raw HTML, no-summarizer path,
|
|
726
|
+
JSON/text pass-through, token-budget cap; **+ W.1b: `mode="raw"` returns the
|
|
727
|
+
original HTML (markup intact) capped, `extracted`≡legacy `summarize=false`,
|
|
728
|
+
`summarized`≡default, legacy `raw="true"`→extracted, precedence mode>boolean**),
|
|
729
|
+
and `test/web-fetch-mode.test.js` (W.1b unit: alias-resolution precedence XML +
|
|
730
|
+
native, the raw short-circuit returning original markup + still token-capped +
|
|
731
|
+
no summarizer call, the spec exposing the three-mode enum).
|
|
732
|
+
|
|
733
|
+
---
|
|
734
|
+
|
|
735
|
+
## Web Search (`web_search` tool, Task W.2b)
|
|
736
|
+
|
|
737
|
+
A **separate `web_search` tool** closes the URL-guessing gap: the agent **searches**
|
|
738
|
+
for candidate pages (snippets via SearXNG through the backend) and then **fetches
|
|
739
|
+
the relevant one(s)** with `http_get` (the W.1 pipeline). Clean two-step
|
|
740
|
+
separation — `web_search` *finds*, `http_get` *reads* — replacing blind
|
|
741
|
+
multi-fetch with targeted fetch.
|
|
742
|
+
|
|
743
|
+
- **Backend-backed (`dashboardSearch`, `lib/api.js`).** `web_search` calls the
|
|
744
|
+
backend `POST /api/search` (W.2a — authenticates the existing Bearer token,
|
|
745
|
+
queries SearXNG, returns `{ results: [{title,url,snippet}, …] }`).
|
|
746
|
+
`dashboardSearch(query, { count })` is modeled byte-for-byte on
|
|
747
|
+
`dashboardListModels` (`requireAuthToken()` → `requestJson(dashboardUrl('/api/search'), …)`)
|
|
748
|
+
and is injected into the tool executor as `webSearch` (wired in `index.js` and
|
|
749
|
+
`lib/sdk.js`, exactly like the W.1 `webChat`).
|
|
750
|
+
- **Backend-unavailable is a clean tool error, never a crash (the http_get-fix
|
|
751
|
+
lesson).** The backend runs on another machine and may be down / unreachable /
|
|
752
|
+
timing out / returning a non-2xx or `{error}` envelope; auth or dashboard
|
|
753
|
+
config may be missing. The executor catches **every** failure mode — including
|
|
754
|
+
the *synchronous* `requireAuthToken()` throw — and returns
|
|
755
|
+
`{ error: "web search unavailable: <reason>" }`. **Nothing throws out of the
|
|
756
|
+
executor**, proven paired with a healthy-backend positive.
|
|
757
|
+
- **The spec drives search→fetch (this is what prevents the "fetch everything"
|
|
758
|
+
mess).** The model-facing `web_search` description (`lib/tool_specs.js`) says:
|
|
759
|
+
this returns *candidate* results (title/url/snippet, **not** page content) —
|
|
760
|
+
read the snippets, pick the most relevant one or few, and fetch **only those**
|
|
761
|
+
with `http_get` (`mode="summarized"` to read, `mode="raw"` for markup); **do NOT
|
|
762
|
+
fetch every result**.
|
|
763
|
+
- **Untrusted + gated like `http_get`.** Titles/snippets are third-party content,
|
|
764
|
+
so the result is wrapped in the same `<<<UNTRUSTED_EXTERNAL_CONTENT>>>` fence
|
|
765
|
+
(`lib/agent.js`) as `http_get`/MCP results. The permission descriptor matches
|
|
766
|
+
`http_get` (`actionType: 'net'`, gated — not a privileged path; performs no
|
|
767
|
+
mutation).
|
|
768
|
+
- **Compact + bounded.** Output is a compact `{title,url,snippet}` list (small
|
|
769
|
+
token cost vs fetching pages). `count` is optional, bounded client-side
|
|
770
|
+
(`_clampSearchCount`, ≤ 10) before the call and clamped again by the backend;
|
|
771
|
+
the surfaced list is never re-expanded past the request.
|
|
772
|
+
- **Scope (like MCP / W.1 summarizer).** `webSearch` is wired only where an api
|
|
773
|
+
client exists (interactive chat + the SDK). In headless/one-shot paths without
|
|
774
|
+
one, `web_search` returns a clean "no backend client configured" tool error.
|
|
775
|
+
|
|
776
|
+
A single registration object in `lib/tool_registry.js` (spec + native
|
|
777
|
+
`fromParams` + XML `parseXml` + `execute` + `permission`) with matching
|
|
778
|
+
`lib/tool_specs.js`, `lib/constants.js` (TAG_REGISTRY parity), and `lib/prompts.js`
|
|
779
|
+
entries. Tested by `test/web-search.test.js` (offline, mocked `webSearch`): compact
|
|
780
|
+
list from a healthy backend; XML ↔ native dispatch parity; **every backend failure
|
|
781
|
+
mode → clean tool error with no exception escaping, paired with a positive**;
|
|
782
|
+
missing-auth / no-client / empty-query clean errors; untrusted fence proven
|
|
783
|
+
end-to-end through the real agent loop; the spec's search→fetch guidance; `count`
|
|
784
|
+
passthrough + bounding.
|
|
785
|
+
|
|
786
|
+
---
|
|
787
|
+
|
|
788
|
+
## Web-Activity Output Summary (`lib/ui/web-activity.js`, Task W.3 Part 1)
|
|
789
|
+
|
|
790
|
+
A web task now runs `web_search` (find) → targeted `http_get` (read), which used
|
|
791
|
+
to print **one tool line per operation** (a noisy `tool · web_search` / `net · GET …`
|
|
792
|
+
list). The **default** chat view now **collapses a run of consecutive web ops into
|
|
793
|
+
a single process-summary line** that reads as one process:
|
|
794
|
+
|
|
795
|
+
```
|
|
796
|
+
✓ web · search "коррупционные скандалы…" · 2 queries · 3 sources read · 1 blocked
|
|
797
|
+
```
|
|
798
|
+
|
|
799
|
+
- **Scope: `web_search` + `http_get` only.** `download` is a file-save (not a page
|
|
800
|
+
read for the search→fetch flow) and keeps its own line; all **non-web** tools
|
|
801
|
+
render exactly as before.
|
|
802
|
+
- **`--debug` keeps the full per-operation lines** — the collapser is bypassed in
|
|
803
|
+
debug mode (`sessionCtx.debugMode` in `lib/commands/chat-turn.js`), so every
|
|
804
|
+
`tool · web_search` / `net · GET … · status · size` row is still shown. Nothing
|
|
805
|
+
is lost, just hidden by default.
|
|
806
|
+
- **Failures are visible, never dropped.** An `http_get` that timed out OR returned
|
|
807
|
+
**≥ 400** (a 403/406 is a real block even though the fetch completed) is counted
|
|
808
|
+
as **"blocked"**; a failed `web_search` (backend down) shows as **"search failed"**
|
|
809
|
+
(`opSucceeded`). The compact view never silently omits a source that didn't load.
|
|
810
|
+
- **Display only — the audit log is unchanged.** Per-operation `logToolCall` rows
|
|
811
|
+
are written in the executors (untouched); this is purely the chat-render path.
|
|
812
|
+
- **Runtime model.** `createWebActivityTracker({ writerModule })` (per turn) owns
|
|
813
|
+
one writer **activity** entry per group of consecutive web ops, updating it in
|
|
814
|
+
place as ops complete and committing a **single** final summary to scrollback on
|
|
815
|
+
`flush()`. Tools run sequentially in the agent loop, so at most one group is open
|
|
816
|
+
(no concurrency). The group is flushed when a non-web tool starts (so its summary
|
|
817
|
+
lands above that tool's line) and once more at turn end (`finally`). Pure helpers
|
|
818
|
+
(`aggregateWebOps`, `webSummaryText`, `formatWebSummaryLine`, `renderWebActivity`)
|
|
819
|
+
are zero-dep and unit-tested.
|
|
820
|
+
- **Replay model (Output Refactor — Phase 6c).** Web ops are intercepted before
|
|
821
|
+
becoming a normal descriptor, so each persists a dedicated **web-op core**
|
|
822
|
+
`{v:1,kind:'web',tag,query,url,status,bytes,error,durationMs}` (`serializeWebOp`,
|
|
823
|
+
recognised by `isWebCore`) into its `_display` slot — native onto the
|
|
824
|
+
`{role:'tool'}` message, XML into the per-call `_display[]` array (6c-i). On
|
|
825
|
+
replay, `displayLoadedMessages` (`chat-session.js`) keeps a **loop-level** web
|
|
826
|
+
buffer and reproduces the live committed summary **byte-identical** by feeding the
|
|
827
|
+
flat core fields straight into the **pure** `aggregateWebOps` + `formatWebSummaryLine`
|
|
828
|
+
and committing via `chatHistory.addRawLine` (6c-ii). A group is flushed at the
|
|
829
|
+
same three points the live tracker flushes — a non-web tool/slot rendering, a
|
|
830
|
+
**terminal** assistant message with content (replay analogue of live
|
|
831
|
+
`cleanContent!==''`: an assistant NOT immediately followed by tool activity; an
|
|
832
|
+
empty intermediate iteration never flushes, so a group coalesces across
|
|
833
|
+
iterations/blobs/messages into one line), and end-of-loop (the live `finally`).
|
|
834
|
+
Replay **never instantiates `createWebActivityTracker`** nor touches the live
|
|
835
|
+
activity region — it only calls the pure helpers (Inv. 3, append-only scrollback).
|
|
836
|
+
The XML per-slot gate is fail-safe: any `null`/unknown slot drops the whole blob
|
|
837
|
+
to the legacy summary after flushing the open run.
|
|
838
|
+
|
|
839
|
+
Tested by `test/web-activity.test.js`: scope (`isWebTool`); the 403/timeout
|
|
840
|
+
"blocked" classification; the pure summary text reflecting query count / sources
|
|
841
|
+
read / failures; `renderWebActivity` default→one collapsed line vs `--debug`→full
|
|
842
|
+
per-op lines (status codes + URLs present); and the stateful tracker collapsing a
|
|
843
|
+
multi-op group into exactly one committed line (fresh group after flush; flush
|
|
844
|
+
no-op when empty).
|
|
845
|
+
|
|
846
|
+
---
|
|
847
|
+
|
|
848
|
+
## File-Activity Output Summary (`lib/ui/file-activity.js`)
|
|
849
|
+
|
|
850
|
+
A warm-up phase often fires a long run of pure file reads back-to-back, flooding
|
|
851
|
+
scrollback with one row per op. `createFileActivityTracker` is a **second,
|
|
852
|
+
independent instance of the web-activity pattern** that collapses a run of
|
|
853
|
+
**consecutive same-type** file ops into one process-summary line:
|
|
854
|
+
|
|
855
|
+
```
|
|
856
|
+
✓ file · read ×10 (index.html, battlecity.js, …)
|
|
857
|
+
```
|
|
858
|
+
|
|
859
|
+
The web tracker is **untouched** — the file tracker is a separate per-turn
|
|
860
|
+
instance instantiated alongside `webTracker` in `chat-turn.js`.
|
|
861
|
+
|
|
862
|
+
- **Scope: `read_file` + `list_dir` only.** Both are pure reads whose descriptors
|
|
863
|
+
carry `detail: null` (no diff / no output preview), so grouping them sidesteps
|
|
864
|
+
the deferred-detail-band ordering. All other file tools keep their own per-op
|
|
865
|
+
line. `--debug` bypasses the tracker (full per-op detail), like the web one.
|
|
866
|
+
- **Two divergences from the web tracker**, both forced by append-only scrollback:
|
|
867
|
+
- **Group key = (category, normalized tag).** `read ×N` and `list ×M` are
|
|
868
|
+
**separate** groups and never merge. Both rails emit `read` as the action for
|
|
869
|
+
a `read_file` call (`tool_registry`), so the tracker normalizes `read →
|
|
870
|
+
read_file` for the key (`normalizeFileTag`); `list_dir` stays as-is. A read↔list
|
|
871
|
+
key change inside `start()` flushes the prior group and opens a fresh one.
|
|
872
|
+
- **Threshold decided at flush time.** Because individual lines can't be
|
|
873
|
+
retroactively pulled into a group, **all** commits defer to `flush()`: a run of
|
|
874
|
+
**< 3** ops commits each op as its own normal `renderOperation(…, 'result')`
|
|
875
|
+
line (byte-identical to today); a run of **≥ 3** commits ONE summary line. The
|
|
876
|
+
accept-cost is that 1–2-op runs sit in the live region slightly longer before
|
|
877
|
+
committing.
|
|
878
|
+
- **Live view.** A single growing aggregate row — `● file · reading… ×N
|
|
879
|
+
(a.js, b.js, …)` — updated in place via `updateActivity` as ops complete (mirror
|
|
880
|
+
of the web tracker). The `×N` count sits in the **fixed prefix BEFORE** the
|
|
881
|
+
truncatable basename list, so it always shows even when the basenames are cut to
|
|
882
|
+
fit the current width; the line stays one physical row.
|
|
883
|
+
- **Errored op breaks the group (does NOT join it).** A mid-run error flushes the
|
|
884
|
+
success-group first (so its summary lands **above**), then renders the error
|
|
885
|
+
line + expandable body standalone; a new group may start after. Driven in
|
|
886
|
+
`chat-turn.js` `onToolEnd`: the success branch routes to `fileTracker.end`; an
|
|
887
|
+
error or a non-grouped tool flushes the open group, then falls through.
|
|
888
|
+
- **Flush sites mirror the web tracker exactly** (`chat-turn.js`): before a new
|
|
889
|
+
non-matching tool row (`onToolStart`), before the terminal `finalizeLastMessage`
|
|
890
|
+
(gated on the explicit **terminal** flag so intermediate-iteration narration does
|
|
891
|
+
NOT split a multi-iteration run), and the turn-end `finally`. Sequenced after
|
|
892
|
+
`commitDeferredDetail` exactly as the web flush is. Exactly **one** `endActivity`
|
|
893
|
+
call per group; a no-op when no group is open guards the boundary+`finally`
|
|
894
|
+
double-flush.
|
|
895
|
+
- **Persistence / replay — NO storage format change.** Each op still persists its
|
|
896
|
+
normal `serializeOperation` core (`{v:1, tag:'read', target, …}`). On replay,
|
|
897
|
+
`displayLoadedMessages` (`chat-session.js`) keeps a **loop-level** file buffer
|
|
898
|
+
(parallel to the web one), recognises a groupable file core via
|
|
899
|
+
`isGroupableFileCore`, and re-groups at the **same boundaries** — applying the
|
|
900
|
+
same ≥ 3 threshold computed at the **replay terminal width** (so a 200-col
|
|
901
|
+
session re-groups and re-truncates correctly in an 80-col terminal). A ≥ 3 run
|
|
902
|
+
commits the aggregated summary via `chatHistory.addRawLine`; a 1–2 run commits
|
|
903
|
+
each op via the same `_display` render the live per-op path uses. Replay never
|
|
904
|
+
instantiates the live tracker. The XML per-slot gate is unchanged (a file core is
|
|
905
|
+
a normal descriptor core and already passes `descriptorFromStored`).
|
|
906
|
+
|
|
907
|
+
Tested by `test/file-activity.test.js`: 10 reads → one `read ×10` summary
|
|
908
|
+
(basenames truncated, `×10` always present); 2 reads → two individual lines; reads
|
|
909
|
+
then lists → two separate summaries (key separation); a non-file tool flushing the
|
|
910
|
+
group before its row; a mid-run error → `read ×4` summary + standalone error +
|
|
911
|
+
body + a fresh group; multi-iteration narration → one summary (terminal gating);
|
|
912
|
+
replay re-grouping byte-identical at the replay width and re-truncating narrower;
|
|
913
|
+
the once-only double-flush guard; and the web tracker staying unaffected.
|
|
914
|
+
|
|
915
|
+
---
|
|
916
|
+
|
|
917
|
+
## Custom Slash Commands (`lib/commands/custom.js`, Task 3.1)
|
|
918
|
+
|
|
919
|
+
Users define slash commands as Markdown files — no code. At chat startup `cmdChat` discovers them and registers them into the registry (the single source of truth), so `resolveCommand`/completion/`/help` see them alongside built-ins.
|
|
920
|
+
|
|
921
|
+
- **Discovery**: `~/.semalt-ai/commands/*.md` (global) then the nearest `.semalt/commands/*.md` (project, via the Task 2.2 upward walk bounded by the repo root). Filename → command name (`review.md` → `/review`).
|
|
922
|
+
- **Frontmatter** (optional, `---`-delimited): `description`, `argument-hint`, `aliases`. The body is the prompt template.
|
|
923
|
+
- **Rendering**: `$ARGUMENTS` (full arg string) and `$1`/`$2`/… (whitespace-split positionals), single-pass so injected args are not re-expanded.
|
|
924
|
+
- **Precedence**: project overrides global on name collision; **built-ins always win** over customs (a colliding custom is dropped with a startup warning).
|
|
925
|
+
- **Invocation**: handled inline by the turn handler (`chat-turn.js`) — the rendered template is submitted to the agent as a **user prompt, never executed as code**. Custom commands are therefore excluded from `commandNames()` (the slash-handler parity check) since they need no handler.
|
|
926
|
+
|
|
927
|
+
---
|
|
928
|
+
|
|
929
|
+
## Skills (`lib/skills.js`, Task 3.5)
|
|
930
|
+
|
|
931
|
+
Skills package reusable methodology as a folder containing a `SKILL.md` (frontmatter `name`/`description` + a Markdown body) and, optionally, assets/scripts. The defining behavior is **progressive disclosure**: only each skill's **name + description** is ever injected into the system prompt; the **body loads into context only when the skill is invoked**, so skills don't bloat the prompt.
|
|
932
|
+
|
|
933
|
+
- **Discovery**: `~/.semalt-ai/skills/<name>/SKILL.md` (global) then the nearest `.semalt/skills/<name>/SKILL.md` (project, via the upward walk bounded by the repo root). The folder name → invocation slug (`deep-research/` → `/deep-research`); slugs are lowercased and hyphenated.
|
|
934
|
+
- **Progressive disclosure (load-bearing)**: `discoverSkills` returns **metadata only** — no body field. `getSystemPrompt` appends a `<<<SKILLS>>>` metadata block (name + description per skill) after the project-memory block. `loadSkillBody(spec)` is the **only** place a body is read, and it runs at **invocation time**, not discovery. Proven by `test/skills.test.js` and `test/skills-chat.test.js`.
|
|
935
|
+
- **Precedence**: project overrides global on slug collision; **built-ins always win**, and skills also defer to already-registered custom commands (a colliding skill is dropped with a startup warning).
|
|
936
|
+
- **Size bounding**: total metadata is bounded (`DEFAULT_SKILLS_MAX_BYTES` = 16 KB) with a visible truncation notice. With **no skills present the system prompt is byte-for-byte unchanged**.
|
|
937
|
+
- **Invocation**: skills register into the registry (`registerSkills`) flagged `skill: true`, carrying the `skillPath` (not the body). The turn handler (`chat-turn.js`) loads the body on `/<skill>`, renders `$ARGUMENTS`/`$1` (reusing `lib/commands/custom.js`), appends the skill's assets-directory path, and submits it to the agent as a **user prompt, never executed as code**. Skills are excluded from `commandNames()` (handled inline, no handler). `/skills` lists loaded skills and their disclosure state.
|
|
938
|
+
|
|
939
|
+
---
|
|
940
|
+
|
|
941
|
+
## Subagents (`lib/subagents.js`, Task 3.6)
|
|
942
|
+
|
|
943
|
+
A **subagent** is a second agent loop run with its **own isolated message history**. It exists to keep the parent context clean: noisy work (research, reading large files, review) runs in the child and **only the child's final result returns to the parent** — the parent never absorbs the child's intermediate turns. Built directly on the `runAgentLoop` factory: a child runner is just another `createAgentRunner` instance wired with **wrapped executors** that enforce the child's allowed-tool set, sharing the parent's permission manager.
|
|
944
|
+
|
|
945
|
+
- **`spawn_agent` tool** — registered as a **dynamic** tool (`registerDynamicTool` in `index.js`, like MCP), so it dispatches through the same agent loop and stays **out of the static parity check** (`lib/constants.js`). Native schema + XML (`<spawn_agent agent="x">prompt</spawn_agent>` or a JSON body) both resolve to `['spawn_agent', params]`. Available in interactive chat **and** headless one-shot runs.
|
|
946
|
+
- **Custom agent definitions** — `~/.semalt-ai/agents/<name>.md` (global) then the nearest `.semalt/agents/<name>.md` (project, via the repo-root-bounded upward walk); project wins on slug collision. Frontmatter: `name`, `model`, `tools` (a.k.a. `allowed-tools`), `description`; the Markdown body is the child's **system prompt**. Invoke by name: `spawn_agent({ agent: "reviewer", prompt })`.
|
|
947
|
+
- **Parallel execution** — pass `tasks: [...]` (or an array) to run independent subagents with **bounded concurrency** (a fixed-size worker pool; cap from `config.subagents.max_concurrency`, default 3, clamped 1–16).
|
|
948
|
+
- **Security (load-bearing, Phase 0):**
|
|
949
|
+
- **No privilege escalation** — the child uses the **same** `permissionManager`, so it can never auto-approve anything the parent wouldn't (a child mutating tool in non-TTY without `--allow-*`/skip is refused, just like the parent).
|
|
950
|
+
- **Tool constraint** — a def's `tools` list restricts the child; the wrapped `agentExecShell`/`agentExecFile` **hard-refuse** anything outside the set (enforced at the executor, so it holds for both the XML and native paths and gives the child feedback).
|
|
951
|
+
- **No recursion** — a child can never invoke `spawn_agent` (refused by the executor + dropped from any allowed-tool set).
|
|
952
|
+
- **Untrusted result** — a subagent's returned text is fenced in the `<<<UNTRUSTED_EXTERNAL_CONTENT>>>` delimiter (`lib/agent.js`), like `http_get`/MCP/hook output, because a child may have read external data.
|
|
953
|
+
- **Result token-capped (Task W.8)** — `formatSubagentResult` (`lib/agent.js`) caps the child's final text with `capToTokens` at the **generous** `subagents.max_result_tokens` budget (default **20000**) before fencing — a safety net against a verbose child, distinct from and strictly larger than the MCP budget (the child's result is our own deliberate, synthesized answer). The truncation notice signals the result was long. Isolation / no-escalation are unchanged — this bounds the *returned text size* only.
|
|
954
|
+
- **Config** — `subagents` is normalized to `{ max_concurrency, max_result_tokens }` (defaults 3 / 20000). Tested by `test/subagents.test.js` (discovery/frontmatter, allowed-tool resolution, bounded pool, the tool entry), `test/subagents-agent.test.js` (real child loop ↔ mock-LLM: isolation, untrusted fencing, tool constraint, permission inheritance), and `test/result-cap.test.js` (W.8: result cap + fence + budgets-differ).
|
|
955
|
+
|
|
956
|
+
---
|
|
957
|
+
|
|
958
|
+
## Background Tasks (`lib/background.js`, Task 5.3)
|
|
959
|
+
|
|
960
|
+
Run an agent task as a **detached background process** that survives the terminal
|
|
961
|
+
closing, with a task registry to list, inspect, collect, and terminate it. Each
|
|
962
|
+
background task is its **own process** — its own `process.cwd()`, its own dynamic
|
|
963
|
+
tool registry, its own everything — which **sidesteps the documented in-process
|
|
964
|
+
multi-instance global-state limits** of the embedding SDK (Task 5.2): isolation
|
|
965
|
+
comes for free from the process boundary. The child reuses the **stable
|
|
966
|
+
`createAgent` facade** internally.
|
|
967
|
+
|
|
968
|
+
- **Launch (CLI/SDK, human-initiated):** `semalt-code run --background "<prompt>"`
|
|
969
|
+
(`cmdRun`, `lib/commands/tasks.js` → `launchBackground`, `lib/background.js`),
|
|
970
|
+
or programmatically via `launchBackground(...)`. Policy flags (`--allow-*`,
|
|
971
|
+
`--readonly`, `--dangerously-skip-permissions`, `-m`) are read **at launch**.
|
|
972
|
+
- **Manage:** `semalt-code tasks list|status <id>|result <id>|kill <id>|prune`
|
|
973
|
+
(`cmdTasks`). `result` prints the standard headless envelope; `prune` removes
|
|
974
|
+
finished + stale entries.
|
|
975
|
+
|
|
976
|
+
**Validate before detach (constraint 4, load-bearing).** After forking there is
|
|
977
|
+
**no terminal to surface errors to**, so `launchBackground` runs `validateLaunch`
|
|
978
|
+
**synchronously before any process is spawned** — config validity (`api_base`, a
|
|
979
|
+
resolvable model), permission-policy shape (rule `tool`/`action`/single-matcher),
|
|
980
|
+
and sandbox availability (only a hard error when `failIfUnavailable`). An optional
|
|
981
|
+
injected `probeModel` covers reachability. A validation failure **throws in the
|
|
982
|
+
parent and spawns nothing** — no orphan (proven by the spawn-spy test).
|
|
983
|
+
|
|
984
|
+
**Launch-fixed, refuse-by-default posture (constraint 1).** A background task has
|
|
985
|
+
**no TTY and no human to ask**, so its permission policy is set at launch and can
|
|
986
|
+
**never** fall through to an interactive prompt. The child builds its agent via
|
|
987
|
+
`createAgent` with the launch policy; with **no policy the default REFUSES every
|
|
988
|
+
mutating/effectful tool** (read-only tools still run), inheriting the 5.2 embedded
|
|
989
|
+
perimeter. The **OS sandbox + destructive-command deny-list stay ON** in the child
|
|
990
|
+
unless an opt-out is passed **explicitly at launch** (`sandbox.mode: 'off'`, or
|
|
991
|
+
`--dangerously-skip-permissions`, which is propagated into the child's argv so
|
|
992
|
+
`lib/tools.js` honors it for the deny-list/secret/config guards). An unavailable
|
|
993
|
+
sandbox in `auto` mode **refuses** the command (no human to approve).
|
|
994
|
+
|
|
995
|
+
**IPC via files, not a live channel (constraint 3).** The detached child writes
|
|
996
|
+
**NDJSON** progress + a result envelope into the task dir; the parent reads them
|
|
997
|
+
on `collect`. This survives the terminal closing and needs no live IPC.
|
|
998
|
+
|
|
999
|
+
**Task store layout — `~/.semalt-ai/tasks/<id>/`** (`createTaskStore`, injectable
|
|
1000
|
+
`fs`/`now`/`rootDir`; atomic `meta.json` writes via temp+rename):
|
|
1001
|
+
- `spec.json` — the launch spec the child reads (prompt, apiBase, model, cwd,
|
|
1002
|
+
policy, sandbox, maxIterations). **No secrets on disk** — the API key is passed
|
|
1003
|
+
to the child via its **env** (`SEMALT_API_KEY`), never written here.
|
|
1004
|
+
- `meta.json` — registry record / current status snapshot `{ id, pid, status,
|
|
1005
|
+
started_at, finished_at, prompt_summary, model, policy_summary, stopReason?,
|
|
1006
|
+
error? }`.
|
|
1007
|
+
- `events.ndjson` — append-only progress log (one JSON object per line, like the
|
|
1008
|
+
audit log): `status` / `tool` (with `ok` + a `detail` excerpt on failure, e.g. a
|
|
1009
|
+
deny-list refusal) / `warning` / `error` / `result`.
|
|
1010
|
+
- `result.json` — the final headless envelope `{ result, toolCalls, usage, cost,
|
|
1011
|
+
stopReason, verifyStatus }`.
|
|
1012
|
+
|
|
1013
|
+
**Orphan lifecycle (constraint 2).** `proc.js` gains `spawnDetached` (session
|
|
1014
|
+
leader + `stdio: 'ignore'` + `unref()`), `killTreeByPid(pid, signal)` (POSIX
|
|
1015
|
+
negative-PID group kill / Windows `taskkill /T`, used by `tasks kill` after the
|
|
1016
|
+
launcher has exited), and `isProcessAlive(pid)` (`process.kill(pid, 0)`,
|
|
1017
|
+
EPERM = alive). A task marked `running` whose PID is no longer alive is **computed
|
|
1018
|
+
as `stale`** (`effectiveStatus`) — never persisted as a lie — so zombies never
|
|
1019
|
+
accumulate invisibly: `tasks list` flags them and `prunableIds`/`prune` clean them
|
|
1020
|
+
up. `killTask` SIGTERMs the recorded PID, waits a grace period, escalates to
|
|
1021
|
+
SIGKILL if still alive, then marks the record `terminated`.
|
|
1022
|
+
|
|
1023
|
+
**Tool-exposure decision (constraint 5) — NOT an agent tool, deliberately.**
|
|
1024
|
+
Background-launch is reachable **only** from the human-initiated CLI/SDK surface;
|
|
1025
|
+
there is **no `run_background`/`spawn_background` tag, no `TOOL_SPECS` entry, and
|
|
1026
|
+
nothing in the static or dynamic tool registry** (asserted by a test). Rationale:
|
|
1027
|
+
a model-reachable background launcher would be a **privilege-escalation surface**
|
|
1028
|
+
— the agent could fork a fresh process to escape its own permission perimeter (the
|
|
1029
|
+
subagent no-escalation rule, 4.5). Subagents already give the model in-process
|
|
1030
|
+
parallelism while **sharing the parent permission manager**; background tasks
|
|
1031
|
+
serve a different, human-owned need, so keeping the launcher off the tool surface
|
|
1032
|
+
removes the escalation question entirely. *If* a future task exposes such a tool,
|
|
1033
|
+
it MUST inherit and not exceed the launching agent's posture.
|
|
1034
|
+
|
|
1035
|
+
Tested by `test/background.test.js`: store CRUD + list ordering; validation flags
|
|
1036
|
+
empty prompt / missing model / malformed policy / strict-unavailable sandbox;
|
|
1037
|
+
**validation failure spawns no process (no orphan)**; launch persists spec+record,
|
|
1038
|
+
detaches via an injected spawn, defaults sandbox ON with explicit opt-out and the
|
|
1039
|
+
key carried via env (not disk); **real `createAgent` ↔ mock-LLM** child completes
|
|
1040
|
+
and writes the envelope; **safe posture** (no policy refuses a write, paired with
|
|
1041
|
+
an allow rule permitting it); **deny-list active inside the background process**;
|
|
1042
|
+
stale detection + prune; `killTask` tree-kills + marks terminated; a **real
|
|
1043
|
+
detached process** is alive then tree-killable by PID; an **E2E real detached
|
|
1044
|
+
`__bg-exec` child** runs the agent and writes the envelope; and the
|
|
1045
|
+
**no-background-tool** decision.
|
|
1046
|
+
|
|
1047
|
+
---
|
|
1048
|
+
|
|
1049
|
+
## Native Git Tools (`lib/tool_registry.js`, Task 5.1)
|
|
1050
|
+
|
|
1051
|
+
First-class git tools for the common operations where structured results help the
|
|
1052
|
+
agent; the long tail (rebase, reflog, cherry-pick, stash, submodule, remote ops…)
|
|
1053
|
+
stays in the **sandboxed** generic shell. Each tool is a single registration object
|
|
1054
|
+
(spec + native `fromParams` + XML `parseXml` + `execute` + `permission`) alongside
|
|
1055
|
+
every other tool — same `TOOL_SPECS` / `TAG_REGISTRY` parity guard, same
|
|
1056
|
+
`[action, opts]` dispatch over both the XML and native rails.
|
|
1057
|
+
|
|
1058
|
+
- **The eight tools.** Read-only: `git_status`, `git_diff`, `git_log`. Mutating:
|
|
1059
|
+
`git_add`, `git_commit`, `git_branch`, `git_checkout`. Infrastructure:
|
|
1060
|
+
`git_worktree` (create/list/remove worktrees for parallel agents in isolated
|
|
1061
|
+
trees). Everything else is plain shell.
|
|
1062
|
+
- **Structured output.** They shell out to `git` (no new dependency) but **parse the
|
|
1063
|
+
output into structured results** the model can act on:
|
|
1064
|
+
- `git_status` → `{ branch, staged:[{path,status}], unstaged:[…], untracked:[…], clean, summary }`
|
|
1065
|
+
(porcelain v1 + `--branch`).
|
|
1066
|
+
- `git_diff` → `{ staged, files:[{file, additions, deletions, hunks:[{header, lines}]}], additions, deletions, raw, summary }`.
|
|
1067
|
+
- `git_log` → `{ commits:[{hash, short, author, email, date, subject}], count, summary }`
|
|
1068
|
+
(a fresh repo with no commits degrades to an empty list, not an error).
|
|
1069
|
+
- `git_add` → `{ added, summary }`; `git_commit` → `{ hash, short, branch, summary }`;
|
|
1070
|
+
`git_branch` (list) → `{ branches:[{name,current}], current }`, (create/delete) →
|
|
1071
|
+
`{ created|deleted, summary }`; `git_checkout` → `{ branch, created, summary }`;
|
|
1072
|
+
`git_worktree` → `{ op, worktrees|path|branch, summary }`.
|
|
1073
|
+
The model sees a `summary` string (`formatFileResult` surfaces it); structured
|
|
1074
|
+
fields are returned for callers/tests.
|
|
1075
|
+
- **Permission posture by operation type (constraint).** Read-only tools — and the
|
|
1076
|
+
**list** ops of `git_branch`/`git_worktree` — return a **null** permission
|
|
1077
|
+
descriptor (no prompt). Mutating tools return a descriptor, honor `--readonly`
|
|
1078
|
+
(`git_add`/`git_commit`/`git_branch`/`git_checkout`/`git_worktree` ∈
|
|
1079
|
+
`READONLY_BLOCKED`), and pass through the per-pattern rule layer (a `deny` rule
|
|
1080
|
+
refuses them; an `allow` rule lets them run). Git tools are **not** in any
|
|
1081
|
+
`--allow-*` tier, so they are never auto-approved by a coarse tier flag.
|
|
1082
|
+
- **Confinement (constraint).** Every git invocation runs through
|
|
1083
|
+
`ctx.agentExecShell` — the **same** sandbox + deny-list chokepoint as `<shell>` —
|
|
1084
|
+
so git gets no privileged path around confinement. Arguments are shell-quoted
|
|
1085
|
+
(platform-aware) before the command string is handed to the chokepoint; the
|
|
1086
|
+
deny-list/sandbox remain the security boundary.
|
|
1087
|
+
- **`git_commit` message is the agent's, structured.** `message` is required and
|
|
1088
|
+
must be non-empty; an empty/whitespace message **errors without committing**
|
|
1089
|
+
(never a placeholder commit).
|
|
1090
|
+
- **Destructive-git ↔ checkpoint honesty (load-bearing).** Checkpoints (Task 4.3)
|
|
1091
|
+
snapshot **file-tool** mutations only. `git_checkout` (and any reset-like effect)
|
|
1092
|
+
can overwrite or discard uncommitted working-tree changes that checkpoints never
|
|
1093
|
+
captured — **git-discarded changes are NOT recoverable via `/rewind`.** This is
|
|
1094
|
+
stated in the tool descriptions (`TOOL_SPECS`), the permission prompt text, and
|
|
1095
|
+
here; do not imply `/rewind` covers git.
|
|
1096
|
+
- **Graceful degradation.** Not-a-repo and git-absent return a clear `{ error }`
|
|
1097
|
+
(mapped from the git output), never a crash.
|
|
1098
|
+
|
|
1099
|
+
Tested by `test/git-tools.test.js` (real `git init` temp repo, sandbox off):
|
|
1100
|
+
structured status/diff/log; read-only descriptors don't prompt while mutating ones
|
|
1101
|
+
do; add+commit produces a real commit (hash matches the log) and an empty message
|
|
1102
|
+
errors with no commit; branch/checkout switch; the **paired** `--readonly` block +
|
|
1103
|
+
non-readonly success and the **paired** per-pattern `deny`/`allow` resolution;
|
|
1104
|
+
worktree add/list/remove; not-a-repo and git-absent degrade gracefully; the
|
|
1105
|
+
checkpoint-scope caveat is present in the description; XML ↔ native tuple parity.
|
|
1106
|
+
|
|
1107
|
+
---
|
|
1108
|
+
|
|
1109
|
+
## Embedding SDK (`lib/sdk.js` + `lib/internals.js`, Task 5.2)
|
|
1110
|
+
|
|
1111
|
+
The project is consumable as a **library**, not only an executable, with a
|
|
1112
|
+
**two-tier surface physically separated by `package.json` `exports`** (not just
|
|
1113
|
+
documented):
|
|
1114
|
+
|
|
1115
|
+
- **Stable facade** — `require('@semalt-ai/code')` → `{ createAgent }` (main
|
|
1116
|
+
entry, `exports['.']` → `lib/sdk.js`). The supported, semver-stable contract.
|
|
1117
|
+
- **Unstable building blocks** — `require('@semalt-ai/code/internals')`
|
|
1118
|
+
(`exports['./internals']` → `lib/internals.js`) re-exports `createAgentRunner`,
|
|
1119
|
+
`createApiClient`, `createToolExecutor`, the registries, config, etc., behind a
|
|
1120
|
+
loud **NO STABILITY GUARANTEE** notice and an `__unstable__: true` marker.
|
|
1121
|
+
Internal refactors don't break facade consumers because the boundary is the
|
|
1122
|
+
`exports` map. Both subpaths resolve for `require` **and** `import` (CJS named
|
|
1123
|
+
exports via ESM interop — the project stays CommonJS).
|
|
1124
|
+
|
|
1125
|
+
**`createAgent(options)` → `{ run, on, off, close, getConfig, cwd, closed }`.**
|
|
1126
|
+
- `run(prompt, opts?)` executes a prompt to completion and returns the **headless
|
|
1127
|
+
envelope** `{ result, toolCalls, usage, cost, stopReason, verifyStatus }` (built
|
|
1128
|
+
by reusing `createHeadlessSink`), plus `messages` for multi-turn continuation
|
|
1129
|
+
(`run(next, { messages })`). Accepts `images: [...]` (file paths or pre-encoded
|
|
1130
|
+
`{ media_type, data }` records) to attach images to the turn (Task 5.4 — read
|
|
1131
|
+
through `isPathSafe`, size-capped, sent only to a vision model). Streams via
|
|
1132
|
+
`on(event, cb)` —
|
|
1133
|
+
`token`/`assistant`/`tool`/`tool-start`/`error`/`warning`/`done`. Chrome is
|
|
1134
|
+
suppressed for the run (`setUIActive`) so the host's stdout stays clean.
|
|
1135
|
+
- It assembles a **per-instance** config closure, api client, permission manager,
|
|
1136
|
+
tool executor, and agent runner — no shared module-global config between two
|
|
1137
|
+
`createAgent` instances.
|
|
1138
|
+
|
|
1139
|
+
**Programmatic permission perimeter — defaults safe (load-bearing).** No TTY in
|
|
1140
|
+
embedded use, so the policy is programmatic:
|
|
1141
|
+
- `approve(call) → boolean|Promise<boolean>` — an async approver (the programmatic
|
|
1142
|
+
equivalent of the interactive prompt), wired through a new `approver` option on
|
|
1143
|
+
`createPermissionManager`. Consulted only when the gate would otherwise refuse
|
|
1144
|
+
for lack of a way to ask, so it never widens what a tier already granted;
|
|
1145
|
+
throwing/falsy = no (fail closed).
|
|
1146
|
+
- `rules: [...]` (or `{ user, project }`) — preset allow/deny/ask rules reusing the
|
|
1147
|
+
Task 4.1 engine (host rules are the **user** layer = trusted; `loadProjectRules:
|
|
1148
|
+
true` adds the on-disk project layer, which can still only **narrow**).
|
|
1149
|
+
- `allow: ['fs'|'exec'|'net'|'sys'|'all']`, `readonly: true` — coarse tiers.
|
|
1150
|
+
- **With NO policy the default is to REFUSE every mutating/effectful tool**
|
|
1151
|
+
(read-only tools still run), mirroring non-TTY — never auto-approve.
|
|
1152
|
+
|
|
1153
|
+
**Sandbox/deny-list stay on; opt-out is explicit (load-bearing).** The OS sandbox
|
|
1154
|
+
defaults to `auto` (on) and the destructive-command deny-list + secret/config
|
|
1155
|
+
guards stay active in embedded mode — **not** disabled by the absence of a TTY.
|
|
1156
|
+
Disabling is deliberate, documented opt-in: `sandbox: { mode: 'off' }`,
|
|
1157
|
+
`onUnsandboxed` to permit an unsandboxed run when the kernel primitive is missing,
|
|
1158
|
+
and `dangerouslySkipPermissions: true` for the gate (still cannot bypass a `deny`
|
|
1159
|
+
rule or the deny-list). By default the SDK does **not** read the operator's
|
|
1160
|
+
`~/.semalt-ai/config.json` (`loadUserConfig: true` opts in).
|
|
1161
|
+
|
|
1162
|
+
**Lifecycle.** `createAgent` may open resources (MCP servers — connected lazily on
|
|
1163
|
+
first `run` when `config.mcp.servers` is set). Hosts **must** call `await
|
|
1164
|
+
close()`, which shuts down the MCP manager and removes listeners; `run()` after
|
|
1165
|
+
`close()` throws.
|
|
1166
|
+
|
|
1167
|
+
**Multi-instance — documented module-global limitations (constraint 4).** Per-
|
|
1168
|
+
instance config is isolated, but a few surfaces are process-global because they
|
|
1169
|
+
were built for the single-process CLI: the **dynamic tool registry**
|
|
1170
|
+
(`lib/tool_registry.js _dynamic`, where MCP + `spawn_agent` register) is shared;
|
|
1171
|
+
`isPathSafe` / the deny-list / secret+config guards read `process.cwd()` and
|
|
1172
|
+
`process.argv` **once at module load** (so the deny-list opt-out needs the host
|
|
1173
|
+
process launched with `--dangerously-skip-permissions`); and the chrome-suppress
|
|
1174
|
+
flag is process-wide. Fully-isolated agents → separate processes. This is stated
|
|
1175
|
+
honestly in the README rather than papered over.
|
|
1176
|
+
|
|
1177
|
+
Documented in README **Embedding SDK**; runnable `examples/embed.js`. Tested by
|
|
1178
|
+
`test/sdk.test.js` (real `createAgent` ↔ mock-LLM: envelope shape; **safe default
|
|
1179
|
+
refuses a mutating write with no policy** + paired positives via approver and via
|
|
1180
|
+
an allow rule; deny-list still blocks under an approving gate; sandbox default-on
|
|
1181
|
+
vs explicit opt-out; per-instance config isolation; `close()` disconnects a REAL
|
|
1182
|
+
stdio MCP server; run-after-close throws; the `exports` map resolves both
|
|
1183
|
+
subpaths).
|
|
1184
|
+
|
|
1185
|
+
---
|
|
1186
|
+
|
|
1187
|
+
|
|
1188
|
+
## Context Compaction & Payload Tuning (`lib/compact.js`, `lib/payload.js`, Task 2.7)
|
|
1189
|
+
|
|
1190
|
+
**`/compact`** is a real LLM summarization turn: `selectForCompaction` splits history into a head to summarize and a recent tail (plus pinned messages) to keep, the model summarizes the head (`summarizationRequest` → `chatSync`), and `buildCompactedMessages` rebuilds `pinned + summary + tail`. Before/after token counts are shown. **Auto-compaction** runs the same path in `chat-turn.js` when `shouldAutoCompact` fires (usage past 85% of a known limit), complementing — not duplicating — api.js `trimToTokenBudget` (which drops rather than summarizes). All selection/replacement logic is pure and unit-tested.
|
|
1191
|
+
|
|
1192
|
+
**Prompt caching** (`config.prompt_caching` / `--prompt-caching`): `applyPromptCaching` adds `cache_control:{type:'ephemeral'}` to the stable prefix (last system message + last tool) in the request body — opt-in, so it's never sent to endpoints that reject it. **`reasoning_effort`** (`config.reasoning_effort` / `--reasoning-effort`): `applyReasoningEffort` adds the param only for reasoning models (`supportsReasoningEffort` heuristic, or `reasoning_effort_force`). Both are applied in `api.js doRequest` and proven present/absent by request-body tests.
|
|
1193
|
+
|
|
1194
|
+
---
|
|
1195
|
+
|
|
1196
|
+
## Self-Diagnostics & Cost (`lib/doctor.js`, `lib/pricing.js`, Task 2.6)
|
|
1197
|
+
|
|
1198
|
+
**`/doctor`** (and `semalt-code doctor`) aggregate pass/warn/fail checks: config validity + resolved layers (2.2), API-key source (Phase 0), selected model + whether its context limit is known, dashboard reachability, audit-log writability, and loaded project-memory files (2.3). `aggregateChecks`/`formatDoctorReport` are pure; `diagnose` injects the impure gatherers. Overall = fail if any fail, else warn if any warn, else pass.
|
|
1199
|
+
|
|
1200
|
+
**Cost** (`lib/pricing.js`): a per-model price table (USD per 1,000,000 tokens) × token usage. `priceForModel` matches exact then longest-substring; `config.pricing` (`{ "<model>": { input, output } }`) overrides/extends the built-in table. `computeCost` returns `null` for an unknown price and `formatCost` renders that as **"unknown"** — never a fake `$0`. `show_cost` defaults **on**; cost appears in the status bar (`setCost`) and in headless `json` output. All cost math and doctor aggregation are unit-tested.
|
|
1201
|
+
|
|
1202
|
+
---
|
|
1203
|
+
|
|
1204
|
+
## Plan Mode (Task 2.5)
|
|
1205
|
+
|
|
1206
|
+
`--plan` (one-shot/headless) and `/plan` (in-chat toggle) gate execution: while active, the agent investigates with read-only tools and proposes a plan, but every **mutating** tool is withheld until the user approves. The mutating-vs-read-only split comes straight from the **permission descriptor** in the tool registry — `describePermission(call)` returns `null` for read-only tools and a descriptor for effectful ones — not from string-matching tool names (`lib/agent.js`). Withheld calls are recorded in the loop's `withheldActions` return and surfaced via the `onPlanWithhold` callback. In chat, `/plan` toggles `ctx.planMode` (threaded into the loop as `getPlanMode`); toggling it back off is the approval — the agent then executes with the plan already in context. `/clear` discards. A `PLAN_MODE_NOTICE` (`lib/prompts.js`) is appended to the system prompt while active.
|
|
1207
|
+
|
|
1208
|
+
---
|
|
1209
|
+
|
|
1210
|
+
## Per-Pattern Permissions (`lib/permission-rules.js`, Task 4.1)
|
|
1211
|
+
|
|
1212
|
+
Rich permission rules that layer **on top of** the coarse `--allow-fs`/`--allow-exec`/`--allow-net` tiers, `--readonly`, and the per-session "always for `<tag>`". A rule matches on a **tool** *and* (optionally) its **arguments** and resolves to one of `allow` / `deny` / `ask`. The whole resolver (`lib/permission-rules.js`) is **pure** and exhaustively unit-tested (`test/permission-rules.test.js`); the gate wiring is proven end-to-end against the mock LLM (`test/permission-rules-agent.test.js`).
|
|
1213
|
+
|
|
1214
|
+
**Rule schema** — under `permissions.rules` in user (`~/.semalt-ai/config.json`) and project (`.semalt/config.json`) config:
|
|
1215
|
+
|
|
1216
|
+
```json
|
|
1217
|
+
{ "permissions": { "rules": [
|
|
1218
|
+
{ "tool": "shell", "pattern": "git *", "action": "allow" },
|
|
1219
|
+
{ "tool": "shell", "pattern": "/curl.*\\| *sh/", "action": "deny" },
|
|
1220
|
+
{ "tool": "write_file", "path": "src/**", "action": "allow" },
|
|
1221
|
+
{ "tool": "read_file", "path": "**/*.env", "action": "ask" },
|
|
1222
|
+
{ "tool": "http_get", "url": "https://internal/*", "action": "allow" }
|
|
1223
|
+
] } }
|
|
1224
|
+
```
|
|
1225
|
+
|
|
1226
|
+
- **`tool`** — required. Matched (as a glob, so `*` / `mcp__*` work) against **both** the canonical action and the public tag (`shell`↔`exec`, `write`↔`write_file`, …).
|
|
1227
|
+
- **One matcher key** — `pattern` (command, greedy glob), `path` (segment-aware glob: `*` stops at `/`, `**` crosses), `url`, or generic `match`. Omit for a tool-only rule. Supplying more than one is malformed.
|
|
1228
|
+
- **Glob vs regex by syntax** — a value wrapped in `/…/` (optional `imsuy` flags) is a **regex**; anything else is a **glob**.
|
|
1229
|
+
- **`action`** — `allow` | `deny` | `ask`.
|
|
1230
|
+
|
|
1231
|
+
**Precedence (total + deterministic).** Within a layer: most-specific rule wins (specificity = literal-char count; a literal `tool` outweighs `*`); among equal specificity, **deny > ask > allow** — so the result is **order-independent**. Across layers the **most-restrictive** decision wins (`deny` > `ask` > `allow` > none). No rule matching → `null`, falling back to the tier/descriptor default.
|
|
1232
|
+
|
|
1233
|
+
**Project can only NARROW (the security core).** `.semalt/config.json` is attacker-controllable (cloned repos). The two layers are loaded **separately** (`loadRuleLayers`, NOT the shallow-merged config) and `resolvePermission` **drops every project `allow` rule before resolution** — structurally, so a project rule can only ever contribute `deny`/`ask` and can never grant a permission the user layer didn't. Proven adversarially (`ADVERSARIAL: project allow(shell *) does NOT grant shell…`).
|
|
1234
|
+
|
|
1235
|
+
**Other load-bearing properties:**
|
|
1236
|
+
- **Canonicalize before matching** — `normalizeCall` resolves `..`, symlinks (`fs.realpathSync`), and absolute/relative forms (matching on both, posix-normalized) so `write(src/../../etc/passwd)` cannot satisfy an `allow` scoped to `src/**`.
|
|
1237
|
+
- **Regex safety / fail closed** — a pathological or invalid pattern is dropped at load (ReDoS heuristic + bounded subject length); a matcher that errors at runtime **never grants** (erroring `allow` → no-match) and **still restricts** (erroring `deny`/`ask` → match); a malformed rule is dropped with a startup warning.
|
|
1238
|
+
- **Compose, never bypass** — rules sit *alongside* the Phase 0 controls. An `allow` rule auto-approves the *gate* but the call still passes through the unbypassable **deny-list** (`agentExecShell`), the **secret-file guard**, **`--readonly`**, and `isPathSafe` in the executors — an `allow` can never re-enable what those forbid (proven by the `COMPOSE:` tests).
|
|
1239
|
+
- **`deny` beats `--dangerously-skip-permissions`** — an explicit user `deny` rule is a fail-closed hard stop honored even under skip (unlike the heuristic deny-list, which skip disables); `allow`/`ask` are subsumed by skip's auto-approve.
|
|
1240
|
+
|
|
1241
|
+
**Integration.** `index.js` loads the layers and passes them to `createPermissionManager({ rules, cwd })`. The agent gate (`lib/agent.js`) calls `permissionManager.resolveRule(call)` for **every** tool call (covering XML *and* native — they converge on the same `[action, ...args]` tuple): `deny` hard-blocks (the model gets the reason and adapts), `allow`/`ask` thread into `askPermission(...)` (allow auto-approves what a tier wouldn't; `ask` forces a prompt a tier would skip — refused in non-TTY). Matched rules surface in `--debug` (a `perm_rule:` row) and the audit log (`rule-denied:<reason>`).
|
|
1242
|
+
|
|
1243
|
+
---
|
|
1244
|
+
|
|
1245
|
+
## Headless Output (`lib/headless.js`, Task 2.4)
|
|
1246
|
+
|
|
1247
|
+
`-p/--print` runs a one-shot agent task non-interactively; `--output-format` selects the surface (and implies `-p`):
|
|
1248
|
+
|
|
1249
|
+
- **text** (default) — current human output.
|
|
1250
|
+
- **json** — a single JSON object `{ result, toolCalls: [...], usage, cost, stopReason, verifyStatus }` to stdout, nothing else.
|
|
1251
|
+
- **stream-json** — newline-delimited JSON events (`{type:'assistant'|'tool'|'result', …}`), one per line, for piping. The terminal `result` event carries `stopReason` and `verifyStatus`.
|
|
1252
|
+
|
|
1253
|
+
**Per-tool rec shape (Phase 6d-ii).** Each per-tool rec — both a stream `{type:'tool', …}` event and a `toolCalls[]` entry (the array entry omits `type`) — carries the **legacy** fields `{ tool, args, ok, ms }` (byte-identical to pre-6d-ii: `tool` = canonical action, `args` = the tuple minus the action, `ok` = `!error`, `ms` = elapsed) **plus** descriptor-derived **additive** fields: `status` (`'ok'|'error'`), `category` (`file`/`shell`/`net`/…), `durationMs` (mirrors `ms`), `target`, `attrs`, `meta`, `error`, `noDuration`, `tag`, and `detail` (the collapsed `{kind:'diff'|'output', payload}` body, `null` on errors). These come from a **sink-local** `buildToolOperation` → `renderOperation(op, {mode:'json'})` build (the same descriptor the interactive sink builds), merged **legacy-last so legacy wins** on any name clash. The merge is strictly additive: no legacy field is removed/renamed/retyped, and the **finalize** top-level key-set is unchanged. Web ops (`web_search`/`http_get`) are ordinary tools here — they emit **N** per-op events, not a collapsed summary (no web-activity collapse on the headless rail).
|
|
1254
|
+
|
|
1255
|
+
Machine modes (`json`/`stream-json`) suppress all chrome via `setUIActive(true)` for the run — the two headless chrome sinks (tools' `_log` ✓/✗ lines and the write/append permission diff) both honor that flag — so stdout stays byte-pure (no ANSI). `runHeadless` takes an injectable `write` sink so the formatter is unit-testable. `cost` is `null` until the price table lands in Task 2.6. Phase 0 safety is unchanged: headless still refuses deny-listed/interactive-approval actions unless `--dangerously-skip-permissions`. Usage: `semalt-code -p --output-format json "your task"` or `semalt-code code -p --output-format stream-json "…"`.
|
|
1256
|
+
|
|
1257
|
+
---
|
|
1258
|
+
|
|
1259
|
+
## Audit Log (`lib/audit.js`)
|
|
1260
|
+
|
|
1261
|
+
Every tool execution is appended to `~/.semalt-ai/audit.log` as NDJSON:
|
|
1262
|
+
```json
|
|
1263
|
+
{"ts":"2026-01-01T00:00:00.000Z","tag":"exec","input":"{\"command\":\"ls\"}","approved":true,"result":"ok"}
|
|
1264
|
+
```
|
|
1265
|
+
|
|
1266
|
+
View the last 50 entries with `semalt-code audit`. Checkpoint activity (Task 4.3) is recorded as a `checkpoint` row (`logCheckpoint`) when prior file state is snapshotted before a mutation and on rewind.
|
|
1267
|
+
|
|
1268
|
+
---
|
|
1269
|
+
|
|
1270
|
+
## Session Storage (`lib/storage.js`)
|
|
1271
|
+
|
|
1272
|
+
Local chat sessions are saved to `~/.semalt-ai/sessions/` as JSON files named `<timestamp>-<id>.json`. Use `/history` in-chat to browse and load any saved local session. To resume a **dashboard** chat by ID, pass `-r/--resume <chat-id>` (loaded via `dashboardGetChat`).
|
|
1273
|
+
|
|
1274
|
+
> **Not auto-resumed.** There is no startup prompt that offers to resume the most recent session (e.g. "< 24 h old"). Resuming is always explicit — `/history` for local sessions, `--resume <id>` for dashboard chats. See **Deferred / Not Yet Implemented**.
|
|
1275
|
+
|
|
1276
|
+
---
|
|
1277
|
+
|
|
1278
|
+
## Metrics (`lib/metrics.js`)
|
|
1279
|
+
|
|
1280
|
+
`Metrics` is instantiated per `runAgentLoop` call and tracks per-turn token usage, latency, and total session duration. A summary box is printed on exit (SIGINT or natural quit) and after `cmdCode` runs. Use `/compact` in-chat to see the live summary.
|
|
1281
|
+
|
|
1282
|
+
### Split context counter (Variant B, display-only)
|
|
1283
|
+
|
|
1284
|
+
The counter shows the real measured context alongside an **estimated** base/working
|
|
1285
|
+
breakdown. The API returns `usage.prompt_tokens` **pre-summed** — it never splits
|
|
1286
|
+
the prompt into base (system prompt + tool specs) vs working (history + tool
|
|
1287
|
+
results) — so the split **cannot be measured; it is estimated**.
|
|
1288
|
+
|
|
1289
|
+
- **Both halves are `char/4` estimates from the SAME estimator** (`estimateContextSplit`
|
|
1290
|
+
in `lib/api.js`), so they sum consistently — the point of **Variant B** (no
|
|
1291
|
+
"real − estimate" mixing where `working` would look measured but secretly carry
|
|
1292
|
+
the base estimate's error). `base = estimate(system messages) + estimate(serialized
|
|
1293
|
+
tool schema)`; `working = estimate(every non-system message)` — the part that grows.
|
|
1294
|
+
- **The real `prompt_tokens` is the anchor of truth, shown WITHOUT a `~`.** The
|
|
1295
|
+
estimated split sits alongside it with a `~` prefix. Status line format:
|
|
1296
|
+
`~12k working · ~5.6k base · 17,600 / 200,000 tok (9%)` (working first; the real
|
|
1297
|
+
total/limit/percent carries no `~`). The Session Summary adds an `Est. split:`
|
|
1298
|
+
row under the measured `Token limit:` row.
|
|
1299
|
+
- **Recomputed PER REQUEST** in `chatStream`'s `finalize()` from the payload
|
|
1300
|
+
ACTUALLY sent (`trimmedMessages` post-retry + `payload.tools`), so it stays
|
|
1301
|
+
correct when MCP connects, plan mode toggles (`PLAN_MODE_NOTICE`), or dynamic
|
|
1302
|
+
tools change the base mid-session — never a frozen value.
|
|
1303
|
+
- **XML mode:** `payload.tools` is absent (tools are embedded in the system prompt
|
|
1304
|
+
string), so estimating the actual system message still captures the tool weight —
|
|
1305
|
+
the base is **never silently zero**.
|
|
1306
|
+
- **Threading:** attached to the `chatStream` result as `context_estimate`
|
|
1307
|
+
(`{ base, working }`) → `metrics.endTurn(usage, model, contextEstimate)` (stored
|
|
1308
|
+
per turn, exposed via `contextBaseEst()`/`contextWorkingEst()`) → `onMetricsUpdate`
|
|
1309
|
+
(`baseEst`/`workingEst`) → `statusBar.updateMetrics`/`_buildTokenField`.
|
|
1310
|
+
- **Headless/JSON/SDK:** `usageFromMetrics` (`lib/headless.js`) adds **additive**
|
|
1311
|
+
`context_base_est` / `context_working_est` fields (last turn) — the existing real
|
|
1312
|
+
`prompt_tokens`/`total_tokens`/`context_tokens` fields are unchanged.
|
|
1313
|
+
- **Display-only:** changes nothing about what's sent to the model or what's
|
|
1314
|
+
counted; it just shows the existing real total split into an honest estimated
|
|
1315
|
+
breakdown. Tested by `test/context-split.test.js` (estimator base/working +
|
|
1316
|
+
sum-consistency + XML-no-tools + per-request recompute incl. MCP-tools-grow and
|
|
1317
|
+
plan-mode-notice; Metrics store/expose; status-line format with `~` on estimates
|
|
1318
|
+
and none on the real total; additive headless fields with no envelope regression).
|
|
1319
|
+
|
|
1320
|
+
---
|
|
1321
|
+
|