@ducci/jarvis 1.0.13 → 1.0.15

@@ -0,0 +1,116 @@
# Finding 001: Context Window Explosion via Tool Output Accumulation

**Date:** 2026-02-26
**Severity:** High — renders the session completely unusable after enough handoffs
**Status:** Fixed

---

## What Happened

A session was started with the question *"Hast du Zugriff auf deinen source code? Wo liegt er?"* ("Do you have access to your source code? Where is it?"). The agent began exploring the filesystem using `exec` and `list_dir`, running commands like `cat agent.js`, `cat tools.js`, `cat app.js`, and various `find` commands.

The task required more than 10 iterations to complete, so the checkpoint/handoff mechanism fired. The agent ran 6 consecutive handoff runs before hitting `maxHandoffs` and stopping with `intervention_required`.

By that point the session `conversation.json` had grown to **687 KB**. On the very next user message (*"Why?"*), both the primary and fallback models returned a `400 Provider returned error`. The session was permanently broken — no further messages could be processed.

---

## Root Cause

Two compounding problems:

### 1. Tool output stored verbatim, without size limit

`exec` returns raw `stdout` from shell commands. When the model runs `cat agent.js` (440 lines, ~22 000 chars), that entire output gets stored in `session.messages` as a `role: "tool"` message. Every subsequent model request in that run — and in all future runs — sends this content in full.

There was no cap anywhere on tool result content. A single run of 10 iterations with a few `cat` calls could easily produce 100–200 KB of tool messages.
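
For reference, a minimal sketch of the pre-fix write path, simplified from the `runAgentLoop` diff at the bottom of this document:

```js
// Inside the tool-call loop of runAgentLoop (simplified).
// Pre-fix behaviour: the raw result was pushed into the session verbatim,
// so a single `cat agent.js` added ~22 KB that every later request re-sent.
const resultStr = typeof result === 'string' ? result : JSON.stringify(result);

session.messages.push({
  role: 'tool',
  tool_call_id: toolCall.id,
  content: resultStr, // no size cap
});
```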

### 2. Handoff runs accumulated on top of each other

When the iteration limit is hit, the checkpoint/handoff mechanism pushes `checkpoint.remaining` as a new user message and starts a fresh agent run — but on top of the **same, growing** `session.messages` array. Each of the 6 handoff runs added another 10 iterations of tool call messages to the history. Nothing was ever removed.
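
A minimal sketch of the pre-fix handoff loop in `handleChat` (simplified; the real loop also tracks handoff counts and statuses):

```js
// Pre-fix behaviour: every handoff run appends to the same messages array,
// so each run's tool traffic stays in the context forever.
while (true) {
  const run = await runAgentLoop(client, config, session, prepareMessages);
  if (run.status !== 'checkpoint_reached') break; // finished, or needs intervention

  // Resume with checkpoint.remaining as the new prompt; nothing is removed.
  session.messages.push({ role: 'user', content: run.checkpoint.remaining });
}
```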

After 6 runs × ~10 iterations × multiple `cat` commands each, the context reached approximately 170 000 tokens — exceeding the free model's 128 000 token limit. The `400` was the provider rejecting the oversized request.
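
That estimate is easy to sanity-check with the common rough heuristic of ~4 characters per token for English text and code:

```js
// Rough heuristic only: ~4 characters per token.
const approxTokens = (bytes) => Math.round(bytes / 4);
approxTokens(687 * 1024); // ≈ 175 872, comfortably past a 128 000-token limit
```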

### Why the `400` appeared on the *next* user message, not during the run

The session's final run hit `maxHandoffs` and stopped. At that point the context was already at or near the limit. When the user sent a new message, the full bloated history was loaded and sent again — this time slightly over the limit — causing the rejection.

---

## Model Context Windows (for reference)

| Model | Context Window |
|---|---|
| arcee-ai/trinity-large-preview:free | ~128 000 tokens |
| Claude Sonnet 4.6 | 200 000 tokens |
| Gemini 2.5 Pro / 2.0 Flash | 1 000 000 tokens |

A larger model would have delayed the failure, but not prevented it. The conversation would still grow without bound.

---

## What We Considered

**Truncate tool results in `prepareMessages`** — works, but runs on every loop iteration and is the wrong place conceptually. The content is already stored in full in the session before `prepareMessages` is ever called.

**Naive sliding window (drop oldest N messages)** — breaks the OpenRouter/OpenAI API contract. Every `role: "tool"` message must be paired with the assistant message containing the matching `tool_call_id`. Slicing arbitrarily through the message array orphans tool results and causes a `400` — the exact error we're trying to fix.
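
To make the constraint concrete, here is an illustrative sketch (not the adopted fix) of what a pairing-aware window would have to do: treat an assistant message carrying `tool_calls` plus all of its `role: "tool"` replies as one atomic group, and only ever cut the array at group boundaries.

```js
// Illustrative only; not the adopted fix. Trims the oldest history while
// keeping each assistant-with-tool_calls message and its tool replies
// together, so no tool result is orphaned from its tool_call_id.
function trimAtGroupBoundary(messages, keepLast) {
  let cut = messages.length; // everything from `cut` onwards is kept
  let kept = 0;
  while (cut > 1 && kept < keepLast) {
    let start = cut - 1;
    // Back up over the whole tool group: the tool replies plus the
    // assistant message that issued the calls.
    while (start > 1 && messages[start].role === 'tool') start--;
    kept += cut - start;
    cut = start;
  }
  // Index 0 is assumed to be the system prompt and is always kept.
  return [messages[0], ...messages.slice(cut)];
}
```

Getting every edge case of this right is exactly the complexity the simpler fix below avoids.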

**Token budget / summarisation** — more adaptive but significantly more complex. Requires either token counting per model or an extra LLM call. Overkill for v1.

---

## Fix

Two targeted changes to `src/server/agent.js`.

### 1. Cap tool result content at write time (`MAX_TOOL_RESULT = 4000`)

Right where a tool result is pushed to `session.messages`, cap the content to 4 000 characters. The full result is still recorded in `runToolCalls` and therefore written to the JSONL session log — no information is lost for debugging. Only what the model sees is limited.

```js
const sessionContent = resultStr.length > MAX_TOOL_RESULT
  ? resultStr.slice(0, MAX_TOOL_RESULT) + '\n[...truncated]'
  : resultStr;
session.messages.push({ role: 'tool', tool_call_id: toolCall.id, content: sessionContent });
```

4 000 chars is ~80 lines of code or a full `ls -la` listing — enough for the model to reason about any output. If more detail is needed, the model should use targeted commands (`grep`, `head`, `tail`) rather than `cat`-ing entire files.
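
At the same rough ~4 characters per token heuristic, each capped result is at most about 1 000 tokens, so even a tool-heavy 10-iteration run now adds a bounded, predictable amount of context.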

### 2. Strip intermediate tool messages before each handoff

Before calling `runAgentLoop`, snapshot `session.messages.length` as `runStartIndex`. If the run ends with `checkpoint_reached`, splice out all messages added during that run *except the final wrap-up assistant response*, then push `checkpoint.remaining` as the new user message.

```js
const runStartIndex = session.messages.length;
const run = await runAgentLoop(...);

// on checkpoint_reached, before resuming:
session.messages.splice(runStartIndex, session.messages.length - runStartIndex - 1);
session.messages.push({ role: 'user', content: run.checkpoint.remaining });
```

**Before** (after 6 handoffs):
```
[system] [user: question] [assistant/tool ×10] [wrap-up] [user: checkpoint1]
[assistant/tool ×10] [wrap-up] [user: checkpoint2]
[assistant/tool ×10] [wrap-up] ... → 687 KB
```

**After** (after 6 handoffs):
```
[system] [user: question] [wrap-up] [user: checkpoint1]
[wrap-up] [user: checkpoint2]
[wrap-up] ... → ~5 KB
```

Each handoff now adds 2 messages instead of 20+. The wrap-up message carries the relevant state (what was done, what remains) so the model is not flying blind — it just doesn't have the raw tool noise from previous runs.

---

## Outcome

- Sessions with long-running tasks no longer grow the context without bound.
- The JSONL session log is unaffected — full tool outputs are always written there.
- The model can still access previous run output via `read_session_log` if needed.
- A follow-up message after a completed multi-handoff task will no longer receive a `400`.

@@ -0,0 +1,84 @@
# Finding 002: Handoff Edge Cases Found During Review of Finding 001

**Date:** 2026-02-26
**Severity:** Medium
**Status:** Fixed

---

## Context

While reviewing the fix for [Finding 001](./001-context-explosion.md), two edge cases in the handoff system were found. Neither caused problems in the observed debugging session, but both could cause failures under specific conditions.

---

## Issue A: `checkpoint.remaining` could be `null`, causing a 400 on the next iteration

### What could happen

When the iteration limit is hit, the agent asks the model for a wrap-up response that includes a `checkpoint` field:

```json
{
  "response": "...",
  "logSummary": "...",
  "checkpoint": {
    "progress": "...",
    "remaining": "..."
  }
}
```

The server then pushes `checkpoint.remaining` as a user message to start the next run:

```js
session.messages.push({ role: 'user', content: run.checkpoint.remaining });
```

Weaker or free models occasionally omit required fields or set them to `null`. If `remaining` is `null`, the session gets a `{ role: 'user', content: null }` message. Most providers reject a null content field with a `400 Bad Request` on the next model call — the same error that surfaced in Finding 001, but from a different cause.

### Fix

```js
session.messages.push({ role: 'user', content: run.checkpoint.remaining || 'Continue with the task.' });
```
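
One side effect worth noting: `||` (rather than `??`) also replaces an empty string, which is arguably what you want here, since an empty `remaining` is as useless a prompt as a missing one.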

---

## Issue B: `intervention_required` did not strip tool history before saving

### What could happen

The tool history strip introduced in Finding 001 runs right before pushing `checkpoint.remaining` for the next run. But the `intervention_required` path (max handoffs exceeded) breaks out of the loop *before* reaching the strip:

```js
if (session.metadata.handoffCount > config.maxHandoffs) {
  // ... log and set status ...
  break; // ← strip never ran
}

// strip only reached here, after the if-block
session.messages.splice(runStartIndex, session.messages.length - runStartIndex - 1);
```

This meant a session that hit the handoff limit was saved with the full tool history of the last run still in it. When the user sends a new message after `intervention_required`, the model receives all of that accumulated tool history — the same context bloat risk as before the fix in Finding 001.

### Fix

Strip the tool history inside the `intervention_required` branch, before breaking:

```js
if (session.metadata.handoffCount > config.maxHandoffs) {
  // ... log and set status ...
  session.messages.splice(runStartIndex, session.messages.length - runStartIndex - 1);
  break;
}
```

The wrap-up assistant message (last in the array) is preserved — it gives the model context about what was attempted when the user resumes.

---

## Why these weren't caught earlier

Both issues only manifest under specific conditions (a model omitting a field; hitting `maxHandoffs` exactly). The debugging session in Finding 001 stopped at `intervention_required` after 6 handoffs, but the `400` error on the next message was attributed to the overall context size, masking the fact that the strip hadn't run for that final run.

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "@ducci/jarvis",
-   "version": "1.0.13",
+   "version": "1.0.15",
    "description": "A fully automated agent system that lives on a server.",
    "main": "./src/index.js",
    "type": "module",

package/src/index.js CHANGED
@@ -180,6 +180,30 @@ program
      }
    });

+ program
+   .command('restart')
+   .description('Restart the Jarvis server (starts it if not running).')
+   .action(async () => {
+     preflight();
+     try {
+       await connectPm2();
+       const desc = await pm2Describe().catch(() => []);
+       const isRunning = desc.length > 0 && desc[0].pm2_env?.status === 'online';
+       if (isRunning) {
+         await pm2Restart();
+         console.log('Jarvis server restarted.');
+       } else {
+         await pm2Start();
+         console.log('Jarvis server started.');
+       }
+       pm2.disconnect();
+     } catch (e) {
+       console.error('Failed to restart Jarvis server:', e.message);
+       pm2.disconnect();
+       process.exit(1);
+     }
+   });
+
  program
    .command('status')
    .description('Display the status of the Jarvis server.')

package/src/server/agent.js CHANGED
@@ -8,6 +8,7 @@ import chalk from 'chalk';

  const FORMAT_NUDGE = 'Your previous response was not valid JSON. Respond only with the required JSON object: {"response": "...", "logSummary": "..."}';
  const LOOP_DETECTION_THRESHOLD = 3;
+ const MAX_TOOL_RESULT = 4000;

  const WRAP_UP_NOTE = `[System: You have reached the iteration limit. This is your final response for this run.
  Respond with your normal JSON, but add a checkpoint field:
@@ -151,10 +152,13 @@ async function runAgentLoop(client, config, session, prepareMessages) {
      const resultStr = typeof result === 'string' ? result : JSON.stringify(result);
      runToolCalls.push({ name: toolName, args: toolArgs, status: toolStatus, result: resultStr });

+     const sessionContent = resultStr.length > MAX_TOOL_RESULT
+       ? resultStr.slice(0, MAX_TOOL_RESULT) + '\n[...truncated]'
+       : resultStr;
      session.messages.push({
        role: 'tool',
        tool_call_id: toolCall.id,
-       content: resultStr,
+       content: sessionContent,
      });

      const callKey = `${toolName}|${JSON.stringify(toolArgs)}|${resultStr}`;
@@ -344,6 +348,7 @@ export async function handleChat(config, requestSessionId, userMessage) {
    // Handoff loop
    try {
      while (true) {
+       const runStartIndex = session.messages.length;
        const run = await runAgentLoop(client, config, session, prepareMessages);
        allToolCalls.push(...run.runToolCalls);

@@ -405,11 +410,20 @@ export async function handleChat(config, requestSessionId, userMessage) {
            logSummary: 'Max handoffs exceeded. Human intervention required.',
            status: 'intervention_required',
          });
+         // Strip tool history even when stopping — prevents context bloat on the
+         // next user message when human intervention resumes the session.
+         session.messages.splice(runStartIndex, session.messages.length - runStartIndex - 1);
          break;
        }

-       // Resume with checkpoint.remaining as new prompt
-       session.messages.push({ role: 'user', content: run.checkpoint.remaining });
+       // Strip intermediate tool messages from this run before resuming.
+       // Keep only the wrap-up assistant response (last message added by runAgentLoop);
+       // it summarises what was done and is far cheaper context than the raw tool history.
+       session.messages.splice(runStartIndex, session.messages.length - runStartIndex - 1);
+
+       // Resume with checkpoint.remaining as new prompt.
+       // Guard against null/undefined in case the model omitted the field.
+       session.messages.push({ role: 'user', content: run.checkpoint.remaining || 'Continue with the task.' });
      }
    } catch (e) {
      const errorLog = {