@ducci/jarvis 1.0.29 → 1.0.31

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,59 @@
# Finding 013 — stderr Visibility and Output Truncation

## Observed Behaviour

During a multi-run ZAP security scan session, the agent repeatedly failed to diagnose and fix the root cause of scan failures. It issued `pkill`/`kill` variants dozens of times, burned through 6 iteration limits and 2 handoffs, and ultimately gave up without producing results.

Post-mortem analysis of the debug session logs revealed two compounding problems.

## Root Causes

### 1. Head-only truncation buried errors at the end of output

`MAX_TOOL_RESULT = 4000` was applied as a simple head slice: `resultStr.slice(0, 4000)`. ZAP produces verbose startup logs (JVM init, add-on loading, database migration) that easily exceed 4000 characters. Five separate ZAP exec results were truncated exactly at the limit, cutting off during database migration messages. Any errors that appeared later in the output — after the verbose preamble — were silently dropped before the model ever saw them.

### 2. Model ignored stderr even when it was visible

In one un-truncated result (810 chars), the critical error was plainly present:

```
g_module_open() failed for libpixbufloader-tiff.so: libtiff.so.5: cannot open shared object file: No such file or directory
```

The model's subsequent response ignored it entirely and concluded only that "no results were found in the output directory." There was no mechanism forcing the model to re-examine stderr before forming its conclusion or retrying.

Similarly, all `pkill`/`kill` commands returned `exitCode: 1` with clear stderr — yet the model continued issuing variations of the same commands without diagnosing why process termination was failing.

## Fixes

### Fix 1 — Head + tail truncation (`agent.js`)

Replace head-only truncation with a head+tail strategy:

```js
// Before
resultStr.slice(0, MAX_TOOL_RESULT) + '\n[...truncated]'

// After
resultStr.slice(0, 2000) + `\n[...${resultStr.length - 4000} chars truncated...]\n` + resultStr.slice(-2000)
```

The first 2000 chars preserve startup context; the last 2000 chars preserve the diagnostic tail where errors typically appear. The marker in the middle shows how much was dropped. Total budget stays at 4000 chars.
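
The same strategy as a standalone, runnable sketch (the function name `truncateHeadTail` and the factored-out constants are illustrative; the shipped change is the inline expression shown above):

```js
// Illustrative sketch of the head+tail truncation strategy.
// The names here are for demonstration only, not the actual agent.js code.
const HEAD_CHARS = 2000;
const TAIL_CHARS = 2000;
const MAX_TOOL_RESULT = HEAD_CHARS + TAIL_CHARS;

function truncateHeadTail(resultStr) {
  if (resultStr.length <= MAX_TOOL_RESULT) return resultStr;
  const dropped = resultStr.length - MAX_TOOL_RESULT;
  // Keep the startup context (head) and the diagnostic tail,
  // with an explicit marker for how much was cut in between.
  return resultStr.slice(0, HEAD_CHARS)
    + `\n[...${dropped} chars truncated...]\n`
    + resultStr.slice(-TAIL_CHARS);
}
```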

### Fix 2 — Stderr nudge injection (`agent.js`)

After each iteration's tool-call loop, if any tool failed (`status === 'error'`) with non-empty `stderr`, inject a system message:

```
[System: A command failed and produced stderr output. Examine the stderr field in the tool result carefully — it likely describes the root cause of the failure. Do not retry the same command without first addressing what stderr reports.]
```

This creates an active forcing function — the model cannot continue to the next iteration without the nudge appearing in its context. The nudge is suppressed if loop detection already fired (to avoid contradictory instructions).
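
A simplified, self-contained model of the mechanism (the function wrapper and the `session`/`toolResults` shapes are illustrative; in `agent.js` the flag is set inside the tool loop and checked after it):

```js
// Simplified model of the stderr nudge injection. The wrapper function
// and argument shapes are illustrative, not the actual agent.js code.
const STDERR_NUDGE = '[System: A command failed and produced stderr output. '
  + 'Examine the stderr field in the tool result carefully — it likely describes '
  + 'the root cause of the failure. Do not retry the same command without first '
  + 'addressing what stderr reports.]';

function maybeInjectStderrNudge(session, toolResults, loopDetected) {
  // Flag the iteration if any failed tool carried non-empty stderr.
  const stderrErrorInIteration = toolResults.some(
    (r) => r.status === 'error' && r.stderr
  );
  // Suppress the nudge when loop detection already fired,
  // to avoid contradictory instructions in the same context.
  if (stderrErrorInIteration && !loopDetected) {
    session.messages.push({ role: 'user', content: STDERR_NUDGE });
  }
}
```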

## Known Gap

The stderr nudge only fires when `toolFailed` is true (i.e., `exec` returned `status: 'error'`). Commands that return `exitCode: 0` but still emit meaningful errors to stderr (e.g., a shell script that succeeds but a subprocess inside it fails) will not trigger the nudge. Catching that case without generating noise from normal stderr usage (npm warnings, apt-get progress) requires more context than is available at this level. Documented as a known limitation.

## Files Changed

- `src/server/agent.js` — truncation strategy + stderr nudge injection
@@ -0,0 +1,202 @@
# Finding 014: exec stderr Artifact and Malformed Tool Call Arguments

**Date:** 2026-03-02
**Severity:** Medium — caused spurious context noise (unwarranted stderr nudges), agent confusion on malformed args, and silent loss of failedApproaches across user turns
**Status:** Fixed

---

## Observed Session

Session `d97070f7-e50f-4d9e-a38b-b68e7b27e7b7`. User asked Jarvis to set up an OWASP ZAP web security scanning project. The session ran 4 runs (40 total iterations) without completing the task. The snap-installed ZAP does not include `zap-baseline.sh` (which only ships in the ZAP Docker image); the agent never discovered that `zaproxy -cmd -quickurl` is the correct snap-native CLI equivalent.

Four compounding issues were identified that degraded the agent's ability to self-correct.

---

## Issue 1: exec Tool Injects Node.js Error Message into `stderr` Field

### What happened

Commands like `which zap-cli` (not installed), `grep` returning no matches, and piped `find | grep` with no results all return exit code 1. In each case the actual process wrote nothing to stderr. But the exec tool result showed:

```json
{"status":"error","exitCode":1,"stdout":"","stderr":"Command failed: which zap-cli\n"}
```

The `"Command failed: ..."` string is not from the process — it is `e.message` from Node.js's `execAsync` error object, injected via `e.stderr || e.message`.

The stderr nudge check in `agent.js` fires on any non-empty `resultObj.stderr`:

```js
if (resultObj && resultObj.stderr) {
  stderrErrorInIteration = true;
}
```

This triggered the nudge "Examine the stderr field carefully — it likely describes the root cause of the failure" 3–4 times per run, when there was nothing actionable to examine in stderr.

### Root cause

In the exec seed tool (`src/server/tools.js`, line 76):

```js
return { status: "error", exitCode: e.code || 1, stdout: e.stdout || "", stderr: e.stderr || e.message };
```

The `|| e.message` fallback was designed to show something when `e.stderr` is empty. But it conflates process-generated stderr with Node.js meta-messages about the process exiting non-zero.

### Fix

```js
// Before:
return { status: "error", exitCode: e.code || 1, stdout: e.stdout || "", stderr: e.stderr || e.message };

// After:
return { status: "error", exitCode: e.code || 1, stdout: e.stdout || "", stderr: e.stderr || "" };
```

`status: error` and `exitCode` already signal failure. The "Command failed: ..." Node.js string is not diagnostic and should not appear in the stderr field.

**File**: `src/server/tools.js` — exec seed tool code (propagates via seedTools() on next server start)

---

## Issue 2: Malformed Tool Call JSON Arguments Silently Swallowed

### What happened

The model sent a tool call with malformed arguments (missing opening `{`):

```json
{"name": "exec", "arguments": "\"cmd\": \"find /snap/zaproxy/current -type f -name '*.sh' | head -10\"}"}
```

`JSON.parse` threw, the catch block silently fell back to `toolArgs = {}`, and the exec tool failed with:

```
{"status":"error","exitCode":"ERR_INVALID_ARG_TYPE","stdout":"","stderr":"The \"command\" argument must be of type string. Received undefined"}
```

The model saw `ERR_INVALID_ARG_TYPE` / "Received undefined" — no indication that its JSON formatting was wrong. The stderr nudge also fired, compounding Issue 1: Node's `e.message` had been injected into the stderr field.

### Root cause

In `src/server/agent.js`:

```js
try {
  toolArgs = JSON.parse(toolCall.function.arguments || '{}');
} catch {
  toolArgs = {};
}
```

The JSON parse error is swallowed. The tool is called with empty args. The resulting type error is cryptic and doesn't tell the model to fix its JSON.

### Fix

Detect the parse failure early, push an explicit error tool result, and skip execution:

```js
let toolArgs;
let argParseError = null;
try {
  toolArgs = JSON.parse(toolCall.function.arguments || '{}');
} catch (e) {
  argParseError = e;
}

if (argParseError) {
  const errorContent = JSON.stringify({
    status: 'error',
    error: `Tool arguments could not be parsed as JSON: ${argParseError.message}. Ensure arguments are a valid JSON object, e.g. {"key": "value"}.`,
  });
  session.messages.push({ role: 'tool', tool_call_id: toolCall.id, content: errorContent });
  runToolCalls.push({ name: toolName, args: {}, status: 'error', result: errorContent });
  consecutiveFailures++;
  continue;
}
```

The model immediately sees "Tool arguments could not be parsed as JSON" instead of an opaque `ERR_INVALID_ARG_TYPE`. It can fix its JSON and retry.

**File**: `src/server/agent.js` — inner tool execution loop
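
The parse-and-report path can be sketched as a small standalone helper (the `parseToolArgs` name and return shape are illustrative; `agent.js` inlines this logic in the tool loop):

```js
// Illustrative helper: parse the raw arguments string and surface the
// JSON error to the model instead of silently falling back to {}.
function parseToolArgs(rawArguments) {
  try {
    return { toolArgs: JSON.parse(rawArguments || '{}'), errorContent: null };
  } catch (e) {
    return {
      toolArgs: null,
      errorContent: JSON.stringify({
        status: 'error',
        error: `Tool arguments could not be parsed as JSON: ${e.message}. `
          + 'Ensure arguments are a valid JSON object, e.g. {"key": "value"}.',
      }),
    };
  }
}
```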

---

## Issue 3: Accumulated failedApproaches Cleared on New User Message

### What happened

In multi-run sessions where multiple checkpoint handoffs accumulate `session.metadata.failedApproaches`, the next user message resets this list to `[]`:

```js
session.metadata.handoffCount = 0;
session.metadata.failedApproaches = [];
```

This was designed to give the model a clean slate after human review. But "Ok do it" is not a review — it's a continuation. The model loses knowledge of what was already tried and can repeat the same failed strategies in the new round of runs.

(Note: in this specific session run 1 ended with `ok`, so `failedApproaches` was empty at reset time anyway. But in sessions where checkpoint runs accumulate a list, the reset discards it entirely.)

### Fix

Embed accumulated `failedApproaches` into the incoming user message before resetting:

```js
let userMessageWithContext = userMessage;
if (session.metadata.failedApproaches && session.metadata.failedApproaches.length > 0) {
  userMessageWithContext += `\n\n[System: The following approaches were tried and failed in previous runs — consider them exhausted:\n${session.metadata.failedApproaches.map((a, i) => `${i + 1}. ${a}`).join('\n')}]`;
}
session.messages.push({ role: 'user', content: userMessageWithContext });
session.metadata.handoffCount = 0;
session.metadata.failedApproaches = [];
```

The model enters the new round with awareness of what has already been exhausted. If the user's message implies a fresh task, the model can ignore the list; if it's a continuation, it benefits from the context.

**File**: `src/server/agent.js` — `_runHandleChat`, before user message push
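
The embedding step as a standalone helper (the `withFailedApproaches` name is illustrative; `_runHandleChat` inlines this logic before the reset):

```js
// Illustrative helper: fold the accumulated failedApproaches into the
// incoming user message so the knowledge survives the metadata reset.
function withFailedApproaches(userMessage, failedApproaches) {
  if (!failedApproaches || failedApproaches.length === 0) return userMessage;
  const list = failedApproaches.map((a, i) => `${i + 1}. ${a}`).join('\n');
  return userMessage
    + `\n\n[System: The following approaches were tried and failed in previous runs — consider them exhausted:\n${list}]`;
}
```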
160
+
161
+ ---
162
+
163
+ ## Issue 4: WRAP_UP_NOTE Did Not Require Verified Progress Claims
164
+
165
+ ### What happened
166
+
167
+ Runs 2 and 3 claimed to have created project directories and files in the `progress` field of their checkpoints — but their tool calls contained no file creation commands. The model fabricated progress, causing subsequent resume messages to start from false premises.
168
+
169
+ ### Fix
170
+
171
+ Updated the `progress` field description in `WRAP_UP_NOTE`:
172
+
173
+ ```
174
+ // Before:
175
+ "progress": "What has been fully completed so far.",
176
+
177
+ // After:
178
+ "progress": "What has been fully completed — only include items confirmed by tool output (e.g., successful exec with exit code 0, or verified by ls/cat). Do not report planned steps as completed.",
179
+ ```
180
+
181
+ **File**: `src/server/agent.js` — `WRAP_UP_NOTE` constant
182
+
183
+ ---
184
+
185
+ ## Secondary Observations (Not Fixed)
186
+
187
+ **zap-baseline.sh does not exist in snap ZAP**: This script is part of the ZAP Docker image. The snap installation provides `zaproxy -cmd -quickurl <url> -quickout <file>` instead, which is visible in `zaproxy -help` output. The agent saw this output but never connected it to the need for a baseline scan. This is a model knowledge gap.
188
+
189
+ **Zero-progress detection correctly did not fire**: Runs 2 and 3 had genuinely different `remaining` strings (run 3 claimed partial progress). The detection works as designed; the circumvention was through model hallucination of progress, not a code bug.
190
+
191
+ **failedApproaches from `ok`-status runs are not captured**: Run 1 ended with `status: ok` despite having 10 iterations of failed searches. The `failedApproaches` mechanism only captures failures from `checkpoint_reached` runs. Capturing failures from `ok`-status runs would require the model to include a `failedApproaches` field in its final response — a more significant protocol change left for a future finding.
192
+
193
+ ---
194
+
195
+ ## Files Changed
196
+
197
+ | File | Change |
198
+ |------|--------|
199
+ | `src/server/tools.js` | exec seed tool: `e.stderr \|\| ''` instead of `e.stderr \|\| e.message` |
200
+ | `src/server/agent.js` | Malformed JSON args: inject error tool result instead of silent `{}` |
201
+ | `src/server/agent.js` | Preserve failedApproaches in user message before resetting |
202
+ | `src/server/agent.js` | Strengthen WRAP_UP_NOTE `progress` field description |
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "@ducci/jarvis",
- "version": "1.0.29",
+ "version": "1.0.31",
  "description": "A fully automated agent system that lives on a server.",
  "main": "./src/index.js",
  "type": "module",
@@ -18,7 +18,7 @@ Respond with your normal JSON, but add a checkpoint field:
  "response": "Brief message to the user that the task is still in progress.",
  "logSummary": "Human-readable summary of what happened in this run.",
  "checkpoint": {
- "progress": "What has been fully completed so far.",
+ "progress": "What has been fully completed — only include items confirmed by tool output (e.g., successful exec with exit code 0, or verified by ls/cat). Do not report planned steps as completed.",
  "remaining": "What still needs to be done to finish the task — as a plain text string, never an array or object.",
  "failedApproaches": ["Concise description of each approach that was tried and failed, e.g. 'downloading subfinder via curl from GitHub releases — connection reset'. Omit array entries for things that succeeded. Leave as empty array if nothing failed."]
  }
@@ -139,13 +139,26 @@ async function runAgentLoop(client, config, session, prepareMessages) {
  });
 
  let toolsModified = false;
+ let stderrErrorInIteration = false;
  for (const toolCall of assistantMessage.tool_calls) {
  const toolName = toolCall.function.name;
  let toolArgs;
+ let argParseError = null;
  try {
  toolArgs = JSON.parse(toolCall.function.arguments || '{}');
- } catch {
- toolArgs = {};
+ } catch (e) {
+ argParseError = e;
+ }
+
+ if (argParseError) {
+ const errorContent = JSON.stringify({
+ status: 'error',
+ error: `Tool arguments could not be parsed as JSON: ${argParseError.message}. Ensure arguments are a valid JSON object, e.g. {"key": "value"}.`,
+ });
+ session.messages.push({ role: 'tool', tool_call_id: toolCall.id, content: errorContent });
+ runToolCalls.push({ name: toolName, args: {}, status: 'error', result: errorContent });
+ consecutiveFailures++;
+ continue;
  }
 
  let result;
@@ -165,6 +178,9 @@ async function runAgentLoop(client, config, session, prepareMessages) {
  const toolFailed = toolStatus === 'error' || (resultObj && resultObj.status === 'error');
  if (toolFailed) {
  consecutiveFailures++;
+ if (resultObj && resultObj.stderr) {
+ stderrErrorInIteration = true;
+ }
  } else {
  consecutiveFailures = 0;
  }
@@ -173,7 +189,7 @@ async function runAgentLoop(client, config, session, prepareMessages) {
  runToolCalls.push({ name: toolName, args: toolArgs, status: toolStatus, result: resultStr });
 
  const sessionContent = resultStr.length > MAX_TOOL_RESULT
- ? resultStr.slice(0, MAX_TOOL_RESULT) + '\n[...truncated]'
+ ? resultStr.slice(0, 2000) + `\n[...${resultStr.length - 4000} chars truncated...]\n` + resultStr.slice(-2000)
  : resultStr;
  session.messages.push({
  role: 'tool',
@@ -201,6 +217,13 @@ async function runAgentLoop(client, config, session, prepareMessages) {
  });
  }
 
+ if (stderrErrorInIteration && !loopDetected) {
+ session.messages.push({
+ role: 'user',
+ content: '[System: A command failed and produced stderr output. Examine the stderr field in the tool result carefully — it likely describes the root cause of the failure. Do not retry the same command without first addressing what stderr reports.]',
+ });
+ }
+
  // Reload tools if any were created/updated this iteration
  if (toolsModified) {
  tools = await loadTools();
@@ -427,8 +450,15 @@ async function _runHandleChat(config, sessionId, userMessage) {
  session = createSession(systemPromptTemplate);
  }
 
+ // Preserve accumulated failedApproaches in conversation history before resetting
+ // so the model retains knowledge of what failed in the previous batch of handoff runs.
+ let userMessageWithContext = userMessage;
+ if (session.metadata.failedApproaches && session.metadata.failedApproaches.length > 0) {
+ userMessageWithContext += `\n\n[System: The following approaches were tried and failed in previous runs — consider them exhausted:\n${session.metadata.failedApproaches.map((a, i) => `${i + 1}. ${a}`).join('\n')}]`;
+ }
+
  // Append user message and reset handoff state
- session.messages.push({ role: 'user', content: userMessage });
+ session.messages.push({ role: 'user', content: userMessageWithContext });
  session.metadata.handoffCount = 0;
  session.metadata.failedApproaches = [];
 
@@ -73,7 +73,7 @@ const SEED_TOOLS = {
  });
  return { status: "ok", exitCode: 0, stdout, stderr };
  } catch (e) {
- return { status: "error", exitCode: e.code || 1, stdout: e.stdout || "", stderr: e.stderr || e.message };
+ return { status: "error", exitCode: e.code || 1, stdout: e.stdout || "", stderr: e.stderr || "" };
  }
  `,
  },