@ducci/jarvis 1.0.38 → 1.0.39
- package/docs/agent.md +43 -4
- package/docs/crons.md +100 -0
- package/docs/identity.md +38 -0
- package/docs/skills.md +77 -0
- package/docs/system-prompt.md +25 -13
- package/docs/telegram.md +19 -0
- package/package.json +2 -1
- package/src/server/agent.js +44 -14
- package/src/server/app.js +125 -2
- package/src/server/config.js +43 -0
- package/src/server/cron-scheduler.js +35 -0
- package/src/server/crons.js +106 -0
- package/src/server/tools.js +192 -71
- package/docs/findings/001-context-explosion.md +0 -116
- package/docs/findings/002-handoff-edge-cases.md +0 -84
- package/docs/findings/003-event-loop-blocking-and-reliability.md +0 -120
- package/docs/findings/004-agent-reliability-improvements.md +0 -162
- package/docs/findings/005-installation-timeout.md +0 -128
- package/docs/findings/006-malformed-tool-schema.md +0 -118
- package/docs/findings/007-telegram-errors-and-handoff-stalling.md +0 -271
- package/docs/findings/008-exec-timeout-architecture.md +0 -118
- package/docs/findings/009-non-string-response-field.md +0 -153
- package/docs/findings/010-checkpoint-field-type-safety.md +0 -121
- package/docs/findings/011-empty-model-response.md +0 -157
- package/docs/findings/012-empty-nudge-loses-recovery-text.md +0 -121
- package/docs/findings/013-stderr-visibility-and-truncation.md +0 -59
- package/docs/findings/014-exec-stderr-artifact-and-malformed-tool-args.md +0 -202
- package/docs/findings/015-failed-run-context-strip.md +0 -142
- package/docs/findings/016-file-writing-corruption-and-stderr-loop.md +0 -119
- package/docs/findings/017-looping-intervention-and-lossy-checkpoint.md +0 -110
- package/docs/findings/018-anthropic-oauth-token-support.md +0 -72
|
@@ -1,121 +0,0 @@
|
|
|
1
|
-
# Finding 010: Non-String `checkpoint.remaining` Crashes Zero-Progress Detection
|
|
2
|
-
|
|
3
|
-
**Date:** 2026-03-01
|
|
4
|
-
**Severity:** High — caused "Sorry, something went wrong" in Telegram with no useful context; crashed the handoff loop mid-run
|
|
5
|
-
**Status:** Fixed
|
|
6
|
-
|
|
7
|
-
---
|
|
8
|
-
|
|
9
|
-
## Observed Session
|
|
10
|
-
|
|
11
|
-
The session comprised 13+ agent runs working on an OWASP ZAP installation. Runs 8–13 were consecutive `checkpoint_reached` handoffs. On entry 14 (immediately after entry 13), the server logged:
|
|
12
|
-
|
|
13
|
-
```
|
|
14
|
-
status: error
|
|
15
|
-
response: "An unexpected server error occurred: (run.checkpoint.remaining || "").trim is not a function"
|
|
16
|
-
```
|
|
17
|
-
|
|
18
|
-
The Telegram user received:
|
|
19
|
-
|
|
20
|
-
```
|
|
21
|
-
Sorry, something went wrong: (run.checkpoint.remaining || "").trim is not a function
|
|
22
|
-
```
|
|
23
|
-
|
|
24
|
-
---
|
|
25
|
-
|
|
26
|
-
## Bug Chain
|
|
27
|
-
|
|
28
|
-
### Step 1 — Wrap-up call returns non-string `remaining`
|
|
29
|
-
|
|
30
|
-
At iteration limit, `runAgentLoop` sends the `WRAP_UP_NOTE` and parses the model's JSON response. The model returned `checkpoint.remaining` as a non-string value (array or object) instead of a plain text string. `parsedWrapUp.checkpoint` was stored and returned with no type validation.
|
|
31
|
-
|
|
32
|
-
### Step 2 — Zero-progress detection crashes on `.trim()`
|
|
33
|
-
|
|
34
|
-
In `_runHandleChat`, finding 007 introduced zero-progress detection:
|
|
35
|
-
|
|
36
|
-
```js
|
|
37
|
-
const currentRemaining = (run.checkpoint.remaining || '').trim();
|
|
38
|
-
```
|
|
39
|
-
|
|
40
|
-
The `|| ''` guard only replaces falsy values (such as `null` or `undefined`). A truthy non-string (array, object) passes through the `||` unchanged, so `.trim()` is called on a value that has no such method:
|
|
41
|
-
|
|
42
|
-
```
|
|
43
|
-
TypeError: (run.checkpoint.remaining || "").trim is not a function
|
|
44
|
-
```
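
The failure is easy to reproduce in isolation. The sketch below is illustrative only: `unsafeRemaining` mirrors the guard quoted above, while `safeRemaining` is a hypothetical helper (not part of the package) showing a type check in place of the truthiness check.

```js
// Mirrors the guard from the zero-progress check: only falsy values are replaced.
function unsafeRemaining(checkpoint) {
  return (checkpoint.remaining || '').trim();
}

// Hypothetical alternative: test the type instead of truthiness.
function safeRemaining(checkpoint) {
  return typeof checkpoint.remaining === 'string' ? checkpoint.remaining.trim() : '';
}

const samples = [null, undefined, '  install Java  ', ['install Java'], { step: 'install Java' }];
for (const remaining of samples) {
  let outcome;
  try {
    outcome = JSON.stringify(unsafeRemaining({ remaining }));
  } catch (err) {
    outcome = `throws ${err.name}`;
  }
  console.log(JSON.stringify(remaining), '->', outcome);
}
```

Only the array and object samples throw; the falsy samples and the plain string pass through the guard as intended.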
|
|
45
|
-
|
|
46
|
-
### Step 3 — Outer catch logs the error and re-throws
|
|
47
|
-
|
|
48
|
-
The `try/catch` at the top of the handoff loop caught the TypeError, wrote an `error` status log entry, and re-threw. The Telegram handler surfaced the raw error message.
|
|
49
|
-
|
|
50
|
-
---
|
|
51
|
-
|
|
52
|
-
## Secondary Issues
|
|
53
|
-
|
|
54
|
-
**`resumeContent` (line 520)**: `run.checkpoint.remaining || 'Continue with the task.'` — if `remaining` is a truthy non-string, it would be pushed directly into `session.messages` as the next user message content. The message API expects a string, so this would produce a malformed conversation message.
|
|
55
|
-
|
|
56
|
-
**`failedApproaches` spread (lines 461–463)**: If the model returns `failedApproaches` as a non-array (string, object), `push(...value)` misbehaves. A string spreads into individual characters; a plain object is not iterable, so spreading it throws a `TypeError`.
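
How argument spread treats non-array values can be checked directly. The sketch below is independent of the package's code; the variable names are made up for the example.

```js
// Spreading a string into push() appends one element per character.
const fromString = [];
fromString.push(...'retry apt');
console.log(fromString.length); // one entry per character, not one entry per approach

// Spreading a plain object into a call throws, because plain objects are not iterable.
const fromObject = [];
let spreadError = null;
try {
  fromObject.push(...{ first: 'retry apt' });
} catch (err) {
  spreadError = err;
}
console.log(spreadError instanceof TypeError);
```

Either outcome corrupts or aborts the accumulation step, which is why the value needs to be an array of strings before it is spread.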
|
|
57
|
-
|
|
58
|
-
---
|
|
59
|
-
|
|
60
|
-
## Root Cause
|
|
61
|
-
|
|
62
|
-
Same class of bug as finding 009 (non-string `response` field). Finding 009 hardened `response` and `logSummary` extraction, but the `checkpoint` sub-object fields were not included in that hardening pass. Models — especially smaller/free models under iteration-limit pressure — sometimes return structured data (arrays, objects) in fields the system prompt specifies as plain text strings.
|
|
63
|
-
|
|
64
|
-
---
|
|
65
|
-
|
|
66
|
-
## Fix
|
|
67
|
-
|
|
68
|
-
### `src/server/agent.js` — normalize checkpoint fields at source
|
|
69
|
-
|
|
70
|
-
Added a normalization block immediately inside the `if (parsedWrapUp.checkpoint)` branch, before any checkpoint field is accessed downstream:
|
|
71
|
-
|
|
72
|
-
```js
|
|
73
|
-
const cp = parsedWrapUp.checkpoint;
|
|
74
|
-
// remaining must be a string — used as the next run's resume prompt
|
|
75
|
-
if (typeof cp.remaining !== 'string') {
|
|
76
|
-
cp.remaining = Array.isArray(cp.remaining)
|
|
77
|
-
? cp.remaining.map(String).join('\n')
|
|
78
|
-
: cp.remaining != null ? JSON.stringify(cp.remaining) : '';
|
|
79
|
-
}
|
|
80
|
-
// failedApproaches must be an array of strings — spread into session metadata
|
|
81
|
-
if (!Array.isArray(cp.failedApproaches)) {
|
|
82
|
-
cp.failedApproaches = [];
|
|
83
|
-
} else {
|
|
84
|
-
cp.failedApproaches = cp.failedApproaches.map(item =>
|
|
85
|
-
typeof item === 'string' ? item : JSON.stringify(item)
|
|
86
|
-
);
|
|
87
|
-
}
|
|
88
|
-
```
|
|
89
|
-
|
|
90
|
-
**Array coercion for `remaining`**: when the model returns an array (e.g., `["install Java", "create symlink"]`), elements are joined with newlines rather than JSON-stringified — producing a natural readable resume prompt rather than raw JSON syntax.
|
|
91
|
-
|
|
92
|
-
**Centralized normalization**: fixing at source (right after parse) rather than at each use site means lines 469 and 520 need no change. Any future use of `checkpoint.remaining` or `checkpoint.failedApproaches` is automatically safe.
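
To see the coercion rules on concrete inputs, the normalization can be wrapped in a small function. `normalizeCheckpoint` is a hypothetical name used only for this sketch; in the package the logic is described as living inline in the `if (parsedWrapUp.checkpoint)` branch.

```js
// Hypothetical wrapper around the normalization rules described above.
function normalizeCheckpoint(cp) {
  if (typeof cp.remaining !== 'string') {
    if (Array.isArray(cp.remaining)) {
      // Arrays become a newline-separated, readable resume prompt.
      cp.remaining = cp.remaining.map(String).join('\n');
    } else if (cp.remaining != null) {
      // Other non-null values are serialized as JSON.
      cp.remaining = JSON.stringify(cp.remaining);
    } else {
      cp.remaining = '';
    }
  }
  // Non-arrays are dropped; array items are coerced to strings.
  cp.failedApproaches = Array.isArray(cp.failedApproaches)
    ? cp.failedApproaches.map((item) => (typeof item === 'string' ? item : JSON.stringify(item)))
    : [];
  return cp;
}

const cp = normalizeCheckpoint({
  remaining: ['install Java', 'create symlink'],
  failedApproaches: 'apt install zaproxy',
});
console.log(cp.remaining);
console.log(cp.failedApproaches);
```

Note that a string-valued `failedApproaches` is discarded rather than wrapped in an array; that follows the rule shown in the fix, not a judgment about which behavior is preferable.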
|
|
93
|
-
|
|
94
|
-
### `src/server/agent.js` — update `WRAP_UP_NOTE`
|
|
95
|
-
|
|
96
|
-
Added explicit type constraints to the `remaining` field description and a trailing instruction:
|
|
97
|
-
|
|
98
|
-
```
|
|
99
|
-
"remaining": "What still needs to be done — as a plain text string, never an array or object."
|
|
100
|
-
...
|
|
101
|
-
remaining must be a plain text string. failedApproaches must be a JSON array of strings.
|
|
102
|
-
```
|
|
103
|
-
|
|
104
|
-
---
|
|
105
|
-
|
|
106
|
-
## What Was Not Changed
|
|
107
|
-
|
|
108
|
-
- `agent.js` lines 469 and 520 — no changes needed; normalization at source makes them safe
|
|
109
|
-
- `src/channels/telegram/index.js` — finding 007 and 009 already added `.catch(() => {})` and type guards on delivery
|
|
110
|
-
- `sessions.js`, `tools.js` — no changes needed
|
|
111
|
-
|
|
112
|
-
---
|
|
113
|
-
|
|
114
|
-
## Outcome
|
|
115
|
-
|
|
116
|
-
| Fix | Files changed |
|
|
117
|
-
|-----|--------------|
|
|
118
|
-
| Normalize `checkpoint.remaining` to string and `checkpoint.failedApproaches` to string array at source | `src/server/agent.js` |
|
|
119
|
-
| Add explicit type constraints to WRAP_UP_NOTE | `src/server/agent.js` |
|
|
120
|
-
|
|
121
|
-
**Effect**: instead of a `TypeError` crash mid-handoff-loop, the model's non-string `remaining` value is coerced to a readable string and used as the resume prompt. The session continues normally.
|
|
@@ -1,157 +0,0 @@
|
|
|
1
|
-
# Finding 011: Empty Model Response Causes Generic Telegram Error
|
|
2
|
-
|
|
3
|
-
**Date:** 2026-03-01
|
|
4
|
-
**Severity:** High — user sees generic "please try again" with no actionable information
|
|
5
|
-
**Status:** Fixed
|
|
6
|
-
|
|
7
|
-
---
|
|
8
|
-
|
|
9
|
-
## Observed Session
|
|
10
|
-
|
|
11
|
-
Session `33a50dfe-38ea-4972-adac-498ef0525b0c`, run 16 of 17 (session.jsonl line 16):
|
|
12
|
-
|
|
13
|
-
```
|
|
14
|
-
status=format_error
|
|
15
|
-
model=nvidia/nemotron-3-nano-30b-a3b:free
|
|
16
|
-
iteration=4
|
|
17
|
-
userInput='Ok. Can you please run the shell script now with the domain...'
|
|
18
|
-
logSummary='Model returned non-JSON final response after recovery attempts.'
|
|
19
|
-
rawResponse=''
|
|
20
|
-
response=''
|
|
21
|
-
```
|
|
22
|
-
|
|
23
|
-
The Telegram user received:
|
|
24
|
-
|
|
25
|
-
```
|
|
26
|
-
The agent encountered an error and could not produce a response. Please try again.
|
|
27
|
-
```
|
|
28
|
-
|
|
29
|
-
---
|
|
30
|
-
|
|
31
|
-
## What Happened
|
|
32
|
-
|
|
33
|
-
The agent executed a ZAP scan (`./scan.sh juice-shop.herokuapp.com`). The tool result was a large ZAP startup log, truncated at 4000 characters. Two subsequent tool calls failed:
|
|
34
|
-
|
|
35
|
-
- `pkill -f zaproxy || true` → exit 1 (no process to kill)
|
|
36
|
-
- `zaproxy -help | grep -i shutdown -A5` → failed (`libtiff.so.5` missing)
|
|
37
|
-
|
|
38
|
-
On iteration 4, the model returned `assistantMessage.content = null` with no `tool_calls`. This is the "went silent" case: the model produced neither a response nor another tool call.
|
|
39
|
-
|
|
40
|
-
---
|
|
41
|
-
|
|
42
|
-
## Bug Chain
|
|
43
|
-
|
|
44
|
-
### Step 1 — Model returns null content
|
|
45
|
-
|
|
46
|
-
```js
|
|
47
|
-
let content = assistantMessage.content || '';
|
|
48
|
-
// content = ''
|
|
49
|
-
```
|
|
50
|
-
|
|
51
|
-
### Step 2 — Recovery chain falls through on empty content
|
|
52
|
-
|
|
53
|
-
The existing format recovery chain was designed for *non-empty, non-JSON* responses:
|
|
54
|
-
|
|
55
|
-
1. `JSON.parse('')` → throws
|
|
56
|
-
2. Retry with fallback model (same messages, no nudge) → also `''`
|
|
57
|
-
3. Retry with nudge "Your previous response was not valid JSON" → technically wrong for empty content; model still returns `''`
|
|
58
|
-
4. Give up
|
|
59
|
-
|
|
60
|
-
### Step 3 — Empty `response` propagates to Telegram
|
|
61
|
-
|
|
62
|
-
```js
|
|
63
|
-
response = content; // ''
|
|
64
|
-
```
|
|
65
|
-
|
|
66
|
-
`handleChat` returns `{ response: '', ... }`. In `telegram/index.js`:
|
|
67
|
-
|
|
68
|
-
```js
|
|
69
|
-
const rawResponse = typeof result.response === 'string' ? result.response : ...;
|
|
70
|
-
// rawResponse = ''
|
|
71
|
-
const text = rawResponse.trim() || 'The agent encountered an error...';
|
|
72
|
-
// '' → fallback shown
|
|
73
|
-
```
|
|
74
|
-
|
|
75
|
-
The user sees the generic Telegram fallback instead of any information about what happened or what to do.
|
|
76
|
-
|
|
77
|
-
---
|
|
78
|
-
|
|
79
|
-
## Root Causes
|
|
80
|
-
|
|
81
|
-
**Primary**: The `format_error` path set `response = content` without a fallback for the empty string case. An empty `response` triggers the Telegram handler's last-resort fallback message, giving the user no context.
|
|
82
|
-
|
|
83
|
-
**Secondary**: The format recovery chain was designed for non-empty non-JSON responses. When `content` is empty, the nudge message "Your previous response was not valid JSON" is inaccurate — the model produced nothing, not invalid JSON. A targeted nudge for the empty case increases the chance of recovery.
|
|
84
|
-
|
|
85
|
-
**Model-level cause**: The free model `nvidia/nemotron-3-nano-30b-a3b:free` can fail to produce any output after processing a heavily truncated tool result followed by consecutive tool failures. This is a model quality limitation that the recovery layer must account for.
|
|
86
|
-
|
|
87
|
-
---
|
|
88
|
-
|
|
89
|
-
## Difference from Findings 009 and 010
|
|
90
|
-
|
|
91
|
-
| Finding | Model produces... | Bug manifests at... |
|
|
92
|
-
|---------|-------------------|---------------------|
|
|
93
|
-
| 009 | Non-string `response` field (array/object) | Telegram `.trim()` crash |
|
|
94
|
-
| 010 | Non-string `checkpoint.remaining` | Zero-progress `.trim()` crash |
|
|
95
|
-
| 011 | Empty/null content (no text, no tool calls) | Telegram generic fallback (no crash, but useless to user) |
|
|
96
|
-
|
|
97
|
-
Finding 011 is the third in the same class: model output type does not match what the system expects.
|
|
98
|
-
|
|
99
|
-
---
|
|
100
|
-
|
|
101
|
-
## Fix
|
|
102
|
-
|
|
103
|
-
### `src/server/agent.js` — two changes
|
|
104
|
-
|
|
105
|
-
**1. Empty-content detection with targeted nudge**
|
|
106
|
-
|
|
107
|
-
When `content` is empty, skip the standard recovery chain (designed for non-JSON text) and apply a targeted nudge that accurately describes the situation:
|
|
108
|
-
|
|
109
|
-
```js
|
|
110
|
-
if (!content.trim()) {
|
|
111
|
-
// Model returned no content at all — use a targeted nudge instead of the
|
|
112
|
-
// standard JSON recovery chain (designed for non-empty non-JSON responses).
|
|
113
|
-
try {
|
|
114
|
-
const emptyNudge = [
|
|
115
|
-
...preparedMessages,
|
|
116
|
-
{ role: 'user', content: 'You returned an empty response. ' + FORMAT_NUDGE },
|
|
117
|
-
];
|
|
118
|
-
const nudgeResult = await callModelWithFallback(client, config, emptyNudge, toolDefs);
|
|
119
|
-
const nudgeContent = nudgeResult.choices[0]?.message?.content || '';
|
|
120
|
-
parsed = JSON.parse(nudgeContent);
|
|
121
|
-
content = nudgeContent;
|
|
122
|
-
} catch {
|
|
123
|
-
// Give up — fall through to !parsed handler below
|
|
124
|
-
}
|
|
125
|
-
} else {
|
|
126
|
-
// Non-empty content — use the existing 3-step JSON recovery chain
|
|
127
|
-
try { parsed = JSON.parse(content); } catch {
|
|
128
|
-
// Step 1: fallback model...
|
|
129
|
-
// Step 2: nudge...
|
|
130
|
-
}
|
|
131
|
-
}
|
|
132
|
-
```
|
|
133
|
-
|
|
134
|
-
**2. Non-empty fallback on format_error**
|
|
135
|
-
|
|
136
|
-
```js
|
|
137
|
-
if (!parsed) {
|
|
138
|
-
// Ensure response is never empty so the delivery layer can show something
|
|
139
|
-
// meaningful rather than its generic fallback message.
|
|
140
|
-
response = content.trim() || 'The model did not produce a response. Please try again.';
|
|
141
|
-
logSummary = 'Model returned non-JSON final response after recovery attempts.';
|
|
142
|
-
status = 'format_error';
|
|
143
|
-
return { ... };
|
|
144
|
-
}
|
|
145
|
-
```
|
|
146
|
-
|
|
147
|
-
---
|
|
148
|
-
|
|
149
|
-
## Outcome
|
|
150
|
-
|
|
151
|
-
| Scenario | Before | After |
|
|
152
|
-
|----------|--------|-------|
|
|
153
|
-
| Model returns empty content, nudge succeeds | format_error (3 wasted API calls) | Clean recovery (1 targeted API call) |
|
|
154
|
-
| Model returns empty content, nudge fails | Telegram generic fallback | "The model did not produce a response. Please try again." |
|
|
155
|
-
| Model returns non-JSON text, all recovery fails | Telegram generic fallback (if text was empty) | Raw model output shown to user |
|
|
156
|
-
|
|
157
|
-
**Effect on the debugging session**: instead of the generic Telegram fallback, the user would have received "The model did not produce a response. Please try again." — a clear signal that the model failed, not their message. In the best case, the new targeted nudge would have elicited a valid JSON response.
|
|
@@ -1,121 +0,0 @@
|
|
|
1
|
-
# Finding 012: Empty-Content Nudge Includes Tools and Loses Recovery Text
|
|
2
|
-
|
|
3
|
-
**Date:** 2026-03-02
|
|
4
|
-
**Severity:** Medium — user sees generic error when model produces a partial recovery response
|
|
5
|
-
**Status:** Fixed
|
|
6
|
-
|
|
7
|
-
---
|
|
8
|
-
|
|
9
|
-
## Observed Session
|
|
10
|
-
|
|
11
|
-
Session `21fb43a7-2b11-4208-99fb-e6b54fddc07b`, entry 9 in session.jsonl:
|
|
12
|
-
|
|
13
|
-
```
|
|
14
|
-
status=format_error
|
|
15
|
-
model=nvidia/nemotron-3-nano-30b-a3b:free
|
|
16
|
-
iteration=3
|
|
17
|
-
userInput='Ok. Read the results folder. Is there anything?'
|
|
18
|
-
logSummary='Model returned non-JSON final response after recovery attempts.'
|
|
19
|
-
response='The model did not produce a response. Please try again.'
|
|
20
|
-
```
|
|
21
|
-
|
|
22
|
-
The user received: **"The model did not produce a response. Please try again."**
|
|
23
|
-
|
|
24
|
-
---
|
|
25
|
-
|
|
26
|
-
## What Happened
|
|
27
|
-
|
|
28
|
-
1. The agent executed two tool calls:
|
|
29
|
-
- `list_dir /root/.jarvis/projects/cybersecurity/results` → success
|
|
30
|
-
- `exec "list_dir /root/.jarvis/projects/cybersecurity/results/dviet.de"` → exit 127 (`list_dir: not found`)
|
|
31
|
-
- The model confused the `list_dir` jarvis tool with a shell command
|
|
32
|
-
|
|
33
|
-
2. After the failed exec, the model returned `assistantMessage.content = null` with no `tool_calls` — it "went silent"
|
|
34
|
-
|
|
35
|
-
3. Finding 011's empty-content nudge was triggered
|
|
36
|
-
|
|
37
|
-
4. The nudge **also failed** — no valid JSON response was produced
|
|
38
|
-
|
|
39
|
-
5. The agent fell through to `format_error` with the fallback message
|
|
40
|
-
|
|
41
|
-
---
|
|
42
|
-
|
|
43
|
-
## Bug Chain
|
|
44
|
-
|
|
45
|
-
### Bug 1 — toolDefs included in empty nudge
|
|
46
|
-
|
|
47
|
-
```js
|
|
48
|
-
const nudgeResult = await callModelWithFallback(client, config, emptyNudge, toolDefs);
|
|
49
|
-
```
|
|
50
|
-
|
|
51
|
-
When the model is confused after a tool failure, it may respond to the nudge with **another tool call** instead of text. If it does:
|
|
52
|
-
|
|
53
|
-
```
|
|
54
|
-
nudgeResult.choices[0].message.content = null
|
|
55
|
-
nudgeContent = ''
|
|
56
|
-
JSON.parse('') → throws
|
|
57
|
-
catch: // Give up — content stays ''
|
|
58
|
-
```
|
|
59
|
-
|
|
60
|
-
The model had an opportunity to call more tools instead of producing a text response — the wrong behavior for a recovery nudge.
|
|
61
|
-
|
|
62
|
-
### Bug 2 — content assigned after parse
|
|
63
|
-
|
|
64
|
-
```js
|
|
65
|
-
const nudgeContent = nudgeResult.choices[0]?.message?.content || '';
|
|
66
|
-
parsed = JSON.parse(nudgeContent); // ← throws on non-JSON or empty
|
|
67
|
-
content = nudgeContent; // ← only reached if parse succeeded
|
|
68
|
-
```
|
|
69
|
-
|
|
70
|
-
If the model responds to the nudge with non-empty but non-JSON text (e.g. a plain English answer), `JSON.parse` throws and `content` is **never updated**. The non-JSON text is discarded. The `!parsed` handler then shows the fallback message instead of the model's actual text.
|
|
71
|
-
|
|
72
|
-
---
|
|
73
|
-
|
|
74
|
-
## Difference from Finding 011
|
|
75
|
-
|
|
76
|
-
| Finding | Problem | Trigger |
|
|
77
|
-
|---------|---------|---------|
|
|
78
|
-
| 011 | Empty model response propagates to Telegram | Initial empty content, no recovery chain |
|
|
79
|
-
| 012 | Recovery nudge discards best-effort text; model can respond with tool call | Recovery nudge called with toolDefs + content assigned after parse |
|
|
80
|
-
|
|
81
|
-
Finding 012 is a refinement of the recovery path introduced in Finding 011.
|
|
82
|
-
|
|
83
|
-
---
|
|
84
|
-
|
|
85
|
-
## Fix
|
|
86
|
-
|
|
87
|
-
### `src/server/agent.js` — empty-content nudge block
|
|
88
|
-
|
|
89
|
-
**Before:**
|
|
90
|
-
```js
|
|
91
|
-
const nudgeResult = await callModelWithFallback(client, config, emptyNudge, toolDefs);
|
|
92
|
-
const nudgeContent = nudgeResult.choices[0]?.message?.content || '';
|
|
93
|
-
parsed = JSON.parse(nudgeContent);
|
|
94
|
-
content = nudgeContent;
|
|
95
|
-
```
|
|
96
|
-
|
|
97
|
-
**After:**
|
|
98
|
-
```js
|
|
99
|
-
// No tools: force text response, prevent model from calling another tool
|
|
100
|
-
const nudgeResult = await callModelWithFallback(client, config, emptyNudge, []);
|
|
101
|
-
const nudgeContent = nudgeResult.choices[0]?.message?.content || '';
|
|
102
|
-
// Persist before parsing — if JSON parse throws, content still carries the
|
|
103
|
-
// model's best-effort text so the !parsed handler can show it to the user
|
|
104
|
-
if (nudgeContent.trim()) {
|
|
105
|
-
content = nudgeContent;
|
|
106
|
-
}
|
|
107
|
-
parsed = JSON.parse(nudgeContent);
|
|
108
|
-
```
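
The effect of persisting before parsing can be exercised with a stubbed model. This is a minimal sketch under stated assumptions: `recoverFromEmptyContent` and `plainTextModel` are hypothetical names, and the stub simply returns an object shaped like the `choices[0].message.content` structure used in the snippets above.

```js
// Hypothetical stand-in for the nudge block; callModel replaces the real model call.
async function recoverFromEmptyContent(callModel, messages) {
  let content = '';
  let parsed = null;
  try {
    // An empty tool list is passed so the model is asked for text only.
    const nudgeResult = await callModel(messages, []);
    const nudgeContent = nudgeResult.choices[0]?.message?.content || '';
    // Persist before parsing so non-JSON text survives a parse failure.
    if (nudgeContent.trim()) {
      content = nudgeContent;
    }
    parsed = JSON.parse(nudgeContent);
  } catch {
    // parsed stays null; the caller can still show whatever text was kept.
  }
  return { parsed, content };
}

// Stub that answers the nudge in plain English instead of JSON.
const plainTextModel = async () => ({
  choices: [{ message: { content: 'The results folder is empty.' } }],
});

recoverFromEmptyContent(plainTextModel, []).then(({ parsed, content }) => {
  console.log(parsed, content);
});
```

With the earlier ordering, the same stub would leave `content` empty, and the fallback message would be shown instead of the model's text.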
|
|
109
|
-
|
|
110
|
-
---
|
|
111
|
-
|
|
112
|
-
## Outcome
|
|
113
|
-
|
|
114
|
-
| Nudge response | Before | After |
|
|
115
|
-
|---|---|---|
|
|
116
|
-
| Valid JSON | Clean recovery | Clean recovery (no change) |
|
|
117
|
-
| Non-JSON text | Text discarded, fallback shown | Text shown to user |
|
|
118
|
-
| Tool call (no content) | content='', fallback shown | Less likely; content='', fallback shown |
|
|
119
|
-
| Empty again | content='', fallback shown | content='', fallback shown (no change) |
|
|
120
|
-
|
|
121
|
-
The user in the observed session would have received the model's best-effort text about the results folder contents, rather than "The model did not produce a response. Please try again."
|
|
@@ -1,59 +0,0 @@
|
|
|
1
|
-
# Finding 013: stderr Visibility and Output Truncation
|
|
2
|
-
|
|
3
|
-
## Observed Behaviour
|
|
4
|
-
|
|
5
|
-
During a multi-run ZAP security scan session, the agent repeatedly failed to diagnose and fix the root cause of scan failures. It issued `pkill`/`kill` variants dozens of times, burned through 6 iteration limits and 2 handoffs, and ultimately gave up without producing results.
|
|
6
|
-
|
|
7
|
-
Post-mortem analysis of the debug session logs revealed two compounding problems.
|
|
8
|
-
|
|
9
|
-
## Root Causes
|
|
10
|
-
|
|
11
|
-
### 1. Head-only truncation buried errors at the end of output
|
|
12
|
-
|
|
13
|
-
`MAX_TOOL_RESULT = 4000` was applied as a simple head slice: `resultStr.slice(0, 4000)`. ZAP produces verbose startup logs (JVM init, add-on loading, database migration) that easily exceed 4000 characters. Five separate ZAP exec results were truncated exactly at the limit, cutting off during database migration messages. Any errors that appeared later in the output — after the verbose preamble — were silently dropped before the model ever saw them.
|
|
14
|
-
|
|
15
|
-
### 2. Model ignored stderr even when it was visible
|
|
16
|
-
|
|
17
|
-
In one un-truncated result (810 chars), the critical error was plainly present:
|
|
18
|
-
|
|
19
|
-
```
|
|
20
|
-
g_module_open() failed for libpixbufloader-tiff.so: libtiff.so.5: cannot open shared object file: No such file or directory
|
|
21
|
-
```
|
|
22
|
-
|
|
23
|
-
The model's subsequent response ignored it entirely and concluded only that "no results were found in the output directory." There was no mechanism forcing the model to re-examine stderr before forming its conclusion or retrying.
|
|
24
|
-
|
|
25
|
-
Similarly, all `pkill`/`kill` commands returned `exitCode: 1` with clear stderr — yet the model continued issuing variations of the same commands without diagnosing why process termination was failing.
|
|
26
|
-
|
|
27
|
-
## Fixes
|
|
28
|
-
|
|
29
|
-
### Fix 1 — Head + tail truncation (`agent.js`)
|
|
30
|
-
|
|
31
|
-
Replace head-only truncation with a head+tail strategy:
|
|
32
|
-
|
|
33
|
-
```js
|
|
34
|
-
// Before
|
|
35
|
-
resultStr.slice(0, MAX_TOOL_RESULT) + '\n[...truncated]'
|
|
36
|
-
|
|
37
|
-
// After
|
|
38
|
-
resultStr.slice(0, 2000) + `\n[...${resultStr.length - 4000} chars truncated...]\n` + resultStr.slice(-2000)
|
|
39
|
-
```
|
|
40
|
-
|
|
41
|
-
The first 2000 chars preserve startup context; the last 2000 chars preserve the diagnostic tail where errors typically appear. The marker in the middle shows how much was dropped. The retained output stays at 4000 chars, plus the length of the marker itself.
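
A self-contained version of the head+tail strategy might look like the following. `truncateHeadTail` is a hypothetical name, and the length check is an assumption: the excerpt above shows only the slicing expression, not the condition under which it runs.

```js
const MAX_TOOL_RESULT = 4000;

// Keep the first and last half of the budget and record how much was dropped in between.
function truncateHeadTail(resultStr, max = MAX_TOOL_RESULT) {
  if (resultStr.length <= max) return resultStr; // assumed guard: short results pass through
  const half = Math.floor(max / 2);
  const dropped = resultStr.length - 2 * half;
  const head = resultStr.slice(0, half);
  const tail = resultStr.slice(resultStr.length - half);
  return `${head}\n[...${dropped} chars truncated...]\n${tail}`;
}

// A verbose preamble followed by the one line that matters.
const log = 'startup '.repeat(1000) + 'ERROR: libtiff.so.5: cannot open shared object file';
const shown = truncateHeadTail(log);
console.log(shown.endsWith('cannot open shared object file'));
```

With a head-only slice, the trailing error line in this example would never reach the model.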
|
|
42
|
-
|
|
43
|
-
### Fix 2 — Stderr nudge injection (`agent.js`)
|
|
44
|
-
|
|
45
|
-
After each iteration's tool-call loop, if any tool failed (`status === 'error'`) with non-empty `stderr`, inject a system message:
|
|
46
|
-
|
|
47
|
-
```
|
|
48
|
-
[System: A command failed and produced stderr output. Examine the stderr field in the tool result carefully — it likely describes the root cause of the failure. Do not retry the same command without first addressing what stderr reports.]
|
|
49
|
-
```
|
|
50
|
-
|
|
51
|
-
This creates an active forcing function — the model cannot continue to the next iteration without the nudge appearing in its context. The nudge is suppressed if loop detection already fired (to avoid contradictory instructions).
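
The decision of when to inject the nudge can be sketched as a pure function. The names `buildStderrNudge`, `toolResults`, and `loopDetected` are hypothetical, and the nudge wording is adapted from the message quoted above; how the package actually tracks these values is not shown in this finding.

```js
// Wording adapted from the system message quoted above.
const STDERR_NUDGE =
  '[System: A command failed and produced stderr output. Examine the stderr field in the tool result carefully. ' +
  'It likely describes the root cause of the failure. ' +
  'Do not retry the same command without first addressing what stderr reports.]';

// Returns the nudge text, or null when no nudge should be injected.
function buildStderrNudge(toolResults, loopDetected) {
  // Suppressed when loop detection already fired, to avoid contradictory instructions.
  if (loopDetected) return null;
  const failedWithStderr = toolResults.some(
    (r) => r && r.status === 'error' && typeof r.stderr === 'string' && r.stderr.trim() !== ''
  );
  return failedWithStderr ? STDERR_NUDGE : null;
}

console.log(buildStderrNudge([{ status: 'error', exitCode: 1, stderr: 'libtiff.so.5: cannot open' }], false) !== null);
console.log(buildStderrNudge([{ status: 'ok', exitCode: 0, stderr: 'npm WARN deprecated' }], false));
```

The second call returns `null` because the command reported success, so stderr output from successful commands does not trigger the nudge.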
|
|
52
|
-
|
|
53
|
-
## Known Gap
|
|
54
|
-
|
|
55
|
-
The stderr nudge only fires when `toolFailed` is true (i.e., `exec` returned `status: 'error'`). Commands that return `exitCode: 0` but still emit meaningful errors to stderr (e.g., a shell script that succeeds but a subprocess inside it fails) will not trigger the nudge. Catching that case without generating noise from normal stderr usage (npm warnings, apt-get progress) requires more context than is available at this level. Documented as a known limitation.
|
|
56
|
-
|
|
57
|
-
## Files Changed
|
|
58
|
-
|
|
59
|
-
- `src/server/agent.js` — truncation strategy + stderr nudge injection
|
|
@@ -1,202 +0,0 @@
|
|
|
1
|
-
# Finding 014: exec stderr Artifact and Malformed Tool Call Arguments
|
|
2
|
-
|
|
3
|
-
**Date:** 2026-03-02
|
|
4
|
-
**Severity:** Medium — caused spurious context noise (spurious nudges), agent confusion on malformed args, and silent loss of failedApproaches across user turns
|
|
5
|
-
**Status:** Fixed
|
|
6
|
-
|
|
7
|
-
---
|
|
8
|
-
|
|
9
|
-
## Observed Session
|
|
10
|
-
|
|
11
|
-
Session `d97070f7-e50f-4d9e-a38b-b68e7b27e7b7`. User asked Jarvis to set up an OWASP ZAP web security scanning project. The session ran 4 runs (40 total iterations) without completing the task. The snap-installed ZAP does not include `zap-baseline.sh` (which only ships in the ZAP Docker image); the agent never discovered that `zaproxy -cmd -quickurl` is the correct snap-native CLI equivalent.
|
|
12
|
-
|
|
13
|
-
Three compounding issues were identified that degraded the agent's ability to self-correct.
|
|
14
|
-
|
|
15
|
-
---
|
|
16
|
-
|
|
17
|
-
## Issue 1: exec Tool Injects Node.js Error Message into `stderr` Field
|
|
18
|
-
|
|
19
|
-
### What happened
|
|
20
|
-
|
|
21
|
-
Commands like `which zap-cli` (not installed), `grep` returning no matches, and piped `find | grep` with no results all return exit code 1. In each case the actual process wrote nothing to stderr. But the exec tool result showed:
|
|
22
|
-
|
|
23
|
-
```json
|
|
24
|
-
{"status":"error","exitCode":1,"stdout":"","stderr":"Command failed: which zap-cli\n"}
|
|
25
|
-
```
|
|
26
|
-
|
|
27
|
-
The `"Command failed: ..."` string is not from the process — it is `e.message` from Node.js's `execAsync` error object, injected via `e.stderr || e.message`.
|
|
28
|
-
|
|
29
|
-
The stderr nudge check in `agent.js` fires on any non-empty `resultObj.stderr`:
|
|
30
|
-
|
|
31
|
-
```js
|
|
32
|
-
if (resultObj && resultObj.stderr) {
|
|
33
|
-
stderrErrorInIteration = true;
|
|
34
|
-
}
|
|
35
|
-
```
|
|
36
|
-
|
|
37
|
-
This triggered the nudge "Examine the stderr field carefully — it likely describes the root cause of the failure" 3–4 times per run, when there was nothing actionable to examine in stderr.
|
|
38
|
-
|
|
39
|
-
### Root cause
|
|
40
|
-
|
|
41
|
-
In the exec seed tool (`src/server/tools.js`, line 76):
|
|
42
|
-
|
|
43
|
-
```js
|
|
44
|
-
return { status: "error", exitCode: e.code || 1, stdout: e.stdout || "", stderr: e.stderr || e.message };
|
|
45
|
-
```
|
|
46
|
-
|
|
47
|
-
The `|| e.message` fallback was designed to show something when `e.stderr` is empty. But it conflates process-generated stderr with Node.js meta-messages about the process exiting non-zero.
|
|
48
|
-
|
|
49
|
-
### Fix
|
|
50
|
-
|
|
51
|
-
```js
|
|
52
|
-
// Before:
|
|
53
|
-
return { status: "error", exitCode: e.code || 1, stdout: e.stdout || "", stderr: e.stderr || e.message };
|
|
54
|
-
|
|
55
|
-
// After:
|
|
56
|
-
return { status: "error", exitCode: e.code || 1, stdout: e.stdout || "", stderr: e.stderr || "" };
|
|
57
|
-
```
|
|
58
|
-
|
|
59
|
-
`status: error` and `exitCode` already signal failure. The "Command failed: ..." Node.js string is not diagnostic and should not appear in the stderr field.
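
The difference between the two return statements shows up as soon as an error object with empty `stderr` is mapped to a result. In this sketch `toExecResult` is a hypothetical helper, and `whichFailure` is a hand-built object shaped like the error a promisified `child_process.exec` rejects with when a command exits 1 without writing to stderr.

```js
// Hypothetical helper mapping a rejected exec error to the tool's result shape.
function toExecResult(e) {
  return {
    status: 'error',
    exitCode: e.code || 1,
    stdout: e.stdout || '',
    stderr: e.stderr || '', // no fallback to e.message
  };
}

// Hand-built stand-in for the rejection produced by `which zap-cli`.
const whichFailure = {
  code: 1,
  stdout: '',
  stderr: '',
  message: 'Command failed: which zap-cli\n',
};

console.log(toExecResult(whichFailure));
```

Because `stderr` is now empty for such commands, a check like `if (resultObj && resultObj.stderr)` no longer fires on them.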
|
|
60
|
-
|
|
61
|
-
**File**: `src/server/tools.js` — exec seed tool code (propagates via seedTools() on next server start)
|
|
62
|
-
|
|
63
|
-
---
|
|
64
|
-
|
|
65
|
-
## Issue 2: Malformed Tool Call JSON Arguments Silently Swallowed
|
|
66
|
-
|
|
67
|
-
### What happened
|
|
68
|
-
|
|
69
|
-
The model sent a tool call with malformed arguments (missing opening `{`):
|
|
70
|
-
|
|
71
|
-
```json
|
|
72
|
-
{"name": "exec", "arguments": "\"cmd\": \"find /snap/zaproxy/current -type f -name '*.sh' | head -10\"}"}
|
|
73
|
-
```
|
|
74
|
-
|
|
75
|
-
`JSON.parse` threw, the catch block silently used `toolArgs = {}`, and the exec tool failed with:
|
|
76
|
-
|
|
77
|
-
```
|
|
78
|
-
{"status":"error","exitCode":"ERR_INVALID_ARG_TYPE","stdout":"","stderr":"The \"command\" argument must be of type string. Received undefined"}
|
|
79
|
-
```
|
|
80
|
-
|
|
81
|
-
The model saw `ERR_INVALID_ARG_TYPE` / "Received undefined" — no indication that the JSON formatting was wrong. The stderr nudge also fired (Issue 1 compounding: `err.message` put in stderr).
|
|
82
|
-
|
|
83
|
-
### Root cause
|
|
84
|
-
|
|
85
|
-
In `src/server/agent.js`:
|
|
86
|
-
|
|
87
|
-
```js
|
|
88
|
-
try {
|
|
89
|
-
toolArgs = JSON.parse(toolCall.function.arguments || '{}');
|
|
90
|
-
} catch {
|
|
91
|
-
toolArgs = {};
|
|
92
|
-
}
|
|
93
|
-
```
|
|
94
|
-
|
|
95
|
-
The JSON parse error is swallowed. The tool is called with empty args. The resulting type error is cryptic and doesn't tell the model to fix its JSON.
|
|
96
|
-
|
|
97
|
-
### Fix
|
|
98
|
-
|
|
99
|
-
Detect the parse failure early, push an explicit error tool result, and skip execution:
|
|
100
|
-
|
|
101
|
-
```js
|
|
102
|
-
let toolArgs;
|
|
103
|
-
let argParseError = null;
|
|
104
|
-
try {
|
|
105
|
-
toolArgs = JSON.parse(toolCall.function.arguments || '{}');
|
|
106
|
-
} catch (e) {
|
|
107
|
-
argParseError = e;
|
|
108
|
-
}
|
|
109
|
-
|
|
110
|
-
if (argParseError) {
|
|
111
|
-
const errorContent = JSON.stringify({
|
|
112
|
-
status: 'error',
|
|
113
|
-
error: `Tool arguments could not be parsed as JSON: ${argParseError.message}. Ensure arguments are a valid JSON object, e.g. {"key": "value"}.`,
|
|
114
|
-
});
|
|
115
|
-
session.messages.push({ role: 'tool', tool_call_id: toolCall.id, content: errorContent });
|
|
116
|
-
runToolCalls.push({ name: toolName, args: {}, status: 'error', result: errorContent });
|
|
117
|
-
consecutiveFailures++;
|
|
118
|
-
continue;
|
|
119
|
-
}
|
|
120
|
-
```
|
|
121
|
-
|
|
122
|
-
The model immediately sees "Tool arguments could not be parsed as JSON" instead of an opaque `ERR_INVALID_ARG_TYPE`. It can fix its JSON and retry.
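
The parsing step can be isolated into a helper that either returns the arguments or an error message the model can act on. `parseToolArgs` is a hypothetical name, and the check for non-object JSON is an extra assumption that goes beyond the fix shown above.

```js
// Hypothetical helper: returns { args } on success or { error } with an actionable message.
function parseToolArgs(rawArguments) {
  let args;
  try {
    args = JSON.parse(rawArguments || '{}');
  } catch (e) {
    return {
      error: `Tool arguments could not be parsed as JSON: ${e.message}. Ensure arguments are a valid JSON object, e.g. {"key": "value"}.`,
    };
  }
  // Extra check (an assumption, not part of the fix above): valid JSON that is not an object.
  if (args === null || typeof args !== 'object' || Array.isArray(args)) {
    return { error: 'Tool arguments must be a JSON object, e.g. {"key": "value"}.' };
  }
  return { args };
}

// The malformed payload from the observed session (missing opening brace).
const malformed = '"cmd": "find /snap/zaproxy/current -type f -name \'*.sh\' | head -10"}';
console.log(parseToolArgs(malformed).error);
console.log(parseToolArgs('{"cmd": "ls"}').args);
```

The exact text of `e.message` depends on the JavaScript engine version, so callers should not match on it.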
|
|
123
|
-
|
|
124
|
-
**File**: `src/server/agent.js` — inner tool execution loop
|
|
125
|
-
|
|
126
|
-
---
|
|
127
|
-
|
|
128
|
-
## Issue 3: Accumulated failedApproaches Cleared on New User Message
|
|
129
|
-
|
|
130
|
-
### What happened
|
|
131
|
-
|
|
132
|
-
In multi-run sessions where multiple checkpoint handoffs accumulate `session.metadata.failedApproaches`, the next user message resets this list to `[]`:
|
|
133
|
-
|
|
134
|
-
```js
|
|
135
|
-
session.metadata.handoffCount = 0;
|
|
136
|
-
session.metadata.failedApproaches = [];
|
|
137
|
-
```

This was designed to give the model a clean slate after human review. But "Ok do it" is not a review; it is a continuation. The model loses knowledge of what was already tried and can repeat the same failed strategies in the new round of runs.

(Note: in this specific session, run 1 ended with `ok`, so `failedApproaches` was empty at reset time anyway. But in sessions where checkpoint runs accumulate a list, the reset discards it entirely.)

### Fix

Embed accumulated `failedApproaches` into the incoming user message before resetting:

```js
let userMessageWithContext = userMessage;
if (session.metadata.failedApproaches && session.metadata.failedApproaches.length > 0) {
userMessageWithContext += `\n\n[System: The following approaches were tried and failed in previous runs — consider them exhausted:\n${session.metadata.failedApproaches.map((a, i) => `${i + 1}. ${a}`).join('\n')}]`;
}
session.messages.push({ role: 'user', content: userMessageWithContext });
session.metadata.handoffCount = 0;
session.metadata.failedApproaches = [];
```

The model enters the new round with awareness of what has already been exhausted. If the user's message implies a fresh task, the model can ignore the list; if it's a continuation, it benefits from the context.
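
The embedding step can also be written as a pure function, which makes the empty-list case easy to check in isolation. This is a minimal sketch under assumptions: `withFailedApproaches` is a hypothetical name, the wording of the bracketed note is paraphrased, and the sample approaches are placeholder data.

```js
// Hypothetical helper: appends a numbered list of failed approaches to the
// user's message, or returns the message unchanged when there is nothing to add.
function withFailedApproaches(userMessage, failedApproaches) {
  if (!Array.isArray(failedApproaches) || failedApproaches.length === 0) {
    return userMessage;
  }
  const list = failedApproaches.map((a, i) => `${i + 1}. ${a}`).join('\n');
  return `${userMessage}\n\n[System: The following approaches were tried and failed in previous runs. Consider them exhausted:\n${list}]`;
}

// Placeholder session state for illustration only.
const session = {
  messages: [],
  metadata: { handoffCount: 2, failedApproaches: ['approach A', 'approach B'] },
};

session.messages.push({
  role: 'user',
  content: withFailedApproaches('Ok do it', session.metadata.failedApproaches),
});
// Reset only after the list has been embedded in the message.
session.metadata.handoffCount = 0;
session.metadata.failedApproaches = [];

console.log(session.messages[0].content);
```

Keeping the reset after the push preserves the ordering the fix depends on: the list is consumed first and cleared second.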

**File**: `src/server/agent.js` (`_runHandleChat`, before user message push)

---

## Issue 4: WRAP_UP_NOTE Did Not Require Verified Progress Claims

### What happened

Runs 2 and 3 claimed to have created project directories and files in the `progress` field of their checkpoints, but their tool calls contained no file creation commands. The model fabricated progress, causing subsequent resume messages to start from false premises.

### Fix

Updated the `progress` field description in `WRAP_UP_NOTE`:

```
// Before:
"progress": "What has been fully completed so far.",

// After:
"progress": "What has been fully completed — only include items confirmed by tool output (e.g., successful exec with exit code 0, or verified by ls/cat). Do not report planned steps as completed.",
```

**File**: `src/server/agent.js` (`WRAP_UP_NOTE` constant)

---

## Secondary Observations (Not Fixed)

**zap-baseline.sh does not exist in snap ZAP**: This script is part of the ZAP Docker image. The snap installation provides `zaproxy -cmd -quickurl <url> -quickout <file>` instead, which is visible in `zaproxy -help` output. The agent saw this output but never connected it to the need for a baseline scan. This is a model knowledge gap.

**Zero-progress detection correctly did not fire**: Runs 2 and 3 had genuinely different `remaining` strings (run 3 claimed partial progress). The detection works as designed; the circumvention was through model hallucination of progress, not a code bug.

**failedApproaches from `ok`-status runs are not captured**: Run 1 ended with `status: ok` despite having 10 iterations of failed searches. The `failedApproaches` mechanism only captures failures from `checkpoint_reached` runs. Capturing failures from `ok`-status runs would require the model to include a `failedApproaches` field in its final response, a more significant protocol change left for a future finding.

---

## Files Changed

| File | Change |
|------|--------|
| `src/server/tools.js` | exec seed tool: `e.stderr \|\| ''` instead of `e.stderr \|\| e.message` |
| `src/server/agent.js` | Malformed JSON args: inject error tool result instead of silent `{}` |
| `src/server/agent.js` | Preserve failedApproaches in user message before resetting |
| `src/server/agent.js` | Strengthen WRAP_UP_NOTE `progress` field description |