@ducci/jarvis 1.0.38 → 1.0.40
- package/docs/agent.md +43 -4
- package/docs/crons.md +100 -0
- package/docs/identity.md +38 -0
- package/docs/skills.md +77 -0
- package/docs/system-prompt.md +25 -13
- package/docs/telegram.md +61 -2
- package/package.json +2 -1
- package/src/channels/telegram/index.js +65 -0
- package/src/server/agent.js +59 -19
- package/src/server/app.js +125 -2
- package/src/server/config.js +43 -0
- package/src/server/cron-scheduler.js +35 -0
- package/src/server/crons.js +106 -0
- package/src/server/tools.js +234 -72
- package/docs/findings/001-context-explosion.md +0 -116
- package/docs/findings/002-handoff-edge-cases.md +0 -84
- package/docs/findings/003-event-loop-blocking-and-reliability.md +0 -120
- package/docs/findings/004-agent-reliability-improvements.md +0 -162
- package/docs/findings/005-installation-timeout.md +0 -128
- package/docs/findings/006-malformed-tool-schema.md +0 -118
- package/docs/findings/007-telegram-errors-and-handoff-stalling.md +0 -271
- package/docs/findings/008-exec-timeout-architecture.md +0 -118
- package/docs/findings/009-non-string-response-field.md +0 -153
- package/docs/findings/010-checkpoint-field-type-safety.md +0 -121
- package/docs/findings/011-empty-model-response.md +0 -157
- package/docs/findings/012-empty-nudge-loses-recovery-text.md +0 -121
- package/docs/findings/013-stderr-visibility-and-truncation.md +0 -59
- package/docs/findings/014-exec-stderr-artifact-and-malformed-tool-args.md +0 -202
- package/docs/findings/015-failed-run-context-strip.md +0 -142
- package/docs/findings/016-file-writing-corruption-and-stderr-loop.md +0 -119
- package/docs/findings/017-looping-intervention-and-lossy-checkpoint.md +0 -110
- package/docs/findings/018-anthropic-oauth-token-support.md +0 -72
@@ -1,271 +0,0 @@

# Finding 007: Telegram Error Opacity, Empty Responses, and Handoff Stalling

**Date:** 2026-02-28
**Severity:** High: caused "Sorry, something went wrong" with no context, silent empty responses, and 40+ wasted iterations on a stuck task
**Status:** Fixed

---

## Observed Session

The session (`fdb3fb46`) ran on a Linux server using `nvidia/nemotron-3-nano-30b-a3b:free`. The user asked Jarvis to implement a cybersecurity scanning project and run a scan against `https://dviet.de`. The session produced 10 agent runs over approximately 2 hours, including 4 consecutive handoff runs that made no real progress, two "Sorry, something went wrong" errors in Telegram, and one completely silent (empty) Telegram message.

---

## What Happened: Full Run Sequence

| Run | Trigger | Status | Telegram received |
|-----|---------|--------|-------------------|
| 1 | "Hi" | ok | "Hello Duc! 👋" |
| 2 | "Do you know where the cybersecurity project is located?" (sent in German) | ok (9 iterations) | Location found |
| 3 | "Can you read the readme..." (sent in German) | ok | README analysis |
| 4 | "Yes implement the missing pieces..." | **format_error** | **Empty message (silent)** |
| 5 | "What exactly went wrong?" (handoff 1) | checkpoint_reached | none (internal handoff loop) |
| 6 | handoff 2 | checkpoint_reached | none (internal) |
| 7 | handoff 3 | checkpoint_reached | none (internal) |
| 8 | handoff 4 | **model_error** | **"Sorry, something went wrong: ..."** |
| 9 | "What is your session id" | ok | Session ID |
| 10 | "Do exec ls" | **model_error** | **"Sorry, something went wrong: ..."** |

---

## Issue 1: Generic "Sorry, something went wrong" with no context

### What happened

Runs 8 and 10 failed with `model_error: Empty choices array`: the free NVIDIA model returned a response with `choices: []`, producing no content at all. When `handleChat` threw an error, the Telegram channel's catch block sent:

```
Sorry, something went wrong. Please try again.
```

This is maximally unhelpful. The user has no idea whether the model failed, whether a tool crashed, whether the session is broken, or whether retrying will help.

### Root cause

The catch block in `src/channels/telegram/index.js` used a hardcoded string regardless of the actual error:

```js
} catch (e) {
  console.error(`[telegram] agent error chat_id=${chatId}: ${e.message}`);
  await ctx.reply('Sorry, something went wrong. Please try again.');
  clearInterval(typingInterval);
  return;
}
```

The error message was logged to `console.error` (only visible in server logs, not to the user) and discarded.

### Fix (`src/channels/telegram/index.js`)

Pass `e.message` to the user reply:

```js
const errText = e.message
  ? `Sorry, something went wrong: ${e.message}`
  : 'Sorry, something went wrong. Please try again.';
await ctx.reply(errText).catch(() => {});
```

The `.catch(() => {})` guards against a second failure when the Telegram API itself is unreachable. Without it, a failed `ctx.reply` inside the catch block would throw an unhandled rejection.

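
The same idea can be written as two small helpers, which makes the fallback text and the "never throw from the error path" rule easy to check in isolation. This is a minimal sketch, not the package's code: `buildErrorText` and `replyWithError` are hypothetical names, and `ctx` is assumed only to expose a `reply(text)` method that returns a Promise, as in the snippet above.

```javascript
// Hypothetical helpers for illustration; not taken from the package source.
function buildErrorText(err) {
  // Use the error detail only when it is a non-empty string.
  const detail = err && typeof err.message === 'string' ? err.message.trim() : '';
  return detail
    ? `Sorry, something went wrong: ${detail}`
    : 'Sorry, something went wrong. Please try again.';
}

async function replyWithError(ctx, err) {
  try {
    await ctx.reply(buildErrorText(err));
    return true;
  } catch (replyErr) {
    // The reply itself failed (for example, the API is unreachable).
    // Log and swallow it so the failure cannot escape the original catch block.
    console.error(`[telegram] could not deliver error reply: ${replyErr.message}`);
    return false;
  }
}
```

Returning a boolean rather than silently discarding the second failure keeps it visible to the caller and to the server log, while still guaranteeing that the error path itself does not reject.
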
---

## Issue 2: Empty Telegram message on `format_error`

### What happened

Run 4 ended with `format_error`: the model produced a non-JSON final response and all three recovery attempts (fallback model, nudge retry) also failed. The agent returned with `response: ""` (empty string). The Telegram handler then called:

```js
const text = result.response;  // ""
await ctx.reply(text);         // ctx.reply(""): Telegram rejects empty messages
```

The user saw nothing. No error, no confirmation, no indication that anything had happened. From their perspective the message was sent and never received a reply.

### Root cause

The delivery block in `src/channels/telegram/index.js` used `result.response` directly without guarding against empty or null values. When `format_error` returns an empty string, `ctx.reply("")` is called and Telegram's API rejects it with an HTTP 400 that never reaches the user (swallowed by grammy).

Additionally, the error log in the delivery catch block used `result.response.length`, which would throw a `TypeError` if `result.response` was `null` rather than `""`.

### Fix (`src/channels/telegram/index.js`)

Guard with `?.trim()` and a fallback message:

```js
const text = result.response?.trim()
  || 'The agent encountered an error and could not produce a response. Please try again.';
```

The delivery catch block was also updated so that it no longer references `result.response.length` (which crashes on null).

---

## Issue 3: `failedApproaches` memory lost across handoff runs

### What happened

Runs 5, 6, 7, and 8 were all consecutive handoff runs triggered by a single user message ("What exactly went wrong?"). Each run correctly produced a `failedApproaches` array in its checkpoint, for example "nuclei scan command timed out". However, the resume message for the next run was built using only the **current run's** `failedApproaches`:

```js
if (run.checkpoint.failedApproaches && run.checkpoint.failedApproaches.length > 0) {
  resumeContent += `\n\n[System: The following approaches were tried and failed in the previous run — ...]`
  // ↑ only the last run's failures, not all previous runs
}
```

Run 6 started knowing about run 5's failures. Run 7 started knowing about run 6's failures, but had forgotten run 5's. Run 8 started knowing about run 7's failures, but had forgotten runs 5 and 6. The model could only see one run of history at a time and kept rediscovering and re-attempting strategies it had already tried.

The session JSONL log shows runs 5, 6, 7, and 8 all executing nearly identical tool call sequences:

- `nuclei -update-templates`
- `nuclei -h`
- `mkdir -p results/dviet.de`
- `which node`
- `ls -al /usr/local/share/nuclei` → error (directory doesn't exist)
- `echo -e '#!/bin/bash...'` → broken scan.sh
- `nuclei -silent ... -u https://dviet.de` → **60s timeout**
- `nmap -sV -p 80,443 dviet.de` → host down
- `nuclei -silent -t http-title ...` → **60s timeout**

Run 8 alone made 32 tool calls across only 3 model iterations: the model was dumping the same 10-call batch per iteration, learning nothing.

### Fix (`src/server/agent.js`, `src/server/sessions.js`)

**1.** Added `failedApproaches: []` to `session.metadata` in `createSession()`. Sessions created before this change have a graceful upgrade path: the field is initialized to `[]` on the first new user message, which resets handoff state.

**2.** On every new user message, `failedApproaches` is reset alongside `handoffCount`:

```js
session.metadata.handoffCount = 0;
session.metadata.failedApproaches = [];
```

**3.** After each `checkpoint_reached` run, the current run's failures are pushed onto the session-level accumulator:

```js
if (run.checkpoint.failedApproaches && run.checkpoint.failedApproaches.length > 0) {
  if (!session.metadata.failedApproaches) session.metadata.failedApproaches = [];
  session.metadata.failedApproaches.push(...run.checkpoint.failedApproaches);
}
```

**4.** The resume message uses the full accumulated list instead of just the last run's:

```js
const allFailedApproaches = session.metadata.failedApproaches || [];
if (allFailedApproaches.length > 0) {
  resumeContent += `\n\n[System: The following approaches were tried and failed in previous runs — do not repeat them:\n${allFailedApproaches.map((a, i) => `${i + 1}. ${a}`).join('\n')}]`;
}
```

The wording changed from "in the **previous run**" to "in **previous runs**" to reflect the wider scope.

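
Because each run can report a failure that an earlier run already reported, a plain `push(...)` may grow the list with duplicates. The sketch below shows one way the accumulate-then-render flow could look with de-duplication added. The de-duplication step, the helper names `accumulateFailedApproaches` and `buildFailedApproachesNote`, and the second sample failure string are assumptions for illustration; they are not part of the fix described above.

```javascript
// Hypothetical helpers; de-duplication is an assumption, not part of the original fix.
function accumulateFailedApproaches(metadata, checkpoint) {
  const incoming = Array.isArray(checkpoint && checkpoint.failedApproaches)
    ? checkpoint.failedApproaches
    : [];
  if (!Array.isArray(metadata.failedApproaches)) metadata.failedApproaches = [];
  for (const item of incoming) {
    const text = String(item).trim();
    // Skip empty entries and anything an earlier run already reported.
    if (text && !metadata.failedApproaches.includes(text)) {
      metadata.failedApproaches.push(text);
    }
  }
  return metadata.failedApproaches;
}

function buildFailedApproachesNote(approaches) {
  if (approaches.length === 0) return '';
  const numbered = approaches.map((a, i) => `${i + 1}. ${a}`).join('\n');
  return `\n\n[System: The following approaches were tried and failed in previous runs. Do not repeat them:\n${numbered}]`;
}

// Two simulated handoff runs; the second repeats the first run's failure.
const metadata = { failedApproaches: [] };
accumulateFailedApproaches(metadata, { failedApproaches: ['nuclei scan command timed out'] });
accumulateFailedApproaches(metadata, {
  failedApproaches: ['nuclei scan command timed out', 'scan.sh fails with "-e: not found"'],
});
console.log(buildFailedApproachesNote(metadata.failedApproaches));
```

Exact-string de-duplication only catches verbatim repeats; a model that rephrases the same failure will still add a new entry, so the list may need a size cap in practice.
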
---

## Issue 4: Zero-progress handoffs not detected

### What happened

Runs 5, 6, and 7 all hit `checkpoint_reached` with nearly identical `remaining` fields. Each run used 10 full iterations yet made no real progress: the `remaining` list after run 7 was essentially the same as after run 5. The handoff loop continued spawning new runs until `maxHandoffs` was hit, burning 30 more iterations and about 90 minutes of wall time.

The existing `maxHandoffs` limit (default 5) was the only backstop, but it does not distinguish between runs that make real progress and runs that achieve nothing. It allows up to 5 useless handoffs before stopping.

### Root cause

No comparison was made between consecutive `checkpoint.remaining` values. The handoff loop always continued as long as `handoffCount <= maxHandoffs`, with no check that the agent had actually made forward progress.

### Fix (`src/server/agent.js`)

Introduced a `previousRemaining` variable in the handoff loop. Before each handoff continuation, the current `checkpoint.remaining` (trimmed) is compared against the previous run's value. If they are identical, the session is stopped immediately with `intervention_required`:

```js
let previousRemaining = null;

// ... inside handoff loop, after checkpoint_reached:
const currentRemaining = (run.checkpoint.remaining || '').trim();
if (previousRemaining !== null && currentRemaining === previousRemaining) {
  finalStatus = 'intervention_required';
  finalLogSummary = 'Zero progress detected: task state unchanged after a full run. Human intervention required.';
  // log, strip, break
}
previousRemaining = currentRemaining;
```

The check fires on the **second** consecutive checkpoint that reports the same `remaining`. The first checkpoint only establishes the baseline, so a single run is never flagged on its own. An identical `remaining` on two consecutive runs means the task is genuinely stuck.

**Effect on the debugging session**: instead of running 4 useless handoffs (runs 5–8), the agent would have stopped at run 6 with `intervention_required`, saving 20 iterations and about 60 minutes.

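
To make the firing condition concrete, the comparison can be replayed over a list of `remaining` values, one per handoff run. This is a simplified simulation under stated assumptions: `findZeroProgressRun` is a hypothetical name, the sample strings are invented, and treating a non-string `remaining` as empty text is a choice made here, not something the source specifies.

```javascript
// Hypothetical simulation; each array element stands in for one run's checkpoint.remaining.
function findZeroProgressRun(remainingPerRun) {
  let previousRemaining = null;
  for (let i = 0; i < remainingPerRun.length; i++) {
    const raw = remainingPerRun[i];
    // Assumption: anything that is not a string is compared as empty text.
    const currentRemaining = typeof raw === 'string' ? raw.trim() : '';
    if (previousRemaining !== null && currentRemaining === previousRemaining) {
      return i; // index of the run at which the handoff loop would stop
    }
    previousRemaining = currentRemaining;
  }
  return -1; // every run changed the remaining text
}

// Second run repeats the first (ignoring surrounding whitespace): stop at index 1.
console.log(findZeroProgressRun(['fix scan.sh; run nuclei', ' fix scan.sh; run nuclei ', 'write report']));
// The repeat is not consecutive, so nothing is flagged.
console.log(findZeroProgressRun(['step A', 'step B', 'step A']));
```

The comparison is an exact string match after trimming, so a `remaining` field that is reworded between runs without any real progress is not detected by this check; the `maxHandoffs` cap still applies in that case.
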
---

## Issue 5: `echo -e` creates broken shell scripts on Ubuntu

### What happened

The model attempted to create `scan.sh` using:

```sh
echo -e '#!/bin/bash\n# Simple wrapper...\n...'
```

On Ubuntu, `/bin/sh` points to `dash`, not `bash`, so non-interactive commands run under `dash`. The `echo` built into `dash` does not support `-e`: it prints `-e` as literal text, so the file started with:

```
-e #!/bin/bash
# Simple wrapper...
```

Every attempt to run `scan.sh` failed with:

```
/root/.jarvis/projects/cybersecurity/scan.sh: 1: -e: not found
```

The model detected the error on the first try, but each subsequent handoff run repeated the exact same broken creation command, presumably because the `echo -e` pattern was deep in the model's training distribution for "create a shell script".

### Root cause

The system prompt's `## exec Safety` section gave guidance about filesystem scans and `cat` vs `grep`, but said nothing about portable file creation. The `echo -e` pattern works on bash but not on sh/dash, and this distinction is invisible to a model working through `exec`.

### Fix (`docs/system-prompt.md`)

Added a bullet to `## exec Safety`:

```
- **Writing multi-line files**: use `printf '...'` or a heredoc (`cat <<'EOF' > file`) instead of
  `echo -e`. The `-e` flag is not portable — on Ubuntu `/bin/sh` it is treated as literal text,
  corrupting the file.
```

---

## What Was Not Changed

- The `model_error: Empty choices array` path in `runAgentLoop`: this is a provider-side failure, and no server-side fix can prevent the model from returning nothing. The fix is at the Telegram delivery layer (surfacing the error to the user), not at the agent layer.
- The `maxHandoffs` limit: it remains as a hard cap. Zero-progress detection fires before `maxHandoffs` in the case of a truly stuck task, so the two mechanisms are complementary.
- The `consecutiveFailures` detector and exact-match loop tracker: unchanged; they work alongside the new zero-progress check.
- No per-session Perplexity call counter was added server-side.

---

## Secondary Observation: nuclei templates were present but never found

Nuclei was installed at `/usr/local/bin/nuclei`. Its templates were at `/root/nuclei-templates` (visible in the `list_dir /root` output from run 2). However, the model never connected these two facts: it searched `/usr/local/share/nuclei`, `/root/.nuclei`, and various `/usr` subdirectories, missing the templates that were right there in `/root`.

Running `nuclei -u https://dviet.de -t /root/nuclei-templates/...` (or just `nuclei -u https://dviet.de` with the templates auto-discovered from `~/nuclei-templates`) would likely have worked. The scan also failed because `nmap` reported "host seems down" for `dviet.de`: either the host was rate-limiting ICMP or the port scan was blocked. Using `nmap -Pn` to skip host discovery would have bypassed this.

Neither of these is a Jarvis bug. They are model reasoning failures specific to this session and this free model. A more capable model would likely have spotted the template path and the nmap flag.

---

## Outcome

| Fix | Files changed |
|-----|--------------|
| Better error text in Telegram catch block | `src/channels/telegram/index.js` |
| Guard against empty response before `ctx.reply()` | `src/channels/telegram/index.js` |
| Accumulate `failedApproaches` across all handoffs in `session.metadata` | `src/server/agent.js`, `src/server/sessions.js` |
| Zero-progress handoff detection via `previousRemaining` comparison | `src/server/agent.js` |
| Shell script writing guidance (`printf`/heredoc over `echo -e`) | `docs/system-prompt.md` |

@@ -1,118 +0,0 @@
# Finding 008: exec Timeout Architecture (Agent Cannot Increase Its Own Timeout)

**Date:** 2026-02-28
**Severity:** Medium: caused 5 wasted user interactions and agent confusion; no crashes or data loss
**Status:** Fixed

---

## What Happened

A user asked Jarvis to run a cybersecurity scan script (`nuclei` + `nmap`) against `https://dviet.de`. The script ran via the `exec` tool and timed out after 60 seconds. The user then asked the agent to "increase the timeout to 5 minutes."

The agent attempted this in two ways:

1. **Run 11**: Used `exec` to run `sed -i 's/"timeout": 60000/"timeout": 300000/g' tools.json`, changing the `timeout` value inside the exec tool's `code` string from 60s to 300s.
2. **Run 13**: Called `save_tool` to recreate the exec tool with a "5-minute timeout" description and modified code.

Both attempts failed. The scan timed out at 60s in every subsequent run. The agent and user concluded "the platform enforces a 60-second cap", which was true, but neither understood why the agent's changes had no effect.

---

## Root Cause

### The Two-Level Timeout Architecture

Every tool execution is governed by two independent timeouts:

**Layer 1: Outer wrapper** (`executeTool` in `src/server/tools.js`):

```js
const timeoutMs = tool.timeout || TOOL_TIMEOUT_MS;  // TOOL_TIMEOUT_MS = 60_000
const timeout = new Promise((_, reject) =>
  setTimeout(() => reject(new Error(`Tool '${name}' timed out after ${timeoutMs / 1000}s`)), timeoutMs)
);
return await Promise.race([fn(toolArgs, ...), timeout]);
```

`tool.timeout` is a **top-level property on the tool registry entry**, not a field inside `definition` or `code`. Only `system_install` had this property set (to 300,000ms). The exec tool had no top-level `timeout`, so it always used the 60s default.

**Layer 2: Inner timeout** (inside the tool's `code`):

The exec seed tool's code contains `execAsync(args.cmd, { timeout: 60000 })`. This is just a string stored in tools.json. Changing this number (via sed or save_tool) only affects the inner `execAsync` behavior; the outer `Promise.race` at 60s fires first anyway.

### Why the Agent's Fixes Had No Effect

1. **sed on the code string**: Changed the inner `execAsync` timeout from 60s to 300s. But the outer wrapper uses `tool.timeout || 60_000`, and `exec` has no `tool.timeout` property. The outer race still fires at 60s and wins.
2. **`save_tool` recreation**: `save_tool` writes `{ definition, code }` to tools.json and had no `timeout` parameter. The exec entry still had no top-level `timeout` after `save_tool`. Same result.

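
The interaction between the two layers can be reproduced with scaled-down timings. The sketch below is a simplified model, not the package's `executeTool`: `runWithOuterTimeout`, `DEFAULT_TIMEOUT_MS`, and `slowRun` are hypothetical names, the delays are in tens of milliseconds rather than seconds, and the timer cleanup in `finally` is an addition that the quoted wrapper does not show.

```javascript
// Hypothetical stand-ins for the outer wrapper and a slow tool; timings are scaled down.
const DEFAULT_TIMEOUT_MS = 60; // plays the role of TOOL_TIMEOUT_MS (60_000 in the source)

function runWithOuterTimeout(tool, name) {
  // Only a top-level `timeout` on the tool entry changes the outer limit.
  const timeoutMs = tool.timeout || DEFAULT_TIMEOUT_MS;
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Tool '${name}' timed out after ${timeoutMs}ms`)),
      timeoutMs
    );
  });
  // Clear the timer once either side settles so it cannot keep the process alive.
  return Promise.race([tool.run(), timeout]).finally(() => clearTimeout(timer));
}

// A "command" that needs 120ms to finish, whatever limit its own code declares.
const slowRun = () => new Promise((resolve) => setTimeout(() => resolve('scan finished'), 120));

const withoutTopLevelTimeout = { run: slowRun };            // outer layer falls back to 60ms
const withTopLevelTimeout = { run: slowRun, timeout: 300 }; // outer layer waits up to 300ms

runWithOuterTimeout(withoutTopLevelTimeout, 'exec')
  .then((out) => console.log('unexpected success:', out))
  .catch((err) => console.log(err.message));
runWithOuterTimeout(withTopLevelTimeout, 'exec').then((out) => console.log(out));
```

The first call rejects with the timeout message because nothing inside `run` can influence the outer race; the second resolves with `scan finished` because the limit is declared where the wrapper actually reads it.
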
### Why seedTools() Makes This Permanent

Even if a manual edit to tools.json successfully added `exec.timeout = 300000`, the next server restart would run `seedTools()`, compare the entry against `SEED_TOOLS.exec` (which had no timeout), and restore the seed version, discarding the change.

---

## What Changed

### 1. `exec` seed tool now has a 5-minute timeout (`src/server/tools.js`)

Added `timeout: 300_000` as a top-level property on the exec seed tool, which is the simplest fix:

```js
exec: {
  timeout: 300_000,  // 5 minutes; the Layer 1 outer wrapper reads this
  definition: { ... },
  code: `... execAsync(args.cmd, { timeout: 270000 }) ...`  // 4.5 min inner, leaves headroom
}
```

The inner `execAsync` timeout was also updated from 60s to 270s (4.5 min) so it fires cleanly before the outer wrapper, giving a proper error message rather than a hard kill.

### 2. `save_tool` now accepts a `timeout` parameter (`src/server/tools.js`)

The `save_tool` tool now accepts an optional `timeout` field (in milliseconds, max 600,000 = 10 minutes). When provided, it is written as a top-level property on the tool entry:

```js
const entry = { definition: { ... }, code: args.code };
if (args.timeout !== undefined) {
  const t = Number(args.timeout);
  entry.timeout = Math.min(t, 600_000);
}
tools[args.name] = entry;
```

This allows the agent to create custom tools with explicit timeout declarations when wrapping slow operations.

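
One detail worth noting about the snippet above: `Number()` returns `NaN` for non-numeric input, and `Math.min(NaN, 600_000)` is also `NaN`, while a negative number passes through unchanged. A stricter normalization could look like the following sketch. `normalizeToolTimeout` and `MAX_TOOL_TIMEOUT_MS` are hypothetical names, and this hardening is a suggestion rather than something the finding says was implemented.

```javascript
const MAX_TOOL_TIMEOUT_MS = 600_000; // the 10-minute cap named in the finding

// Hypothetical helper: returns a usable timeout in ms, or undefined so the
// caller can fall back to the default timeout.
function normalizeToolTimeout(value) {
  if (value === undefined || value === null || value === '') return undefined;
  const t = Number(value);
  // Reject NaN, Infinity, zero, and negative values.
  if (!Number.isFinite(t) || t <= 0) return undefined;
  return Math.min(Math.floor(t), MAX_TOOL_TIMEOUT_MS);
}

console.log(normalizeToolTimeout(300000));         // within the cap
console.log(normalizeToolTimeout(9_000_000));      // clamped to the cap
console.log(normalizeToolTimeout('five minutes')); // not a number
```

Returning `undefined` for bad input means an entry is simply stored without a `timeout`, so the outer wrapper's `tool.timeout || TOOL_TIMEOUT_MS` fallback keeps working as described earlier.
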
### 3. System prompt updated (`docs/system-prompt.md`)

Added a `## Execution Timeouts` section documenting:

- `exec` = 5-minute cap (covers scans, builds, and most long-running commands)
- `system_install` = 5-minute cap, use for package installation
- Custom tools via `save_tool` = 60s default, pass `timeout` param to extend
- Background execution pattern for processes > 5 minutes

### 4. `agent.md` updated (`docs/agent.md`)

Added a `### Two-Level Tool Timeout Architecture` subsection under `## Limits and Timeouts` explaining:

- The outer wrapper and how `tool.timeout` is read
- The inner timeout in tool code and its relationship to the outer
- How to declare a custom timeout via `save_tool`
- Why seed tool modifications via `save_tool` don't change the outer timeout (seedTools() restores on restart)

---

## What Was Not Changed

- `TOOL_TIMEOUT_MS` constant: remains at 60,000ms (the default for tools without an explicit timeout)
- `system_install`: unchanged
- The handoff system, checkpoint memory, loop detection: all unchanged
- `seedTools()` update detection logic: unchanged

---

## Outcome

- `exec` now has a 5-minute timeout, so long-running scans, builds, and downloads work without any workaround
- The agent can set custom timeouts on tools it creates via `save_tool`
- The system prompt and docs explain the architecture so the agent doesn't waste iterations on an unsolvable problem

@@ -1,153 +0,0 @@
# Finding 009: Non-String `response` Field Crashes Telegram Delivery

**Date:** 2026-03-01
**Severity:** High: caused "Sorry, something went wrong sending the response" with no useful information for the user
**Status:** Fixed

---

## Observed Session

The session comprised 19 agent runs, all completing successfully (`ok` or `checkpoint_reached`). The crash occurred on run 19. The user asked:

> "List me all tool calls you did In this session. Tool name and args are enough to display for each entry."

The model returned valid JSON but placed the list of tool calls as a JSON **array** (not a string) in the `response` field:

```json
{
  "response": [
    { "tool": "exec", "args": { "cmd": "find ..." } },
    ...16 entries...
  ],
  "logSummary": "Enumerated every tool call made during the session..."
}
```

The Telegram user received: **"Sorry, something went wrong sending the response. Please try again."**

---

## Bug Chain

### Step 1: Agent parses valid JSON, stores non-string response

`runAgentLoop` in `src/server/agent.js` successfully parsed the model's response JSON. The extraction logic had no type check:

```js
response = parsed.response || content;
```

`parsed.response` was an array (truthy), so `response` was set to the array with no validation. The array propagated through `finalResponse` all the way to the return value of `handleChat`.

### Step 2: Telegram handler crashes calling `.trim()` on an array

In `src/channels/telegram/index.js`:

```js
const text = result.response?.trim()
  || 'The agent encountered an error...';
```

`?.` guards against `null` and `undefined` only, not against wrong types. Arrays do not have a `.trim()` method. This threw:

```
TypeError: result.response.trim is not a function
```

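
The difference between a missing value and a value of the wrong type is easy to see in isolation. The following is a small illustrative sketch; `tryTrim` is a hypothetical helper and the sample values are invented, not taken from the session log.

```javascript
// Illustrative helper: report whether `value?.trim()` succeeds for a given input.
function tryTrim(value) {
  try {
    // Optional chaining skips the call only when `value` is null or undefined.
    return { ok: true, text: value?.trim() };
  } catch (err) {
    return { ok: false, error: err.constructor.name };
  }
}

console.log(tryTrim('  hi  '));            // string: trimmed normally
console.log(tryTrim(null));                // null: short-circuits, text is undefined
console.log(tryTrim([{ tool: 'exec' }]));  // array: has no trim method, so the call throws
```

For the array, `value` is neither `null` nor `undefined`, so the chain proceeds, looks up a `trim` property that does not exist, and fails when it tries to call it.
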
### Step 3: Delivery catch block sends the generic error

The TypeError was caught by the outer delivery try/catch, which replied:

```
Sorry, something went wrong sending the response. Please try again.
```

The user had no idea what failed. The agent had completed successfully; only the delivery step crashed.

---

## Root Causes

**Primary**: `agent.js` never validates that `parsed.response` is a string after JSON parsing. The response contract ("Your message to the user, in plain text.") is documented but never enforced. Any truthy JSON value (array, object, non-zero number) passes through silently.

**Secondary**: `telegram/index.js` assumed `result.response` would always be a string or null/undefined, and called `.trim()` without type-guarding.

The same primary bug exists in the wrap-up path (line ~315):

```js
response = parsedWrapUp.response || '';
```

This would fail identically if the wrap-up model returned a non-string `response`.

---

## What Was Not Caught Earlier

- The JSONL log stored `response: [array]` but the run status was `ok`, so nothing was flagged as an error on the agent side.
- The error only surfaces in the Telegram delivery layer, which has no visibility into the JSONL log.
- The model's intent was valid (listing tool calls as structured data); it simply used the wrong JSON type for the `response` field.

---

## Fix

### 1. `src/server/agent.js`: normalize response to string at both sites

**Main response path:**

```js
// Before:
response = parsed.response || content;

// After:
response = typeof parsed.response === 'string'
  ? parsed.response
  : JSON.stringify(parsed.response, null, 2);
```

**Wrap-up path:**

```js
// Before:
response = parsedWrapUp.response || '';

// After:
response = typeof parsedWrapUp.response === 'string'
  ? parsedWrapUp.response
  : parsedWrapUp.response != null ? JSON.stringify(parsedWrapUp.response, null, 2) : '';
```

When the model returns a non-string (array, object), it is JSON-stringified with 2-space indentation. The user gets a readable representation of the intended content rather than a crash. This preserves the model's intent while enforcing the string contract.

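
Both call sites apply the same rule, so it could be factored into one helper. The sketch below is an assumption about how that might look, not the package's code: `normalizeResponse` and its `fallback` parameter are hypothetical. The explicit `undefined`/`null` branch matters because `JSON.stringify(undefined)` returns `undefined` rather than a string, which is relevant if the `response` key is missing altogether.

```javascript
// Hypothetical helper; the name and the `fallback` parameter are assumptions.
function normalizeResponse(value, fallback = '') {
  if (typeof value === 'string') return value;
  // A missing field falls back to caller-supplied text (for example, the raw content).
  if (value === undefined || value === null) return fallback;
  // Arrays, objects, numbers, and booleans become readable JSON text.
  return JSON.stringify(value, null, 2);
}

console.log(normalizeResponse('All done.'));
console.log(normalizeResponse([{ tool: 'exec', args: { cmd: 'find ...' } }]));
console.log(normalizeResponse(undefined, 'raw model output'));
```

With a helper like this, the main path would pass the raw `content` as the fallback and the wrap-up path would pass an empty string, keeping the two sites consistent.
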
### 2. `src/channels/telegram/index.js`: defense-in-depth type guard

```js
// Before:
const text = result.response?.trim()
  || 'The agent encountered an error and could not produce a response. Please try again.';

// After:
const rawResponse = typeof result.response === 'string'
  ? result.response
  : result.response != null ? JSON.stringify(result.response, null, 2) : '';
const text = rawResponse.trim()
  || 'The agent encountered an error and could not produce a response. Please try again.';
```

### 3. `docs/system-prompt.md`: explicit type constraint on `response`

Added a short note to the `## Response Format` section:

```
The `response` value must be a plain text string — never an array or object. If you need to present structured data (e.g. a list of items), format it as text within the string value.
```

---

## Outcome

| Fix | Files changed |
|-----|--------------|
| Coerce `parsed.response` to string in main and wrap-up paths | `src/server/agent.js` |
| Type guard before `.trim()` call | `src/channels/telegram/index.js` |
| Explicit type constraint on `response` field | `docs/system-prompt.md` |

**Effect on the debugging session**: instead of "Sorry, something went wrong sending the response", the user would have received the tool call list formatted as a readable JSON string.

@@ -1,121 +0,0 @@
|
|
|
1
|
-
# Finding 010: Non-String `checkpoint.remaining` Crashes Zero-Progress Detection
|
|
2
|
-
|
|
3
|
-
**Date:** 2026-03-01
|
|
4
|
-
**Severity:** High — caused "Sorry, something went wrong" in Telegram with no useful context; crashed the handoff loop mid-run
|
|
5
|
-
**Status:** Fixed
|
|
6
|
-
|
|
7
|
-
---
|
|
8
|
-
|
|
9
|
-
## Observed Session
|
|
10
|
-
|
|
11
|
-
The session ran 13+ agent runs working on OWASP ZAP installation. Runs 8–13 were consecutive `checkpoint_reached` handoffs. On entry 14 (immediately after entry 13), the server logged:
|
|
12
|
-
|
|
13
|
-
```
|
|
14
|
-
status: error
|
|
15
|
-
response: "An unexpected server error occurred: (run.checkpoint.remaining || "").trim is not a function"
|
|
16
|
-
```
|
|
17
|
-
|
|
18
|
-
The Telegram user received:
|
|
19
|
-
|
|
20
|
-
```
|
|
21
|
-
Sorry, something went wrong: (run.checkpoint.remaining || "").trim is not a function
|
|
22
|
-
```
|
|
23
|
-
|
|
24
|
-
---
|
|
25
|
-
|
|
26
|
-
## Bug Chain
|
|
27
|
-
|
|
28
|
-
### Step 1 — Wrap-up call returns non-string `remaining`
|
|
29
|
-
|
|
30
|
-
At the iteration limit, `runAgentLoop` sends the `WRAP_UP_NOTE` and parses the model's JSON response. The model returned `checkpoint.remaining` as a non-string value (array or object) instead of a plain text string. `parsedWrapUp.checkpoint` was then stored and returned without any type validation.
|
|
31
|
-
|
|
32
|
-
### Step 2 — Zero-progress detection crashes on `.trim()`
|
|
33
|
-
|
|
34
|
-
In `_runHandleChat`, finding 007 introduced zero-progress detection:
|
|
35
|
-
|
|
36
|
-
```js
|
|
37
|
-
const currentRemaining = (run.checkpoint.remaining || '').trim();
|
|
38
|
-
```
|
|
39
|
-
|
|
40
|
-
The `|| ''` guard only substitutes the fallback for falsy values (such as `null`, `undefined`, or an empty string). A truthy non-string (array, object) passes through the `||` unchanged, so `.trim()` is looked up on a value that has no such method:
|
|
41
|
-
|
|
42
|
-
```
|
|
43
|
-
TypeError: (run.checkpoint.remaining || "").trim is not a function
|
|
44
|
-
```
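The failure mode can be reproduced outside the agent. In this minimal sketch, `trimRemaining` is a hypothetical stand-in for the pre-fix expression, and its argument plays the role of `run.checkpoint.remaining`.

```js
// Hypothetical stand-in for the pre-fix expression in the zero-progress check.
function trimRemaining(remaining) {
  return (remaining || '').trim();
}

// Falsy values are replaced by '' and trim safely.
const trimmedNull = trimRemaining(null);
const trimmedUndefined = trimRemaining(undefined);

// A truthy array short-circuits the `||`, so `.trim` is looked up on the array itself.
let trimError = null;
try {
  trimRemaining(['install Java', 'create symlink']);
} catch (err) {
  trimError = err;
}

console.log(trimmedNull === '', trimmedUndefined === '', trimError instanceof TypeError);
```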
|
|
45
|
-
|
|
46
|
-
### Step 3 — Outer catch logs the error and re-throws
|
|
47
|
-
|
|
48
|
-
The `try/catch` at the top of the handoff loop caught the TypeError, wrote an `error` status log entry, and re-threw. The Telegram handler surfaced the raw error message.
|
|
49
|
-
|
|
50
|
-
---
|
|
51
|
-
|
|
52
|
-
## Secondary Issues
|
|
53
|
-
|
|
54
|
-
**`resumeContent` (line 520)**: `run.checkpoint.remaining || 'Continue with the task.'` — if `remaining` is a truthy non-string, it would be pushed directly into `session.messages` as the next user message content. The message API expects a string, so this would produce a malformed conversation message.
|
|
55
|
-
|
|
56
|
-
**`failedApproaches` spread (lines 461–463)**: If the model returns `failedApproaches` as a non-array (string, object), `push(...value)` misbehaves. A string spreads into its individual characters; a plain object is not iterable, so spreading it throws a `TypeError`.
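How spread arguments treat each input type can be checked directly. This is a minimal sketch: `appendApproaches` is a hypothetical helper that mirrors an unguarded `push(...value)`, not code from `agent.js`. In a call like this, a plain object is not iterable and throws.

```js
// Hypothetical helper mirroring an unguarded `push(...value)`.
function appendApproaches(target, value) {
  target.push(...value);
  return target;
}

// An array spreads element by element, as intended.
const spreadFromArray = appendApproaches([], ['apt install', 'snap install']);

// A string is iterable, so it spreads into single characters.
const spreadFromString = appendApproaches([], 'apt');

// A plain object is not iterable; spreading it into call arguments throws.
let spreadError = null;
try {
  appendApproaches([], { first: 'apt install' });
} catch (err) {
  spreadError = err;
}

console.log(spreadFromArray, spreadFromString, spreadError instanceof TypeError);
```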
|
|
57
|
-
|
|
58
|
-
---
|
|
59
|
-
|
|
60
|
-
## Root Cause
|
|
61
|
-
|
|
62
|
-
Same class of bug as finding 009 (non-string `response` field). Finding 009 hardened `response` and `logSummary` extraction, but the `checkpoint` sub-object fields were not included in that hardening pass. Models — especially smaller/free models under iteration-limit pressure — sometimes return structured data (arrays, objects) in fields the system prompt specifies as plain text strings.
|
|
63
|
-
|
|
64
|
-
---
|
|
65
|
-
|
|
66
|
-
## Fix
|
|
67
|
-
|
|
68
|
-
### `src/server/agent.js` — normalize checkpoint fields at source
|
|
69
|
-
|
|
70
|
-
Added a normalization block immediately inside the `if (parsedWrapUp.checkpoint)` branch, before any checkpoint field is accessed downstream:
|
|
71
|
-
|
|
72
|
-
```js
|
|
73
|
-
const cp = parsedWrapUp.checkpoint;
|
|
74
|
-
// remaining must be a string — used as the next run's resume prompt
|
|
75
|
-
if (typeof cp.remaining !== 'string') {
|
|
76
|
-
cp.remaining = Array.isArray(cp.remaining)
|
|
77
|
-
? cp.remaining.map(String).join('\n')
|
|
78
|
-
: cp.remaining != null ? JSON.stringify(cp.remaining) : '';
|
|
79
|
-
}
|
|
80
|
-
// failedApproaches must be an array of strings — spread into session metadata
|
|
81
|
-
if (!Array.isArray(cp.failedApproaches)) {
|
|
82
|
-
cp.failedApproaches = [];
|
|
83
|
-
} else {
|
|
84
|
-
cp.failedApproaches = cp.failedApproaches.map(item =>
|
|
85
|
-
typeof item === 'string' ? item : JSON.stringify(item)
|
|
86
|
-
);
|
|
87
|
-
}
|
|
88
|
-
```
|
|
89
|
-
|
|
90
|
-
**Array coercion for `remaining`**: when the model returns an array (e.g., `["install Java", "create symlink"]`), elements are joined with newlines rather than JSON-stringified — producing a natural readable resume prompt rather than raw JSON syntax.
|
|
91
|
-
|
|
92
|
-
**Centralized normalization**: fixing at source (right after parse) rather than at each use site means lines 469 and 520 need no change. Any future use of `checkpoint.remaining` or `checkpoint.failedApproaches` is automatically safe.
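The normalization block can be exercised on its own. This is a minimal sketch: `normalizeCheckpoint` is a hypothetical wrapper name introduced here, and the sample checkpoint values are illustrative rather than taken from the session logs.

```js
// Hypothetical wrapper around the normalization block shown above.
function normalizeCheckpoint(cp) {
  // remaining must end up as a string: arrays become newline-joined text.
  if (typeof cp.remaining !== 'string') {
    cp.remaining = Array.isArray(cp.remaining)
      ? cp.remaining.map(String).join('\n')
      : cp.remaining != null ? JSON.stringify(cp.remaining) : '';
  }
  // failedApproaches must end up as an array of strings.
  if (!Array.isArray(cp.failedApproaches)) {
    cp.failedApproaches = [];
  } else {
    cp.failedApproaches = cp.failedApproaches.map(item =>
      typeof item === 'string' ? item : JSON.stringify(item)
    );
  }
  return cp;
}

// Illustrative malformed checkpoints of the kind described in this finding.
const cpFromArray = normalizeCheckpoint({
  remaining: ['install Java', 'create symlink'],
  failedApproaches: 'apt install failed',
});
const cpFromObject = normalizeCheckpoint({
  remaining: { step: 'install Java' },
  failedApproaches: ['snap install', { cmd: 'apt install' }],
});

console.log(cpFromArray);
console.log(cpFromObject);
```

After normalization, both `(cp.remaining || '').trim()` and `push(...cp.failedApproaches)` operate on the types they expect.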
|
|
93
|
-
|
|
94
|
-
### `src/server/agent.js` — update `WRAP_UP_NOTE`
|
|
95
|
-
|
|
96
|
-
Added explicit type constraints to the `remaining` field description and a trailing instruction:
|
|
97
|
-
|
|
98
|
-
```
|
|
99
|
-
"remaining": "What still needs to be done — as a plain text string, never an array or object."
|
|
100
|
-
...
|
|
101
|
-
remaining must be a plain text string. failedApproaches must be a JSON array of strings.
|
|
102
|
-
```
|
|
103
|
-
|
|
104
|
-
---
|
|
105
|
-
|
|
106
|
-
## What Was Not Changed
|
|
107
|
-
|
|
108
|
-
- `agent.js` lines 469 and 520 — no changes needed; normalization at source makes them safe
|
|
109
|
-
- `src/channels/telegram/index.js` — findings 007 and 009 already added `.catch(() => {})` and type guards on delivery
|
|
110
|
-
- `sessions.js`, `tools.js` — no changes needed
|
|
111
|
-
|
|
112
|
-
---
|
|
113
|
-
|
|
114
|
-
## Outcome
|
|
115
|
-
|
|
116
|
-
| Fix | Files changed |
|
|
117
|
-
|-----|--------------|
|
|
118
|
-
| Normalize `checkpoint.remaining` to string and `checkpoint.failedApproaches` to string array at source | `src/server/agent.js` |
|
|
119
|
-
| Add explicit type constraints to WRAP_UP_NOTE | `src/server/agent.js` |
|
|
120
|
-
|
|
121
|
-
**Effect**: instead of a `TypeError` crash mid-handoff-loop, the model's non-string `remaining` value is coerced to a readable string and used as the resume prompt. The session continues normally.
|