@ducci/jarvis 1.0.38 → 1.0.39

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (31)
  1. package/docs/agent.md +43 -4
  2. package/docs/crons.md +100 -0
  3. package/docs/identity.md +38 -0
  4. package/docs/skills.md +77 -0
  5. package/docs/system-prompt.md +25 -13
  6. package/docs/telegram.md +19 -0
  7. package/package.json +2 -1
  8. package/src/server/agent.js +44 -14
  9. package/src/server/app.js +125 -2
  10. package/src/server/config.js +43 -0
  11. package/src/server/cron-scheduler.js +35 -0
  12. package/src/server/crons.js +106 -0
  13. package/src/server/tools.js +192 -71
  14. package/docs/findings/001-context-explosion.md +0 -116
  15. package/docs/findings/002-handoff-edge-cases.md +0 -84
  16. package/docs/findings/003-event-loop-blocking-and-reliability.md +0 -120
  17. package/docs/findings/004-agent-reliability-improvements.md +0 -162
  18. package/docs/findings/005-installation-timeout.md +0 -128
  19. package/docs/findings/006-malformed-tool-schema.md +0 -118
  20. package/docs/findings/007-telegram-errors-and-handoff-stalling.md +0 -271
  21. package/docs/findings/008-exec-timeout-architecture.md +0 -118
  22. package/docs/findings/009-non-string-response-field.md +0 -153
  23. package/docs/findings/010-checkpoint-field-type-safety.md +0 -121
  24. package/docs/findings/011-empty-model-response.md +0 -157
  25. package/docs/findings/012-empty-nudge-loses-recovery-text.md +0 -121
  26. package/docs/findings/013-stderr-visibility-and-truncation.md +0 -59
  27. package/docs/findings/014-exec-stderr-artifact-and-malformed-tool-args.md +0 -202
  28. package/docs/findings/015-failed-run-context-strip.md +0 -142
  29. package/docs/findings/016-file-writing-corruption-and-stderr-loop.md +0 -119
  30. package/docs/findings/017-looping-intervention-and-lossy-checkpoint.md +0 -110
  31. package/docs/findings/018-anthropic-oauth-token-support.md +0 -72
@@ -1,116 +0,0 @@
1
- # Finding 001: Context Window Explosion via Tool Output Accumulation
2
-
3
- **Date:** 2026-02-26
4
- **Severity:** High — renders session completely unusable after enough handoffs
5
- **Status:** Fixed
6
-
7
- ---
8
-
9
- ## What Happened
10
-
11
- A session was started with the question *"Hast du Zugriff auf deinen source code? Wo liegt er?"* ("Do you have access to your source code? Where is it?"). The agent began exploring the filesystem using `exec` and `list_dir`, running commands like `cat agent.js`, `cat tools.js`, `cat app.js`, and various `find` commands.
12
-
13
- The task required more than 10 iterations to complete, so the checkpoint/handoff mechanism fired. The agent ran 6 consecutive handoff runs before hitting `maxHandoffs` and stopping with `intervention_required`.
14
-
15
- By that point the session `conversation.json` had grown to **687KB**. On the very next user message (*"Why?"*), both the primary and fallback models returned a `400 Provider returned error`. The session was permanently broken — no further messages could be processed.
16
-
17
- ---
18
-
19
- ## Root Cause
20
-
21
- Two compounding problems:
22
-
23
- ### 1. Tool output stored verbatim, without size limit
24
-
25
- `exec` returns raw `stdout` from shell commands. When the model runs `cat agent.js` (440 lines, ~22 000 chars), that entire output gets stored in `session.messages` as a `role: "tool"` message. Every subsequent model request in that run — and in all future runs — sends this content in full.
26
-
27
- There was no cap anywhere on tool result content. A single run of 10 iterations with a few `cat` calls could easily produce 100–200 KB of tool messages.
28
-
29
- ### 2. Handoff runs accumulated on top of each other
30
-
31
- When the iteration limit is hit, the checkpoint/handoff mechanism pushes `checkpoint.remaining` as a new user message and starts a fresh agent run — but on top of the **same, growing** `session.messages` array. Each of the 6 handoff runs added another 10 iterations of tool call messages to the history. Nothing was ever removed.
32
-
33
- After 6 runs × ~10 iterations × multiple `cat` commands each, the context reached approximately 170 000 tokens (687 KB of stored messages at roughly 4 characters per token) — exceeding the free model's 128 000 token limit. The `400` was the provider rejecting the oversized request.
34
-
35
- ### Why the `400` appeared on the *next* user message, not during the run
36
-
37
- The session's final run hit `maxHandoffs` and stopped. At that point the context was already at or near the limit. When the user sent a new message, the full bloated history was loaded and sent again — this time slightly over the limit — causing the rejection.
38
-
39
- ---
40
-
41
- ## Model Context Windows (for reference)
42
-
43
- | Model | Context Window |
44
- |---|---|
45
- | arcee-ai/trinity-large-preview:free | ~128 000 tokens |
46
- | Claude Sonnet 4.6 | 200 000 tokens |
47
- | Gemini 2.5 Pro / 2.0 Flash | 1 000 000 tokens |
48
-
49
- A larger model would have delayed the failure, but not prevented it. The conversation would still grow unboundedly.
50
-
51
- ---
52
-
53
- ## What We Considered
54
-
55
- **Truncate tool results in `prepareMessages`** — works, but runs on every loop iteration and is the wrong place conceptually. The content is already stored in full in the session before `prepareMessages` is ever called.
56
-
57
- **Naive sliding window (drop oldest N messages)** — breaks the OpenRouter/OpenAI API contract. Every `role: "tool"` message must be paired with the assistant message containing the matching `tool_call_id`. Slicing arbitrarily through the message array orphans tool results and causes a `400` — the exact error we're trying to fix.
58
-
59
- **Token budget / summarisation** — more adaptive but significantly more complex. Requires either token counting per model or an extra LLM call. Overkill for v1.
60
-
61
- ---
62
-
63
- ## Fix
64
-
65
- Two targeted changes to `src/server/agent.js`.
66
-
67
- ### 1. Cap tool result content at write time (`MAX_TOOL_RESULT = 4000`)
68
-
69
- Right where a tool result is pushed to `session.messages`, cap the content to 4 000 characters. The full result is still passed to `runToolCalls` and therefore written to the JSONL session log — no information is lost for debugging. Only what the model sees is limited.
70
-
71
- ```js
72
- const sessionContent = resultStr.length > MAX_TOOL_RESULT
73
- ? resultStr.slice(0, MAX_TOOL_RESULT) + '\n[...truncated]'
74
- : resultStr;
75
- session.messages.push({ role: 'tool', tool_call_id: toolCall.id, content: sessionContent });
76
- ```
77
-
78
- 4 000 chars is ~80 lines of code or a full `ls -la` listing — enough for the model to reason about any output. If more detail is needed, the model should use targeted commands (`grep`, `head`, `tail`) rather than `cat`-ing entire files.
79
-
80
- ### 2. Strip intermediate tool messages before each handoff
81
-
82
- Before calling `runAgentLoop`, snapshot `session.messages.length` as `runStartIndex`. If the run ends with `checkpoint_reached`, splice out all messages added during that run *except the final wrap-up assistant response*, then push `checkpoint.remaining` as the new user message.
83
-
84
- ```js
85
- const runStartIndex = session.messages.length;
86
- const run = await runAgentLoop(...);
87
-
88
- // on checkpoint_reached, before resuming:
89
- session.messages.splice(runStartIndex, session.messages.length - runStartIndex - 1);
90
- session.messages.push({ role: 'user', content: run.checkpoint.remaining });
91
- ```
92
-
93
- **Before** (after 6 handoffs):
94
- ```
95
- [system] [user: question] [assistant/tool ×10] [wrap-up] [user: checkpoint1]
96
- [assistant/tool ×10] [wrap-up] [user: checkpoint2]
97
- [assistant/tool ×10] [wrap-up] ... → 687 KB
98
- ```
99
-
100
- **After** (after 6 handoffs):
101
- ```
102
- [system] [user: question] [wrap-up] [user: checkpoint1]
103
- [wrap-up] [user: checkpoint2]
104
- [wrap-up] ... → ~5 KB
105
- ```
106
-
107
- Each handoff now adds 2 messages instead of 20+. The wrap-up message carries the relevant state (what was done, what remains) so the model is not flying blind — it just doesn't have the raw tool noise from previous runs.
108
-
109
- ---
110
-
111
- ## Outcome
112
-
113
- - Sessions with long-running tasks no longer grow the context unboundedly.
114
- - The JSONL session log is unaffected — full tool outputs are always written there.
115
- - The model can still access previous run output via `read_session_log` if needed.
116
- - A follow-up message after a completed multi-handoff task will no longer receive a `400`.
@@ -1,84 +0,0 @@
1
- # Finding 002: Handoff Edge Cases Found During Review of Finding 001
2
-
3
- **Date:** 2026-02-26
4
- **Severity:** Medium
5
- **Status:** Fixed
6
-
7
- ---
8
-
9
- ## Context
10
-
11
- While reviewing the fix for [Finding 001](./001-context-explosion.md), two edge cases in the handoff system were found. Neither caused problems in the observed debugging session, but both could cause failures under specific conditions.
12
-
13
- ---
14
-
15
- ## Issue A: `checkpoint.remaining` could be `null`, causing a 400 on the next iteration
16
-
17
- ### What could happen
18
-
19
- When the iteration limit is hit, the agent asks the model for a wrap-up response that includes a `checkpoint` field:
20
-
21
- ```json
22
- {
23
- "response": "...",
24
- "logSummary": "...",
25
- "checkpoint": {
26
- "progress": "...",
27
- "remaining": "..."
28
- }
29
- }
30
- ```
31
-
32
- The server then pushes `checkpoint.remaining` as a user message to start the next run:
33
-
34
- ```js
35
- session.messages.push({ role: 'user', content: run.checkpoint.remaining });
36
- ```
37
-
38
- Weaker or free models occasionally omit required fields or set them to `null`. If `remaining` is `null`, the session gets a `{ role: 'user', content: null }` message. Most providers reject a null content field with a `400 Bad Request` on the next model call — the same error that surfaced in Finding 001, but from a different cause.
39
-
40
- ### Fix
41
-
42
- ```js
43
- session.messages.push({ role: 'user', content: run.checkpoint.remaining || 'Continue with the task.' });
44
- ```
45
-
46
- ---
47
-
48
- ## Issue B: `intervention_required` did not strip tool history before saving
49
-
50
- ### What could happen
51
-
52
- The tool history strip introduced in Finding 001 runs right before pushing `checkpoint.remaining` for the next run. But the `intervention_required` path (max handoffs exceeded) breaks out of the loop *before* reaching the strip:
53
-
54
- ```js
55
- if (session.metadata.handoffCount > config.maxHandoffs) {
56
- // ... log and set status ...
57
- break; // ← strip never ran
58
- }
59
-
60
- // strip only reached here, after the if-block
61
- session.messages.splice(runStartIndex, session.messages.length - runStartIndex - 1);
62
- ```
63
-
64
- This meant a session that hit the handoff limit was saved with the full tool history of the last run still in it. When the user sends a new message after `intervention_required`, the model receives all of that accumulated tool history — the same context bloat risk as before the fix in Finding 001.
65
-
66
- ### Fix
67
-
68
- Strip the tool history inside the `intervention_required` branch, before breaking:
69
-
70
- ```js
71
- if (session.metadata.handoffCount > config.maxHandoffs) {
72
- // ... log and set status ...
73
- session.messages.splice(runStartIndex, session.messages.length - runStartIndex - 1);
74
- break;
75
- }
76
- ```
77
-
78
- The wrap-up assistant message (last in the array) is preserved — it gives the model context about what was attempted when the user resumes.
79
-
80
- ---
81
-
82
- ## Why these weren't caught earlier
83
-
84
- Both issues only manifest under specific conditions (model omitting a field; hitting maxHandoffs exactly). The debugging session in Finding 001 stopped at `intervention_required` after 6 handoffs, but the 400 error on the next message was attributed to the overall context size, masking the fact that the strip hadn't run for that final run.
@@ -1,120 +0,0 @@
1
- # Finding 003: Event Loop Blocking, Async File I/O, and Session Reliability
2
-
3
- **Date:** 2026-02-27
4
- **Severity:** High — caused observed 100% CPU and server unresponsiveness in production
5
- **Status:** Fixed
6
-
7
- ---
8
-
9
- ## What Happened
10
-
11
- A session was started with the question *"Kannst du deinen source code finden und anschauen mittels Tools?"* ("Can you find and look at your source code using tools?"). The agent used the `exec` tool to run two full-filesystem scans:
12
-
13
- ```
14
- find / -type f \( -iname "*.js" -o -iname "*.ts" -o -iname "*.py" \) 2>/dev/null | head -20
15
- find / -type d -name "jarvis" 2>/dev/null
16
- ```
17
-
18
- Both commands start from filesystem root `/`. The second has no output limit and scans everything: real disk filesystems, `/proc`, `/sys`, `/dev`, and any network mounts. On the affected Linux server this caused the CPU to reach 100% and the server became unresponsive. The server had to be shut down manually.
19
-
20
- ---
21
-
22
- ## Root Cause
23
-
24
- ### 1. `execSync` blocks the entire Node.js event loop
25
-
26
- Both `exec` and `list_dir` used `execSync` from `child_process`. `execSync` is a synchronous call that blocks the event loop for its entire duration. While any shell command runs:
27
-
28
- - Express cannot process incoming HTTP requests
29
- - The Telegram bot cannot receive or process new messages
30
- - All timers and async callbacks are frozen (including the Telegram `typingInterval`, so the user sees no activity indicator)
31
-
32
- The OS sees a CPU-hungry `find` child process running at full speed while Node.js sits blocked waiting for it. Combined, this presents as ~100% CPU with a completely unresponsive server.
33
-
34
- Additionally, `list_dir` used `execSync` with **no timeout at all**. A hanging command (e.g. `ls` on an NFS mount or a blocked `/proc` entry) would freeze the server permanently.
35
-
36
- ### 2. All file I/O was synchronous
37
-
38
- `loadSession`, `saveSession`, `appendLog`, and `loadTools` all used `fs.*Sync` variants. In an async Node.js server these block the event loop on every request. For small files the impact is measured in microseconds, but the pattern is architecturally incorrect and accumulates under load.
39
-
40
- ### 3. Session not saved on unexpected error
41
-
42
- In `handleChat`, `saveSession` was called after the `try/catch` block rather than inside a `finally`. If the catch re-threw an unexpected error, `saveSession` was never reached. The user message had already been appended to the in-memory session but the on-disk version did not reflect it — leaving the session in an inconsistent state for the next request.
43
-
44
- ### 4. No concurrency protection per session
45
-
46
- The Telegram channel uses `@grammyjs/runner`, which processes updates concurrently. If a user sent two messages in quick succession, both `handleChat` calls could load the same session simultaneously, run independent agent loops, and then each save its own copy of the session. The second write would silently overwrite the first, discarding its response.
47
-
48
- ### 5. Seed tools never updated after initial creation
49
-
50
- `seedTools()` used `if (!existing[name])` — it only wrote a seed tool on first run. Any update to `exec` or `list_dir` in the source code would never propagate to an existing installation. This blocked the async fix for `exec` and `list_dir` from taking effect.
51
-
52
- ---
53
-
54
- ## Fixes
55
-
56
- ### 1. `exec` and `list_dir` → async (`src/server/tools.js`)
57
-
58
- **`exec`**: replaced `execSync` with `promisify(exec)`. The event loop is now free during shell command execution. Timeout (60s) and maxBuffer (2MB) are preserved.
59
-
60
- **`list_dir`**: replaced `execSync` with `promisify(execFile)`. `execFile` does not use a shell interpreter, which is safer against special characters in paths. Added a 10-second timeout (previously none).
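The updated tool bodies are not shown in this excerpt; a minimal sketch of what they might look like, reusing the limits described above (`args.cmd` matches the exec seed tool shown in Finding 005, while `args.path` and the `ls -la` invocation for `list_dir` are assumptions):

```js
const { exec, execFile } = require('child_process');
const { promisify } = require('util');
const execAsync = promisify(exec);
const execFileAsync = promisify(execFile);

// exec: shell command, 60s timeout, 2MB output cap; the event loop stays free while it runs
async function runExec(args) {
  const { stdout, stderr } = await execAsync(args.cmd, {
    encoding: 'utf8',
    timeout: 60_000,
    maxBuffer: 2 * 1024 * 1024,
  });
  return { stdout, stderr };
}

// list_dir: no shell interpreter, 10-second timeout (previously none)
async function runListDir(args) {
  const { stdout } = await execFileAsync('ls', ['-la', args.path], {
    encoding: 'utf8',
    timeout: 10_000,
  });
  return { stdout };
}
```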
61
-
62
- ### 2. `executeTool` global timeout (`src/server/tools.js`)
63
-
64
- All tool executions — both built-in and AI-created — are now wrapped in `Promise.race` against a 60-second timeout. This protects against AI-created tools that hang on async operations (network requests, file I/O). The timeout matches the `exec` tool's own limit for consistency.
65
-
66
- ```js
67
- const timeout = new Promise((_, reject) =>
68
- setTimeout(() => reject(new Error(`Tool '${name}' timed out after 60s`)), 60_000)
69
- );
70
- return await Promise.race([fn(toolArgs, fs, path, process, _require), timeout]);
71
- ```
72
-
73
- Note: this does not protect against synchronous CPU loops without `await` points — that would require Worker Threads. Such code is unlikely to be generated accidentally.
74
-
75
- ### 3. Seed tools always updated (`src/server/tools.js`)
76
-
77
- `seedTools()` now compares the serialized content of each seed tool against the stored version and overwrites only when there is a difference. Updates to built-in tools propagate on the next server start without touching user-created tools.
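The comparison itself is not shown in the diff; one plausible shape, assuming seed definitions and stored tools are plain objects keyed by tool name (`seedDefinitions` and `storedTools` are assumed names):

```js
// Overwrite a stored seed tool only when its serialized form differs from the
// current source definition. User-created tools never appear in seedDefinitions,
// so they are left untouched.
for (const [name, seed] of Object.entries(seedDefinitions)) {
  const stored = storedTools[name];
  if (!stored || JSON.stringify(stored) !== JSON.stringify(seed)) {
    storedTools[name] = seed;
  }
}
```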
78
-
79
- ### 4. All file I/O → async (`src/server/sessions.js`, `src/server/logging.js`, `src/server/tools.js`)
80
-
81
- `loadSession`, `saveSession`, `appendLog`, and `loadTools` now use `fs.promises.*`. All callers in `agent.js` are updated to `await` these calls.
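For illustration, the load/save pair might now look roughly like this (the per-session JSON file layout and the `SESSIONS_DIR` constant are assumptions):

```js
const fs = require('fs');
const path = require('path');

// Assumed layout: one JSON file per session id under SESSIONS_DIR
const sessionPath = (id) => path.join(SESSIONS_DIR, `${id}.json`);

// Async replacements for the former synchronous read/write calls
async function loadSession(id) {
  const raw = await fs.promises.readFile(sessionPath(id), 'utf8');
  return JSON.parse(raw);
}

async function saveSession(id, session) {
  await fs.promises.writeFile(sessionPath(id), JSON.stringify(session, null, 2));
}
```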
82
-
83
- ### 5. `saveSession` moved to `finally` block (`src/server/agent.js`)
84
-
85
- The session is now always persisted — on success, on model error, and on unexpected errors. A failed save is caught and logged without masking the original error.
86
-
87
- ```js
88
- } finally {
89
- try {
90
- await saveSession(sessionId, session);
91
- } catch (saveErr) {
92
- console.error(`Failed to save session ${sessionId}:`, saveErr);
93
- }
94
- }
95
- ```
96
-
97
- ### 6. Session queue for concurrency control (`src/server/agent.js`)
98
-
99
- A module-level `Map<sessionId, Promise>` serializes concurrent requests for the same session. Each new request registers itself as the tail of the queue and waits for the previous request to resolve before starting. The map entry is cleaned up by whichever request is last in the chain.
100
-
101
- ```js
102
- const previous = sessionQueues.get(sessionId) ?? Promise.resolve();
103
- let releaseLock;
104
- const current = new Promise(resolve => { releaseLock = resolve; });
105
- sessionQueues.set(sessionId, current);
106
- await previous;
107
- // ... process request ...
108
- // finally: releaseLock()
109
- ```
110
-
111
- This is safe in Node.js because the event loop is single-threaded: `get`, `new Promise`, and `set` all execute synchronously before the first `await`, so there is no race between two requests reading the same `undefined` entry.
112
-
113
- ---
114
-
115
- ## What Was Not Changed
116
-
117
- - The agent loop logic, checkpoint/handoff system, loop detection, and format recovery — all unchanged.
118
- - `seedTools()` remains synchronous (called once at startup, before the server accepts requests).
119
- - `createSession()` and `getToolDefinitions()` remain synchronous (pure functions, no I/O).
120
- - No rate limiting or HTTP authentication added — the server is intended for local/personal use only.
@@ -1,162 +0,0 @@
1
- # Finding 004: Agent Reliability — Failure Loops, Checkpoint Memory, and Iteration Awareness
2
-
3
- **Date:** 2026-02-27
4
- **Severity:** High — caused observed session failure with 54 tool calls, `format_error` crash, and no useful output after 42 minutes
5
- **Status:** Fixed
6
-
7
- ---
8
-
9
- ## What Happened
10
-
11
- A session was started to build a cybersecurity project installing three tools: Nuclei, Subfinder, and Naabu. The agent went through 5 handoffs (hitting `maxHandoffs`) and crashed with a `format_error` in the final iteration. Observations:
12
-
13
- - **181 exec calls** across 19 agent iterations
14
- - **21 perplexity_search calls**, including 11+ in the final iteration alone
15
- - The agent oscillated between download strategies (Docker → `go install` → tarball → direct binary) without memory of what had already failed
16
- - The existing loop detector never fired because each failed command had slightly different arguments (different URLs, different flags), producing different `callKey` values
17
- - Each handoff resumed with only `checkpoint.remaining` — no record of what approaches had already been exhausted
18
- - The model degraded and eventually produced a non-JSON response (`format_error`), crashing the session
19
-
20
- ---
21
-
22
- ## Root Causes
23
-
24
- ### 1. Loop detection only caught exact-match repetition
25
-
26
- The existing `loopTracker` detects when the _exact same_ tool call (name + args + result) is repeated 3 times. It does not detect _semantic_ failure loops: repeated attempts to do the same thing via slightly different commands. In the session, each download attempt used a different URL or flags, so every `callKey` was unique.
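The key construction is not shown here, but it is presumably something along these lines (treating `loopTracker` as a Map and using generic `name`/`toolArgs`/`result` variables is an assumption), which makes clear why changing a single URL or flag defeats the detector:

```js
// Exact-match loop key (sketch): any difference in name, args, or result yields
// a new key, so "same strategy, slightly different command" never counts as a repeat.
const callKey = `${name}:${JSON.stringify(toolArgs)}:${JSON.stringify(result)}`;
const count = (loopTracker.get(callKey) ?? 0) + 1;
loopTracker.set(callKey, count);
if (count >= 3) {
  // identical call repeated 3 times: loop detected
}
```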
27
-
28
- ### 2. Checkpoint carried no memory of failed strategies
29
-
30
- `WRAP_UP_NOTE` asked the model for `progress` and `remaining`, but not for a record of what _failed_. Each new run after a handoff started with a blank slate — the model had no way to know that curl downloads, `go install`, and tarball extraction had already been tried and failed. It repeated them.
31
-
32
- ### 3. No iteration budget awareness
33
-
34
- The model had no visibility into how many iterations remained in the current run. It kept taking exploratory steps as if budget were unlimited, then was surprised by the wrap-up call. A model that knows it has 2 iterations left will consolidate and report; one that doesn't will keep digging.
35
-
36
- ### 4. `perplexity_search` used without restraint
37
-
38
- The tool description had no guidance on usage limits. The model searched Perplexity 21 times in one session (11 in the final iteration), including redundant queries for the same version information. Each search consumed an iteration and added latency without improving outcomes.
39
-
40
- ### 5. System prompt had no "give up" rule
41
-
42
- The system prompt told the agent to "decide whether to retry with a corrected call or explain the failure to the user" — but gave no threshold for when to stop retrying. In practice, the agent always chose to retry, regardless of how many times the same approach had failed.
43
-
44
- ---
45
-
46
- ## Fixes
47
-
48
- ### 1. Consecutive failure detection (`src/server/agent.js`)
49
-
50
- Added a `consecutiveFailures` counter that tracks back-to-back failed tool calls across all iterations within a run. A tool call counts as failed if `executeTool` throws (`toolStatus === 'error'`) _or_ the result object has `status === 'error'` (catching exec failures that are returned, not thrown). A successful call resets the counter to 0.
51
-
52
- After each iteration's tool calls, if `consecutiveFailures >= CONSECUTIVE_FAILURE_THRESHOLD` (3), a system break message is injected into the session and the counter resets:
53
-
54
- ```js
55
- const resultObj = typeof result === 'object' && result !== null ? result : null;
56
- const toolFailed = toolStatus === 'error' || (resultObj && resultObj.status === 'error');
57
- if (toolFailed) {
58
- consecutiveFailures++;
59
- } else {
60
- consecutiveFailures = 0;
61
- }
62
- ```
63
-
64
- ```js
65
- if (consecutiveFailures >= CONSECUTIVE_FAILURE_THRESHOLD) {
66
- session.messages.push({
67
- role: 'user',
68
- content: '[System: You have had 3 or more consecutive tool failures. Stop retrying the same approach. Either pivot to a fundamentally different strategy or provide your final response explaining what failed and why.]',
69
- });
70
- consecutiveFailures = 0;
71
- }
72
- ```
73
-
74
- This complements the existing exact-match loop detector — together they cover both identical repetition and semantic failure loops.
75
-
76
- ### 2. `failedApproaches` in checkpoint schema (`src/server/agent.js`)
77
-
78
- Updated `WRAP_UP_NOTE` to request a `failedApproaches` array alongside `progress` and `remaining`:
79
-
80
- ```json
81
- {
82
- "checkpoint": {
83
- "progress": "...",
84
- "remaining": "...",
85
- "failedApproaches": ["downloading subfinder via curl from GitHub releases — connection reset", "..."]
86
- }
87
- }
88
- ```
89
-
90
- When resuming after a handoff, the agent loop now appends a system note to the resume message listing every failed approach from the previous run:
91
-
92
- ```js
93
- let resumeContent = run.checkpoint.remaining || 'Continue with the task.';
94
- if (run.checkpoint.failedApproaches && run.checkpoint.failedApproaches.length > 0) {
95
- resumeContent += `\n\n[System: The following approaches were tried and failed in the previous run — do not repeat them:\n${run.checkpoint.failedApproaches.map((a, i) => `${i + 1}. ${a}`).join('\n')}]`;
96
- }
97
- ```
98
-
99
- The next run starts with concrete knowledge of what not to try, rather than repeating the same failed strategies.
100
-
101
- ### 3. Iteration budget awareness (`src/server/agent.js`)
102
-
103
- Each model request now includes the remaining iteration count when 5 or fewer iterations are left in the run. The count is appended to the prepared messages (never stored in `session.messages`):
104
-
105
- ```js
106
- const iterationsLeft = config.maxIterations - iteration + 1;
107
- const preparedMessages = iterationsLeft <= 5
108
- ? [...base, { role: 'user', content: `[System: ${iterationsLeft} iteration${iterationsLeft === 1 ? '' : 's'} remaining in this run. Budget your remaining steps accordingly — if you cannot finish in time, consolidate progress and provide a checkpoint.]` }]
109
- : base;
110
- ```
111
-
112
- This gives the model enough warning to consolidate and produce a clean checkpoint rather than being cut off mid-task.
113
-
114
- ### 4. `perplexity_search` description updated (`src/server/tools.js`)
115
-
116
- Added explicit usage guidance to the tool description:
117
-
118
- > Use sparingly — at most 3 searches per topic. Do not repeat the same query with minor variations; if an initial search does not yield what you need, switch to a different approach or verify locally with exec.
119
-
120
- ### 5. "Failure Recovery" section added to system prompt (`docs/system-prompt.md`)
121
-
122
- Added a dedicated `## Failure Recovery` section making the give-up rule explicit:
123
-
124
- - Retry at most once with a meaningfully different approach; if it fails again, report to the user
125
- - Never repeat a failed strategy with minor variations
126
- - Use `perplexity_search` at most 3 times per topic
127
- - Escalate cleanly with a useful failure report rather than looping
128
-
129
- ---
130
-
131
- ## What Was Not Changed
132
-
133
- - The existing exact-match loop detector (`loopTracker`) — it remains and now works alongside the new consecutive failure detector
134
- - The checkpoint/handoff system, `maxHandoffs` limit, or tool history strip logic — unchanged
135
- - The `format_error` recovery path (fallback model + nudge retry) — unchanged
136
- - No per-session Perplexity call counter was added server-side; the guidance is model-facing
137
-
138
- ---
139
-
140
- ## Note: Homebrew as an Alternative for Binary Tool Installation
141
-
142
- During the session that exposed these issues, the agent struggled to install Nuclei, Subfinder, and Naabu from GitHub releases. All three tools from [ProjectDiscovery](https://projectdiscovery.io/) are available via Homebrew:
143
-
144
- ```
145
- brew install nuclei
146
- brew install subfinder
147
- brew install naabu
148
- ```
149
-
150
- Homebrew handles binary verification, PATH setup, and version management automatically — far more reliably than manual `curl` downloads or `go install`. On macOS (and Linux via Linuxbrew), this is the recommended installation method. The agent could discover this via `perplexity_search` or by checking `brew search projectdiscovery` — but only if it knows to try Homebrew _before_ attempting manual downloads.
151
-
152
- This is a guidance gap rather than a code bug: the system prompt doesn't mention package managers as a preferred strategy for binary installation. A future improvement could add: "When installing CLI tools, check for a package manager installation first (`brew install`, `apt install`, `snap install`) before attempting manual downloads."
153
-
154
- ---
155
-
156
- ## Outcome
157
-
158
- - Failure loops are now intercepted after 3 consecutive failures instead of running to exhaustion
159
- - Handoff runs start with knowledge of what has already failed, enabling genuine strategy pivots
160
- - The model receives iteration budget warnings before hitting the wrap-up call
161
- - Perplexity search overuse is constrained by both tool description and system prompt guidance
162
- - The system prompt now has an explicit rule for when to stop retrying and report
@@ -1,128 +0,0 @@
1
- # Finding 005: 60-Second Exec Timeout Breaks Package Installation
2
-
3
- **Date:** 2026-02-27
4
- **Severity:** High — any package installation via exec will time out, regardless of which package manager is used
5
- **Status:** Fixed
6
-
7
- ---
8
-
9
- ## What Happened
10
-
11
- In the session analysed in Finding 004, the agent attempted to install Nuclei, Subfinder, and Naabu using several strategies — `go install`, direct `curl` downloads, tarball extraction. All either timed out or failed with network errors.
12
-
13
- The natural follow-up question was: would switching to a proper package manager (`brew`, `apt-get`) solve the problem? The answer is no — not without fixing the timeout first.
14
-
15
- ---
16
-
17
- ## Root Cause
18
-
19
- ### The `exec` tool has a hard 60-second timeout
20
-
21
- ```js
22
- // exec seed tool
23
- const { stdout, stderr } = await execAsync(args.cmd, {
24
- encoding: 'utf8',
25
- timeout: 60000, // ← 60 seconds
26
- maxBuffer: 2 * 1024 * 1024,
27
- });
28
- ```
29
-
30
- On top of that, `executeTool` wraps every tool in its own `Promise.race` against `TOOL_TIMEOUT_MS = 60_000`. This means even if a tool's internal `execAsync` had a longer timeout, the outer race would kill it at 60 seconds anyway.
31
-
32
- Package installation routinely takes longer than 60 seconds:
33
-
34
- | Operation | Typical duration |
35
- |---|---|
36
- | `apt-get update` | 15–60s (varies with server speed, mirror load) |
37
- | `apt-get install nuclei` | 10–60s (binary download + extraction) |
38
- | `apt-get update && apt-get install nuclei` | 30–120s combined |
39
- | `brew install nuclei` | 20–90s |
40
- | `go install github.com/...` | 60–180s (compilation) |
41
-
42
- So `go install` was almost guaranteed to time out. `apt-get` with an update step would regularly exceed 60s too. Swapping the install method without fixing the timeout would not have solved the problem.
43
-
44
- ### The `npm_install` tool has the same problem
45
-
46
- `npm_install` also uses a 60-second timeout — it would fail for large packages or on slow networks for the same reason.
47
-
48
- ### No per-tool timeout mechanism existed
49
-
50
- All tools shared the same `TOOL_TIMEOUT_MS` constant. There was no way to declare that a specific tool legitimately needed more time.
51
-
52
- ---
53
-
54
- ## Fix
55
-
56
- ### 1. Per-tool timeout override (`src/server/tools.js`)
57
-
58
- `executeTool` now checks for a `timeout` property on the tool definition and uses it instead of the global `TOOL_TIMEOUT_MS`:
59
-
60
- ```js
61
- const timeoutMs = tool.timeout || TOOL_TIMEOUT_MS;
62
-
63
- const timeout = new Promise((_, reject) =>
64
- setTimeout(
65
- () => reject(new Error(`Tool '${name}' timed out after ${timeoutMs / 1000}s`)),
66
- timeoutMs
67
- )
68
- );
69
- ```
70
-
71
- Tools that don't declare a timeout continue to use the 60s default — no behaviour change for existing tools.
72
-
73
- ### 2. `system_install` seed tool with 5-minute timeout (`src/server/tools.js`)
74
-
75
- A new built-in tool handles system binary installation:
76
-
77
- ```js
78
- system_install: {
79
- timeout: 300_000, // 5 minutes
80
- ...
81
- }
82
- ```
83
-
84
- Behaviour:
85
- - **Already installed check**: runs `which <package>` first. If the binary is already on PATH, returns immediately without installing — no wasted time.
86
- - **Auto-detection**: tries `brew`, `apt-get`, `snap` in order. Uses the first one found. Can be overridden with an explicit `packageManager` argument.
87
- - **apt-get**: always runs `apt-get update -qq` before `apt-get install -y` to avoid stale package list failures. Uses `DEBIAN_FRONTEND=noninteractive` to suppress interactive prompts.
88
- - **Inner timeout**: 4.5 minutes on the internal `execAsync`, leaving headroom before the outer 5-minute tool timeout fires.
89
- - **Structured errors**: returns `exitCode`, `stdout`, `stderr` on failure so the agent can read the actual error without guessing.
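The tool body itself is elided above; a condensed sketch of the behaviour described in this list (helper names and the exact shell strings are assumptions):

```js
const { exec } = require('child_process');
const { promisify } = require('util');
const execAsync = promisify(exec);

const INNER_TIMEOUT = 270_000; // 4.5 min, inside the 5-minute outer tool timeout

async function systemInstall(args) {
  const pkg = args.package;

  // 1. Already installed? `which` exits non-zero when the binary is missing.
  try {
    const { stdout } = await execAsync(`which ${pkg}`, { timeout: 10_000 });
    if (stdout.trim()) return { status: 'ok', message: `${pkg} already installed at ${stdout.trim()}` };
  } catch { /* not on PATH, continue with installation */ }

  // 2. Pick a package manager: explicit override, otherwise first of brew/apt-get/snap found.
  const available = async (mgr) => {
    try { await execAsync(`which ${mgr}`, { timeout: 10_000 }); return true; } catch { return false; }
  };
  let manager = args.packageManager;
  if (!manager) {
    for (const candidate of ['brew', 'apt-get', 'snap']) {
      if (await available(candidate)) { manager = candidate; break; }
    }
  }
  if (!manager) return { status: 'error', message: 'No supported package manager found (brew/apt-get/snap)' };

  // 3. Build the install command; apt-get always refreshes the package list first.
  const commands = {
    brew: `brew install ${pkg}`,
    'apt-get': `DEBIAN_FRONTEND=noninteractive apt-get update -qq && DEBIAN_FRONTEND=noninteractive apt-get install -y ${pkg}`,
    snap: `snap install ${pkg}`,
  };

  // 4. Run with the inner timeout and return structured errors on failure.
  try {
    const { stdout, stderr } = await execAsync(commands[manager], {
      timeout: INNER_TIMEOUT,
      maxBuffer: 2 * 1024 * 1024,
    });
    return { status: 'ok', manager, stdout, stderr };
  } catch (err) {
    return { status: 'error', manager, exitCode: err.code, stdout: err.stdout, stderr: err.stderr };
  }
}
```

In the actual tool definition this logic would sit alongside the `timeout: 300_000` declaration shown above, so the per-tool override from fix 1 applies.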
90
-
91
- ### 3. System prompt updated (`docs/system-prompt.md`)
92
-
93
- Added explicit guidance in the "Tool Creation" section:
94
-
95
- > **Installing a system binary** (e.g. nuclei, jq, ffmpeg, git): use the `system_install` tool — never use exec for this. It auto-detects the available package manager (brew/apt-get/snap) and has a 5-minute timeout sized for real downloads.
96
-
97
- This mirrors the existing `npm_install` guidance and gives the model an explicit directive to reach for `system_install` over `exec`.
98
-
99
- ---
100
-
101
- ## Why `system_install` Instead of Increasing the Global Timeout
102
-
103
- Increasing `TOOL_TIMEOUT_MS` globally would make every tool — including `exec` — hang for much longer before failing. A runaway `exec` command (e.g. `find / -name "*.js"`) that produces no output would keep the agent waiting for 5 minutes instead of 60 seconds before failing. The per-tool timeout keeps the safe default for general tools while letting installation tools declare a legitimate need for more time.
104
-
105
- ---
106
-
107
- ## Notes on Package Availability
108
-
109
- Not all tools are available in every package manager. For ProjectDiscovery tools (nuclei, subfinder, naabu) specifically:
110
-
111
- - **macOS (brew)**: `brew install nuclei`, `brew install subfinder`, `brew install naabu` — all available via the official Homebrew formulae
112
- - **Linux (apt-get)**: ProjectDiscovery does not maintain an official apt repository. `apt-get install nuclei` will likely fail with "package not found"
113
- - **Linux (snap)**: `snap install nuclei` is available on Ubuntu/Debian
114
-
115
- On Linux, if `system_install` fails because the package isn't in the apt repository, the agent should try snap, or fall back to downloading the release binary directly:
116
- ```sh
117
- curl -sL https://github.com/projectdiscovery/nuclei/releases/latest/download/nuclei_linux_amd64.zip -o nuclei.zip
118
- ```
119
- This would still use `exec` for the download, but with the failure recovery guidance from Finding 004 the agent should report the failure rather than looping.
120
-
121
- ---
122
-
123
- ## Outcome
124
-
125
- - Package installation no longer races against a 60-second wall
126
- - The agent has a clear, named tool to reach for when installing system binaries
127
- - The system prompt actively directs the agent away from `exec` for this use case
128
- - Per-tool timeouts are now supported for any future tools that need them