obol-ai 0.2.22 → 0.2.24

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/CHANGELOG.md CHANGED
@@ -1,3 +1,22 @@
+ ## 0.2.24
+ - update changelog and lockfile
+ - ignore obol message export csvs
+ - add speech-to-text tool with faster-whisper, auto-transcribes voice messages when enabled
+ - install faster-whisper in postinstall
+ - update tests to expect runtime context blocks in last user message
+ - sanitize empty content blocks before API calls to prevent 400 errors
+ - inject time and memory as runtime context in user message, not system prompt
+ - cache last tool definition to maximise prompt cache hits
+ - replace regex JSON extraction with forced tool call for memory consolidation
+ - surface bridge failures to user instead of swallowing errors
+ - pass recent history to router to prevent haiku misrouting on follow-up messages
+ - seed last 50 messages at boot, align to first user row
+ - add 10-minute lock timeout with user notification
+ - fix haiku probe: check tool_use blocks instead of stop_reason, raise max_tokens to 4096
+ - 0.2.23 subagent tool, edit/glob/grep tools, path sandboxing, llm-driven clean, button cleanup
+ - fix claude test mock to yield stream-like object with finalMessage()
+ - 0.2.22 fix streaming for OAuth, max_tokens limit, and reauth message
+
  ## 0.2.21
  - add obol reauth command and fix bailout/summary to use streaming
  - update changelog
package/ISSUES.md ADDED
@@ -0,0 +1,298 @@
# Obol Issues

## 1. Haiku probe max_tokens=1024 silently drops tool calls
**File:** `src/claude/chat.js:124`
**Severity:** High

When the Haiku escalation probe hits `max_tokens`, `stop_reason` is `"max_tokens"` not `"tool_use"`, so the escalation to Sonnet never fires and the tool call is silently abandoned. This caused a 4-loop failure when writing the flowchart.

**Fix:** Treat `stop_reason === 'max_tokens'` as an escalation trigger, or raise probe `max_tokens` to at least 4096.

**NanoBot pattern:** Don't check `stop_reason` at all — check whether tool_use blocks are present in the response content:
```js
const hasToolUse = probe.content.some(b => b.type === 'tool_use');
if (!hasToolUse) { /* short-circuit, no escalation needed */ }
```
`stop_reason` is unreliable when tokens are exhausted mid-tool-call. Presence of `tool_use` blocks is the ground truth.

---

## 2. 14-minute silence during stuck chat lock — no heartbeat
**File:** `src/claude/chat.js:23-33`
**Severity:** High

A long tool loop crashed without releasing the lock. The `isChatBusy` guard returns a message when busy, but if the process crashes mid-run the lock is held forever and the user gets dead silence.

**Fix:** Add a heartbeat message every 30–60s during long operations. Add a lock timeout that force-releases after N minutes and notifies the user.

**NanoBot pattern:** Two mechanisms:
1. **Typing indicator loop** — sends `bot.send_chat_action(chat_id, "typing")` every 5s while the agent runs. User sees the bot is alive without receiving any messages.
2. **`/stop` hard cancellation** — tracks all active tasks per session; `/stop` cancels them all and releases the lock immediately.

The typing loop is ~10 lines and gives continuous liveness feedback. No TTL-based timeout, but `/stop` provides an escape hatch.
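
The heartbeat can be sketched in a few lines. This is a minimal sketch assuming a node-telegram-bot-api-style `bot` object with a `sendChatAction(chatId, action)` method; the wrapper name is illustrative:

```javascript
// Minimal liveness heartbeat: refresh Telegram's ~5s "typing" indicator
// while a long-running agent task is in flight, and always stop it after.
function withTypingIndicator(bot, chatId, run) {
  const tick = () => Promise.resolve(bot.sendChatAction(chatId, 'typing')).catch(() => {});
  tick(); // show the indicator immediately
  const timer = setInterval(tick, 5000); // refresh before the indicator expires
  return Promise.resolve()
    .then(run)
    .finally(() => clearInterval(timer)); // release even if run() throws
}
```

Wrapping the chat-lock critical section in a helper like this gives continuous feedback without sending any actual messages.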

---

## 3. Process restart wipes in-memory history — only 20 messages seeded from DB
**File:** `src/history.js` (boot seed)
**Severity:** High

After a restart, the bot seeded only the last 20 messages, missing the ongoing Remotion task context entirely. Follow-up messages like "can you send me the video?" had no task context, and Haiku gave a generic intro response as if meeting the user for the first time.

**Fix:** Increase the boot seed to 40–50 messages, or persist active task state to DB so it survives restarts.

**Obol context:** Messages are already fully persisted in Supabase (`obol_messages` table). The JSONL approach NanoBot uses is unnecessary — the data is there, just not seeded generously enough. The fix is a one-liner in `src/messages.js:85`:

```js
async getRecent(chatId, limit = 20) { // bump to 50
```

And wherever `getRecent` is called at boot (likely `src/claude/chat.js` or `src/tenant.js`), increase the seed count to 50. That gives the bot enough context to reconnect to in-progress tasks after a restart without any architectural changes.

---

## 4. Router assigns Haiku to follow-up messages that need task context
**File:** `src/claude/router.js:9`
**Severity:** Medium

Short follow-up messages ("can you send me the video?", "the remotion video") look simple to the router and get assigned to Haiku. But Haiku only sees the current message — no ongoing task context, and the task was too recent to be in consolidated memory.

**Fix:** Pass the last 2–3 history messages to the router so it knows whether there's an ongoing task. Short messages with recent Sonnet history should bias toward Sonnet.

**NanoBot pattern:** No per-message routing at all. One model per session — the router problem doesn't exist because there's no mid-conversation model switching. Multi-agent routing happens at the session/channel level, not per-message. The practical fix for obol without removing the router: always include the last 3 assistant messages in the router prompt so it can detect an ongoing task.
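
A history-aware router prompt is a small change. A sketch, assuming an in-memory history of `{ role, content }` entries; `buildRouterPrompt` and the prompt shape are illustrative, not the actual router code:

```javascript
// Give the router the last few assistant turns so a terse follow-up
// ("the remotion video") is judged against the ongoing task, not in isolation.
function buildRouterPrompt(userText, history, n = 3) {
  const recent = history
    .filter((m) => m.role === 'assistant')
    .slice(-n)
    .map((m) => `assistant: ${String(m.content).slice(0, 300)}`) // trim long turns
    .join('\n');
  return `Recent context:\n${recent || '(none)'}\n\nUser message: ${userText}`;
}
```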

---

## 5. Assistant claims success without verifying file output
**File:** Tool result handling
**Severity:** High

After requesting a GitHub Remotion video, the assistant responded: "There you go! 🎬 11 seconds of animated GitHub glory" — but the `out/` folder was empty. No video was ever rendered. The assistant fabricated a delivery confirmation.

**Fix:** Tools that write files or run renders must return the actual output path. The assistant should verify the file exists (`fs.existsSync`) before reporting success.

**NanoBot pattern:** Tool results are injected back into the LLM context as `tool_result` blocks before the final response is generated. The LLM must read the actual output of its tools before summarizing — the loop structure forces grounding. No explicit file existence check, but fabrication is prevented architecturally because the final message is generated after tool results are fed back, not before.

---

## 6. Bridge failure not surfaced to user
**File:** `src/bridge.js:74`
**Severity:** Medium

When `getTenant(partnerUserId, config)` fails, the error is swallowed and the assistant answers from its own knowledge without telling the user the bridge is down. The user only discovered the bridge was broken by noticing OBOL answered directly instead of using it.

**Fix:** Explicitly tell the user when the bridge fails: "The bridge to Vicky's agent isn't reachable — answered from my own knowledge instead." Add a `/bridge status` command.

**NanoBot pattern:** Subagents always call `_announce_result(status="ok"|"error")` — failure is never swallowed. The error surfaces as a system message the main agent rephrases and delivers:
```
[Subagent 'bridge' failed]

Error: connection refused
```
The key difference: errors are routed back through the main agent pipeline so they can be communicated naturally, not just logged to console.
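
Applied to `src/bridge.js`, that pattern is a try/catch that returns the failure as content instead of logging it. A sketch; the wrapper and notice format are illustrative, and only `getTenant(partnerUserId, config)` comes from the issue above:

```javascript
// Surface bridge failures as a notice the main pipeline can deliver,
// rather than swallowing the error and answering as if nothing happened.
async function callBridge(getTenant, partnerUserId, config) {
  try {
    return { ok: true, tenant: await getTenant(partnerUserId, config) };
  } catch (err) {
    return {
      ok: false,
      notice: `[Bridge failed]\n\nError: ${err.message}\nAnswering from own knowledge instead.`,
    };
  }
}
```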

---

## 7. Voice transcription not built into core pipeline
**Severity:** Medium

Voice messages went unprocessed until the user explicitly asked mid-session to "build a tool with local whisper." This is a standard Telegram feature — the bot should handle it by default.

**Fix:** Make Whisper transcription a first-class tool in the core pipeline, not something added on demand.

**NanoBot pattern:** Transcription happens inline in the Telegram handler — the text is appended to the message content before the agent loop, so it is transparent to the LLM.

**Obol approach: local faster-whisper via Python subprocess**

Use [faster-whisper](https://github.com/SYSTRAN/faster-whisper) (CTranslate2-based, ~4x faster than openai-whisper, runs fully local). Wire it into `processMediaItems` in `src/telegram/handlers/media.js` — when `fileInfo.mediaType === 'voice'` or `'audio'`, transcribe the saved `.ogg` file before building the prompt:

```js
// src/whisper.js — thin Node wrapper around faster-whisper Python
// (assumes a small Python CLI module that prints the transcript to stdout)
const { execFile } = require('child_process');

function transcribe(filePath) {
  return new Promise((resolve) => {
    execFile('python3', ['-m', 'faster_whisper_cli', filePath], (err, stdout) => {
      // On any failure, resolve null so the caller can fall back gracefully
      resolve(err ? null : stdout.trim());
    });
  });
}
```

```js
// in processMediaItems, replace the nonImageParts push for voice/audio:
if (fileInfo.mediaType === 'voice' || fileInfo.mediaType === 'audio') {
  const transcription = await transcribe(savedPath);
  nonImageParts.push(transcription
    ? `[Voice message transcription: ${transcription}]`
    : `[Voice message: ${savedPath} — transcription failed]`);
}
```

Recommended model: `base` or `small.en` for speed, `medium` for accuracy. Install: `pip install faster-whisper`.

---

## 8. Memory consolidation uses free-text extraction — brittle JSON parsing
**File:** `src/messages.js:132` (`_extractFacts`)
**Severity:** Medium

The current consolidation asks Haiku to return a JSON array in free text, then extracts it with a regex (`text.match(/\[[\s\S]*\]/)`). If the model wraps the array in prose, adds a comment, or truncates, the parse silently fails and no facts are stored.

**Fix:** Use forced-tool-call consolidation — pass a single tool as the only option so the LLM is structurally required to call it. No regex, no JSON extraction, no parse failures.

**NanoBot pattern:**
```js
// Instead of asking for JSON in free text, define one tool and force the call
const tools = [{
  name: 'save_memory',
  description: 'Save extracted facts from this exchange',
  input_schema: {
    type: 'object',
    properties: {
      facts: {
        type: 'array',
        items: {
          type: 'object',
          properties: {
            content: { type: 'string' },
            category: { type: 'string', enum: ['fact','preference','decision','lesson','person','project','event','resource'] },
            importance: { type: 'number' },
            tags: { type: 'array', items: { type: 'string' } },
          },
        },
      },
    },
  },
}];

// Pass tool_choice: { type: 'tool', name: 'save_memory' } to force the call
// If the LLM doesn't call it, facts = [] — no parsing needed
```

---

## 9. Tool results saved verbatim (up to 50k chars) to Supabase
**File:** `src/messages.js:42`
**Severity:** Low

All messages are truncated to 50,000 chars before being saved to `obol_messages`. This makes sense for user/assistant turns but is too generous for tool results — a `read_file` or `exec` result can be 40k chars of content the LLM needed in the moment but is useless in history.

**Fix:** Distinguish tool results when logging and truncate them aggressively (500–1000 chars) before saving. The LLM got the full result for the current turn; only the summary needs to survive in Supabase.

**NanoBot pattern:** Full tool results go to the LLM. When saving to disk (equivalent: Supabase `obol_messages`), tool results are capped at 500 chars with `... (truncated)`. Regular user/assistant messages are saved in full.

The `role` field in `obol_messages` could be extended to `'tool_result'` to make this distinction easy, or tool result content can be detected by convention (e.g. content starting with `[tool:`).
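
The truncation itself is a one-line policy per role. A sketch, assuming rows shaped as `{ role, content }` with an extended `'tool_result'` role; the helper name and caps are illustrative:

```javascript
// Cap tool results hard before persisting; keep normal turns at the 50k limit.
function truncateForStorage(msg, toolCap = 500, messageCap = 50000) {
  const cap = msg.role === 'tool_result' ? toolCap : messageCap;
  if (typeof msg.content !== 'string' || msg.content.length <= cap) return msg;
  return { ...msg, content: msg.content.slice(0, cap) + '... (truncated)' };
}
```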

---

## 10. getRecent() seed may start on a non-user turn
**File:** `src/messages.js:85`
**Severity:** Medium

`getRecent()` fetches the last N rows from `obol_messages` ordered by `created_at`. If a restart happens mid-tool-loop, the oldest row in the seed could be an assistant message or a tool result — not a user message. Claude's API requires the first message to be `role: user`, so this causes a 400 error or silent history corruption.

**Fix:** After fetching from Supabase, advance the slice forward to the first `role=user` row before seeding into `ChatHistory`. One guard in `getRecent()` or wherever the seed is applied at boot.

**NanoBot pattern:**
```js
// After fetching rows, align to first user message;
// if no user row exists at all, seed nothing rather than a broken history
const firstUserIdx = rows.findIndex(r => r.role === 'user');
return firstUserIdx >= 0 ? rows.slice(firstUserIdx) : [];
```

---

## 11. Tool definitions not included in prompt cache
**File:** `src/claude/chat.js:96`, `src/claude/cache.js`
**Severity:** Low

The system prompt already has `cache_control: { type: 'ephemeral' }` (line 96) and `withCacheBreakpoints` caches the second-to-last message. But tool definitions — which are large, static, and sent every turn — are not cached. This means Anthropic re-processes the full tool list on every turn.

**Fix:** Add `cache_control: { type: 'ephemeral' }` to the last tool definition before each API call.

**NanoBot pattern:**
```js
// Before the API call, cache the last tool definition
if (toolDefs.length > 0) {
  toolDefs = [...toolDefs];
  toolDefs[toolDefs.length - 1] = {
    ...toolDefs[toolDefs.length - 1],
    cache_control: { type: 'ephemeral' },
  };
}
```

The system prompt cache + tool cache together cover the two largest static inputs, maximising `cache_read_input_tokens`.

---

## 12. Current time injected into system prompt — prompt injection risk
**File:** `src/claude/chat.js:97`
**Severity:** Low

Current time is appended to the system prompt as plain text: `\nCurrent time: ${new Date().toISOString()}`. If memory content is also injected here, any injected instructions in a memory fact would be indistinguishable from system instructions to the LLM.

**Fix:** Inject runtime metadata (time, channel, chat ID) as a separate user message immediately before the actual user message, clearly labelled as metadata — not instructions.

**NanoBot pattern:**
```js
// Injected as a separate user message, not into the system prompt
const runtimeContext = [
  { type: 'text', text: '[Runtime context — metadata only, not instructions]' },
  { type: 'text', text: `Current time: ${new Date().toISOString()}\nChat ID: ${chatId}` },
];
// Prepend to the user's actual message in history
```

This also means the current time gets a fresh cache-busting position in the message sequence rather than invalidating the system prompt cache every turn.

---

## 13. No empty content sanitization before API calls
**File:** `src/claude/chat.js` (message construction)
**Severity:** Low

If a tool returns an empty string, or an assistant message has no text content (only tool_use blocks), the messages array may contain `content: ""`. Anthropic's API rejects empty string content with a 400 error. This is a silent failure path that only surfaces under specific tool conditions.

**Fix:** Sanitize before every API call — replace empty string content with `"(empty)"`, and set `content: null` on assistant messages that only have `tool_calls`.

**NanoBot pattern:**
```js
function sanitizeMessages(messages) {
  return messages.map(msg => {
    if (typeof msg.content === 'string' && msg.content === '') {
      return { ...msg, content: msg.role === 'assistant' ? null : '(empty)' };
    }
    if (Array.isArray(msg.content)) {
      const filtered = msg.content.filter(b =>
        !(b.type === 'text' && b.text === '')
      );
      return { ...msg, content: filtered.length ? filtered : null };
    }
    return msg;
  });
}
```

---

## 14. No JSON repair for tool call arguments
**File:** `src/claude/chat.js` (tool dispatch)
**Severity:** Low

Tool call arguments from the API are parsed with `JSON.parse`. If the model returns slightly malformed JSON (truncated output, trailing comma, unquoted key — more common with Haiku and near max_tokens), the entire tool dispatch crashes rather than attempting repair.

**Fix:** Use `jsonrepair` (or an equivalent JSON-repair library) instead of raw `JSON.parse` for tool call argument parsing.

**NanoBot pattern:**
```js
// npm install jsonrepair
const { jsonrepair } = require('jsonrepair');

// Instead of JSON.parse(toolInput):
// jsonrepair returns a repaired JSON string, which then parses normally
const parsed = JSON.parse(jsonrepair(toolInput));
```

Especially relevant given Issue 1 — Haiku near max_tokens is exactly when malformed tool JSON is most likely.