openclacky 1.0.2 → 1.0.4

Files changed (47)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +31 -0
  3. data/benchmark/fixtures/sample_project/Gemfile +3 -0
  4. data/benchmark/fixtures/sample_project/lib/api_handler.rb +32 -0
  5. data/benchmark/fixtures/sample_project/lib/order_calculator.rb +23 -0
  6. data/benchmark/fixtures/sample_project/lib/user_renderer.rb +20 -0
  7. data/benchmark/fixtures/sample_project/spec/order_calculator_spec.rb +20 -0
  8. data/benchmark/results/EVALUATION_REPORT.md +165 -0
  9. data/benchmark/results/baseline_20260511_174424.json +128 -0
  10. data/benchmark/results/report_20260511_175256.json +271 -0
  11. data/benchmark/results/report_20260511_175444.json +271 -0
  12. data/benchmark/results/treatment_20260511_175103.json +130 -0
  13. data/benchmark/runner.rb +441 -0
  14. data/docs/proposals/2026-05-11-system-prompt-alignment.md +325 -0
  15. data/docs/proposals/2026-05-12-memory-mechanism-optimization.md +89 -0
  16. data/lib/clacky/agent/cost_tracker.rb +8 -2
  17. data/lib/clacky/agent/llm_caller.rb +218 -0
  18. data/lib/clacky/agent/memory_updater.rb +41 -30
  19. data/lib/clacky/agent/message_compressor.rb +15 -4
  20. data/lib/clacky/agent/message_compressor_helper.rb +41 -2
  21. data/lib/clacky/agent/skill_manager.rb +5 -2
  22. data/lib/clacky/agent/skill_reflector.rb +10 -1
  23. data/lib/clacky/agent/tool_registry.rb +109 -0
  24. data/lib/clacky/agent.rb +20 -0
  25. data/lib/clacky/agent_config.rb +17 -0
  26. data/lib/clacky/cli.rb +65 -0
  27. data/lib/clacky/client.rb +15 -0
  28. data/lib/clacky/default_agents/base_prompt.md +20 -20
  29. data/lib/clacky/default_agents/coding/system_prompt.md +51 -1
  30. data/lib/clacky/default_skills/channel-setup/SKILL.md +113 -5
  31. data/lib/clacky/default_skills/channel-setup/import_lark_skills.rb +97 -0
  32. data/lib/clacky/default_skills/onboard/SKILL.md +1 -1
  33. data/lib/clacky/default_skills/persist-memory/SKILL.md +59 -0
  34. data/lib/clacky/providers.rb +48 -6
  35. data/lib/clacky/server/channel/adapters/weixin/adapter.rb +7 -0
  36. data/lib/clacky/server/channel/channel_manager.rb +91 -0
  37. data/lib/clacky/server/discover.rb +77 -0
  38. data/lib/clacky/server/epipe_safe_io.rb +105 -0
  39. data/lib/clacky/server/http_server.rb +121 -41
  40. data/lib/clacky/server/server_master.rb +6 -0
  41. data/lib/clacky/skill.rb +30 -0
  42. data/lib/clacky/utils/file_processor.rb +71 -0
  43. data/lib/clacky/version.rb +1 -1
  44. data/lib/clacky/web/app.css +58 -22
  45. data/lib/clacky/web/i18n.js +4 -2
  46. data/lib/clacky/web/sessions.js +29 -17
  47. metadata +33 -2
@@ -0,0 +1,325 @@
1
+ # Proposal: System Prompt Alignment with Claude Code
2
+
3
+ **Author:** Claude (assistant)
4
+ **Date:** 2026-05-11
5
+ **Branch:** `feat/system-prompt-alignment`
6
+ **Status:** Proposal
7
+ **Scope:** `lib/clacky/default_agents/base_prompt.md`, `lib/clacky/default_agents/coding/system_prompt.md`
8
+
9
+ ---
10
+
11
+ ## 1. Background & Motivation
12
+
13
+ OpenClacky's positioning is **"the most token-efficient open-source AI agent, with capabilities aligned to Claude Code"**. While the Harness layer (cache, compression, tool registry) achieves parity or better on cost metrics, the **system prompt layer** remains a significant gap.
14
+
15
+ The system prompt is the behavioral contract between the Agent and the LLM. A weak system prompt causes:
16
+ - **Suboptimal tool selection** (e.g., using `Write` for a 2-line change instead of `Edit`)
17
+ - **Token waste** (verbose explanations, unnecessary comments, redundant narration)
18
+ - **Safety issues** (destructive git operations, overly broad file staging)
19
+ - **Lower task completion rate** on complex multi-step tasks
20
+
21
+ This proposal targets the system prompt as a **high-leverage, low-risk improvement** that directly impacts both cost (fewer tokens per task) and capability (higher task completion rate).
22
+
23
+ ---
24
+
25
+ ## 2. Current State Analysis
26
+
27
+ ### 2.1 `base_prompt.md` (Universal behavioral rules)
28
+
29
+ ```
30
+ Lines: 36
31
+ Coverage: General behavior, Tool usage rules, TODO manager rules, Long-term memory
32
+ ```
33
+
34
+ **What it does well:**
35
+ - TODO manager workflow is explicit and actionable
36
+ - "USE TOOLS to create/modify files" is correctly emphasized
37
+ - "glob > find" rule is present
38
+
39
+ **Critical gaps:**
40
+
41
+ | Gap | Impact | Evidence |
42
+ |-----|--------|----------|
43
+ | No `Edit > Write` priority rule | Agent rewrites entire files for small changes, wasting tokens | Common user complaint in complex refactoring tasks |
44
+ | No comment/response style rules | Verbose responses, unnecessary explanations, emoji usage | Inflates token count on every turn |
45
+ | No Git safety protocol | `git add -A`, `git commit --amend`, force push risks | Potential data loss, security issues |
46
+ | No code style guidelines | Multi-line docstrings, "added for X flow" comments | Code quality degradation over time |
47
+ | No error handling philosophy | Validates impossible scenarios, overly defensive code | Unnecessary complexity, more tokens |
48
+ | No response structure rules | "Let me..." prefixes, trailing summaries, diff narration | Poor UX, token waste |
49
+ | No task tracking discipline | Multiple in-progress tasks, missing TodoWrite updates | Task state confusion |
50
+
51
+ ### 2.2 `coding/system_prompt.md` (Role definition)
52
+
53
+ ```
54
+ Lines: 18
55
+ Coverage: Role description, working process
56
+ ```
57
+
58
+ **What it does well:**
59
+ - Clear role definition ("AI coding assistant and technical co-founder")
60
+ - "Read existing code before making changes" is correct
61
+
62
+ **Critical gaps:**
63
+
64
+ | Gap | Impact |
65
+ |-----|--------|
66
+ | No explicit "Claude Code alignment" goal | Agent doesn't know it's competing with Claude Code on behavior |
67
+ | No file modification priorities | Same as base_prompt gap |
68
+ | No security awareness | Agent unaware of OWASP risks, injection vulnerabilities |
69
+ | No testing expectation | Agent often skips running tests after changes |
70
+ | No UI/frontend-specific rules | For fullstack tasks, lacks guidance on testing UI changes |
71
+
72
+ ### 2.3 `general/system_prompt.md` (Non-coding agent)
73
+
74
+ ```
75
+ Lines: 17
76
+ Coverage: General digital employee role
77
+ ```
78
+
79
+ **Gaps:** Similar to coding agent but for general tasks — lacks tool usage priorities, response style, and safety guidelines.
80
+
81
+ ---
82
+
83
+ ## 3. Target State (Claude Code Reference)
84
+
85
+ Claude Code's system prompt is approximately **800-1200 lines** of dense behavioral rules, covering:
86
+
87
+ 1. **Doing tasks** — How to interpret instructions, when to ask questions
88
+ 2. **Code style** — Comment rules, naming, error handling, no emoji
89
+ 3. **Tool usage** — Priorities, fallbacks, when to use which tool
90
+ 4. **Git safety** — Explicit do's and don'ts
91
+ 5. **Response style** — Conciseness rules, formatting, no trailing summaries
92
+ 6. **Task tracking** — TodoWrite discipline, ONE in_progress rule
93
+ 7. **Security** — XSS, SQL injection, command injection prevention
94
+ 8. **UI/frontend** — Test before claiming success
95
+
96
+ **Key insight:** Claude Code's system prompt is not "more verbose" — it's **more precise**. Every rule is designed to reduce token waste and improve task completion rate.
97
+
98
+ ---
99
+
100
+ ## 4. Proposed Changes
101
+
102
+ ### 4.1 `base_prompt.md` — Major Rewrite
103
+
104
+ **Keep (existing good rules):**
105
+ - "USE TOOLS to create/modify files"
106
+ - "ALWAYS use `glob` tool — NEVER use shell `find`"
107
+ - "All operations default to working directory"
108
+ - TODO manager workflow (with refinements)
109
+ - Long-term memory rules
110
+
111
+ **Add (new sections):**
112
+
113
+ #### Section: Code Style
114
+
115
+ ```markdown
116
+ ## Code Style
117
+
118
+ - **Default to writing no comments.** Only add one when the WHY is non-obvious: a hidden constraint, a subtle invariant, a workaround for a specific bug, or behavior that would surprise a reader.
119
+ - Don't explain WHAT the code does — well-named identifiers already do that.
120
+ - Don't reference the current task, fix, or callers ("used by X", "added for Y flow", "handles case from issue #123"). These belong in the PR description and rot as the codebase evolves.
121
+ - Never write multi-paragraph docstrings or multi-line comment blocks — one short line max.
122
+ - Only use emojis if the user explicitly requests it. Avoid emojis in all communication unless asked.
123
+ ```
124
+
125
+ #### Section: File Modification Rules
126
+
127
+ ```markdown
128
+ ## File Modification Rules
129
+
130
+ - **ALWAYS prefer `edit` over `write`.** Use `write` only for creating entirely new files.
131
+ - When editing text from `file_reader` output, preserve the exact indentation (tabs/spaces) as it appears AFTER the line number prefix.
132
+ - Ensure `old_string` is unique in the file. If not, provide a larger string with more surrounding context.
133
+ - Use `replace_all` only when you genuinely need to change every occurrence.
134
+ ```
135
+
136
+ #### Section: Response Style
137
+
138
+ ```markdown
139
+ ## Response Style
140
+
141
+ - Keep responses short and concise. One sentence per update is almost always enough.
142
+ - When referencing specific functions or code, include `file_path:line_number`.
143
+ - Do not use a colon before tool calls (e.g., "Let me read the file:" → "Let me read the file.")
144
+ - Don't narrate your internal deliberation. User-facing text should be relevant communication, not a running commentary.
145
+ - Don't summarize what you just did at the end of every response. The user can read the diff.
146
+ - End-of-turn summary: one or two sentences. What changed and what's next. Nothing else.
147
+ ```
148
+
149
+ #### Section: Git Safety Protocol
150
+
151
+ ```markdown
152
+ ## Git Safety Protocol
153
+
154
+ - NEVER update git config (user.name, user.email, etc.)
155
+ - NEVER run destructive commands: `git push --force`, `git reset --hard`, `git checkout .`, `git clean -f`
156
+ - NEVER skip hooks (`--no-verify`, `--no-gpg-sign`)
157
+ - When staging files, prefer `git add <specific-file>` over `git add -A` or `git add .`
158
+ - Always create NEW commits rather than amending existing ones
159
+ - Never amend published commits
160
+ ```
161
+
162
+ #### Section: Error Handling Philosophy
163
+
164
+ ```markdown
165
+ ## Error Handling
166
+
167
+ - Don't add error handling, fallbacks, or validation for scenarios that can't happen. Trust internal code and framework guarantees.
168
+ - Only validate at system boundaries (user input, external APIs).
169
+ - Don't use feature flags or backwards-compatibility shims when you can just change the code.
170
+ ```
171
+
172
+ #### Section: Task Tracking Discipline
173
+
174
+ ```markdown
175
+ ## Task Tracking
176
+
177
+ - Use `todo_manager` to plan and track work on complex tasks (3+ steps).
178
+ - Exactly ONE task must be `in_progress` at any time.
179
+ - Mark tasks complete IMMEDIATELY after finishing — don't batch completions.
180
+ - Complete current tasks before starting new ones.
181
+ ```
182
+
183
+ ### 4.2 `coding/system_prompt.md` — Enhancements
184
+
185
+ **Keep:** Role definition, "read existing code before making changes"
186
+
187
+ **Add:**
188
+
189
+ ```markdown
190
+ ## Security
191
+
192
+ - Be careful not to introduce security vulnerabilities such as command injection, XSS, SQL injection, and other OWASP top 10 vulnerabilities.
193
+ - If you notice insecure code, immediately fix it.
194
+ - Prioritize writing safe, secure, and correct code.
195
+
196
+ ## Testing
197
+
198
+ - For UI or frontend changes, start the dev server and verify in a browser before reporting the task as complete.
199
+ - Type checking and test suites verify code correctness, not feature correctness — if you can't test the UI, say so explicitly rather than claiming success.
200
+
201
+ ## Code Quality
202
+
203
+ - Don't add features, refactor, or introduce abstractions beyond what the task requires.
204
+ - A bug fix doesn't need surrounding cleanup; a one-shot operation doesn't need a helper.
205
+ - Three similar lines is better than a premature abstraction.
206
+ - No half-finished implementations either.
207
+ ```
208
+
209
+ ### 4.3 `general/system_prompt.md` — Add Tool Priorities
210
+
211
+ The general agent also needs tool usage priorities and response style rules, as it handles file operations too.
212
+
213
+ ---
214
+
215
+ ## 5. Evaluation Framework
216
+
217
+ This is critical: **we must prove the changes work**. The evaluation has two dimensions:
218
+
219
+ ### 5.1 Quantitative Metrics
220
+
221
+ | Metric | Baseline (Current) | Target | Measurement Method |
222
+ |--------|-------------------|--------|-------------------|
223
+ | **Avg tokens per task** | TBD (measure on benchmark tasks) | -10% to -20% | Run identical prompts before/after, compare OpenRouter bill CSV |
224
+ | **Task completion rate** | TBD | +5% to +10% | Manual evaluation on the 5-task benchmark suite (Section 5.3) |
225
+ | **Avg tool calls per task** | TBD | -5% to -15% | Fewer unnecessary tool calls (e.g., Write→Edit optimization) |
226
+ | **Response verbosity** | TBD | -20% to -30% | Character count of assistant messages per task |
227
+
228
+ ### 5.2 Qualitative Checklist
229
+
230
+ For each benchmark task, evaluate:
231
+
232
+ - [ ] **Tool choice correctness**: Did it use Edit for small changes, Write only for new files?
233
+ - [ ] **No unnecessary comments**: Did it add explanatory comments only when WHY is non-obvious?
234
+ - [ ] **Concise responses**: Are assistant messages short and to-the-point?
235
+ - [ ] **Git safety**: Did it use `git add <file>` instead of `git add -A`?
236
+ - [ ] **No trailing summaries**: Does it avoid "In summary, I did X, Y, Z"?
237
+ - [ ] **Security awareness**: Did it catch/fix potential injection vulnerabilities?
238
+ - [ ] **Task tracking**: For complex tasks, did it use todo_manager correctly with ONE in_progress?
239
+
240
+ ### 5.3 Evaluation Tasks
241
+
242
+ We will use **5 benchmark tasks** spanning different scenarios:
243
+
244
+ 1. **Simple edit**: Rename a method across 3 files (tests Edit vs Write preference)
245
+ 2. **Feature addition**: Add a new API endpoint with tests (tests code style, error handling philosophy)
246
+ 3. **Refactoring**: Extract a helper method (tests abstraction judgment)
247
+ 4. **Bug fix**: Fix an XSS vulnerability in a template (tests security awareness)
248
+ 5. **Git workflow**: Make changes and prepare for commit (tests git safety)
249
+
250
+ ### 5.4 A/B Test Protocol
251
+
252
+ ```
253
+ For each task:
254
+ 1. Run with CURRENT system prompt (baseline)
255
+ 2. Run with NEW system prompt (treatment)
256
+ 3. Record: tokens, tool calls, completion status, qualitative score
257
+ 4. Compare metrics
258
+
259
+ Control variables:
260
+ - Same model (claude-opus-4-7)
261
+ - Same temperature (default)
262
+ - Same working directory
263
+ - Fresh session for each run
264
+ ```
265
+
266
+ ---
267
+
268
+ ## 6. Implementation Plan
269
+
270
+ ### Phase 1: Write Proposal (this document)
271
+ - [x] Analyze current system prompts
272
+ - [x] Identify gaps against Claude Code
273
+ - [x] Draft new content
274
+ - [x] Design evaluation framework
275
+
276
+ ### Phase 2: Implement Changes
277
+ - [ ] Update `base_prompt.md`
278
+ - [ ] Update `coding/system_prompt.md`
279
+ - [ ] Update `general/system_prompt.md`
280
+ - [ ] Review and refine wording
281
+ - [ ] Ensure no contradictions with existing rules
282
+
283
+ ### Phase 3: Evaluate
284
+ - [ ] Run 5 benchmark tasks with current prompt (baseline)
285
+ - [ ] Run 5 benchmark tasks with new prompt (treatment)
286
+ - [ ] Compile metrics comparison
287
+ - [ ] Document qualitative findings
288
+ - [ ] Decide: merge or iterate
289
+
290
+ ### Phase 4: Merge or Iterate
291
+ - [ ] If metrics improve: merge to main
292
+ - [ ] If metrics don't improve: analyze why, revise, re-test
293
+
294
+ ---
295
+
296
+ ## 7. Risks & Mitigation
297
+
298
+ | Risk | Likelihood | Impact | Mitigation |
299
+ |------|-----------|--------|-----------|
300
+ | **Over-constrained prompt** | Medium | High (Agent becomes rigid) | Review by multiple humans; test on diverse tasks |
301
+ | **Conflict with existing rules** | Low | Medium | Full text search for overlapping concepts before merge |
302
+ | **Non-English user confusion** | Medium | Low | Keep rules simple; test with Chinese prompts |
303
+ | **Token savings < expected** | Medium | Low | Evaluate anyway; even small savings compound |
304
+ | **Breaking change for existing users** | Low | Medium | System prompt updates transparently; no user action needed |
305
+
306
+ ---
307
+
308
+ ## 8. Success Criteria
309
+
310
+ This proposal is **approved for merge** if:
311
+
312
+ 1. At least **3 out of 5 benchmark tasks** show improved qualitative scores
313
+ 2. **Average tokens per task** decreases by ≥ 5%
314
+ 3. No **regressions** in task completion rate
315
+ 4. Code review approval from at least one maintainer
316
+
317
+ ---
318
+
319
+ ## 9. Appendix: Full Proposed `base_prompt.md`
320
+
321
+ See attached file in PR.
322
+
323
+ ---
324
+
325
+ *End of Proposal*
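The A/B protocol in Section 5.4 and the token criterion in Section 8 boil down to a before/after comparison. Below is a minimal sketch of that comparison, assuming each run is recorded as a small struct. The `Run` struct, the helper names, and all numbers are hypothetical and are not part of the benchmark runner shipped in this release.

```ruby
# Hypothetical record of one benchmark run; the figures below are invented.
Run = Struct.new(:task, :tokens, :tool_calls, :completed, keyword_init: true)

# Signed percentage change from `before` to `after`, rounded to one decimal.
def percent_change(before, after)
  return 0.0 if before.zero?
  ((after - before) / before.to_f * 100).round(1)
end

# Compare a baseline set of runs against a treatment set over the same tasks.
# With the same task count, the change in totals equals the change in averages.
def compare(baseline, treatment)
  {
    tokens_pct: percent_change(baseline.sum(&:tokens), treatment.sum(&:tokens)),
    tool_calls_pct: percent_change(baseline.sum(&:tool_calls), treatment.sum(&:tool_calls)),
    completion_regressed: treatment.count(&:completed) < baseline.count(&:completed)
  }
end

# Section 8, criteria 2 and 3: tokens drop by at least 5% with no completion regression.
def meets_token_criterion?(summary, min_drop_pct: 5.0)
  summary[:tokens_pct] <= -min_drop_pct && !summary[:completion_regressed]
end

baseline = [
  Run.new(task: "simple_edit", tokens: 10_000, tool_calls: 8, completed: true),
  Run.new(task: "bug_fix", tokens: 20_000, tool_calls: 12, completed: true)
]
treatment = [
  Run.new(task: "simple_edit", tokens: 8_000, tool_calls: 6, completed: true),
  Run.new(task: "bug_fix", tokens: 19_000, tool_calls: 12, completed: true)
]

summary = compare(baseline, treatment)
puts summary.inspect
puts meets_token_criterion?(summary)
```

The qualitative checklist in Section 5.2 still needs a human reviewer; this sketch only covers the numeric half of the decision.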
@@ -0,0 +1,89 @@
1
+ # Proposal: Memory Mechanism Optimization
2
+
3
+ **Author:** Claude (assistant)
4
+ **Date:** 2026-05-12
5
+ **Branch:** `feat/memory-optimization` (to be created)
6
+ **Status:** Proposal
7
+
8
+ ---
9
+
10
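+ ## 1. Problem
11
+
12
+ OpenClacky's memory system has three layers, but only the first two work properly.
13
+
14
+ The `~/.clacky/memories/` directory is completely empty. Long-term memory has never been written, because the trigger condition is too strict (iterations >= 10) and the subagent's whitelist check is overly conservative, concluding almost every time that no update is needed.
15
+
16
+ Even when memory has content, the agent does not know when to use it. base_prompt says "Do NOT recall proactively", but the agent cannot judge what counts as "genuinely needed". As a result, memory exists but is never used.
17
+
18
+ For comparison, Claude Code:
19
+ - Automatically loads CLAUDE.md into the system prompt
20
+ - Binds project-level memory to the current working directory
21
+ - Does not require the agent to call a tool; the relevant content is already in the prompt
22
+
23
+ What we lack is an automatic injection mechanism.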
24
+
25
+ ---
26
+
27
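+ ## 2. What to Do
28
+
29
+ Give the agent the context it needs **automatically**, rather than **passively waiting** for it to recall.
30
+
31
+ Specifically, three things:
32
+
33
+ ### 2.1 Automatic Memory Injection
34
+
35
+ When the system prompt is built, automatically select relevant files from `~/.clacky/memories/` and inject them. The agent does not need to recall proactively; memory is pushed to it.
36
+
37
+ Matching logic: simple keyword matching based on the working directory name plus keywords from the current task. Inject the 1-3 most relevant files.
38
+
39
+ Injection position: after Project rules, before SOUL.md.
40
+
41
+ ### 2.2 Project-Level Dynamic Memory
42
+
43
+ Maintain a `.clacky/CLAUDE.md` file under the working directory to record project-specific knowledge.
44
+
45
+ SystemPromptBuilder detects and loads this file automatically. MemoryUpdater updates it automatically when a task ends. Users can also edit it by hand.
46
+
47
+ The file can be kept under git version control and is loaded automatically when switching projects, so it stays closer to the actual work than `~/.clacky/memories/`.
48
+
49
+ ### 2.3 Lower the Memory Update Threshold
50
+
51
+ - Lower the iteration threshold from 10 to 5
52
+ - Simplify the whitelist check in the memory update subagent
53
+ - Add a `/remember` user command to trigger a memory save manually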
54
+
55
+ ---
56
+
57
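+ ## 3. Why
58
+
59
+ The current memory system exists in name only:
60
+ - `~/.clacky/memories/` is empty; no knowledge has accumulated
61
+ - The agent "forgets" across tasks and has to relearn user preferences and project conventions every time
62
+ - Decisions the user stated explicitly (for example, "don't use Redis") are forgotten by the next task
63
+ - Compared with Claude Code, it is missing an automatic context loading layer
64
+
65
+ Benefits of automatic injection:
66
+ - Zero additional LLM calls; it reuses existing prompt caching
67
+ - The agent does not need to learn when to recall; the relevant content is already in the prompt
68
+ - Project-level memory keeps context from mixing when switching between projects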
69
+
70
+ ---
71
+
72
+ ## 4. How
73
+
74
+ The changes are concentrated in three modules:
75
+
76
+ 1. **SystemPromptBuilder**: add a `load_relevant_memories` method that injects relevant memory content when the prompt is built
77
+ 2. **MemoryUpdater**: lower the iteration threshold, simplify the whitelist, and add logic for writing `.clacky/CLAUDE.md`
78
+ 3. **base_prompt.md**: update the memory rules (from "do not recall proactively" to "relevant memory is already injected automatically")
79
+
80
+ Files in scope:
81
+ - `lib/clacky/agent/system_prompt_builder.rb`
82
+ - `lib/clacky/agent/memory_updater.rb`
83
+ - `lib/clacky/agent/skill_manager.rb`
84
+ - `lib/clacky/agent.rb` (adds the `/remember` command)
85
+ - `lib/clacky/default_agents/base_prompt.md`
86
+
87
+ ---
88
+
89
+ *End of Proposal*
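The matching logic described in Section 2.1 of the memory proposal (working directory name plus task keywords, top 1-3 files) can be sketched as follows. This is an illustration under stated assumptions, not the proposed `load_relevant_memories` implementation: the function names, the in-memory hash standing in for `~/.clacky/memories/`, and the ASCII-only tokenizer are all hypothetical.

```ruby
require "set"

# Hypothetical tokenizer: lowercase ASCII words of three or more characters.
def keywords(text)
  text.downcase.scan(/[a-z0-9_]+/).reject { |w| w.length < 3 }.to_set
end

# memories: Hash of file name => content, a stand-in for the memories directory.
# Returns up to `limit` file names, ranked by how many keywords they share
# with the working directory name and the task description.
def select_relevant_memories(memories, working_dir:, task:, limit: 3)
  wanted = keywords(File.basename(working_dir)) | keywords(task)
  scored = memories.map do |name, content|
    overlap = (keywords(name) | keywords(content)) & wanted
    [name, overlap.size]
  end
  scored.select { |_, score| score > 0 }
        .sort_by { |name, score| [-score, name] } # best score first, then by name
        .first(limit)
        .map(&:first)
end

memories = {
  "shop-api.md"   => "Do not use Redis; prefer Postgres for caching",
  "travel.md"     => "Prefers window seats",
  "ruby-style.md" => "Use rubocop defaults in shop projects"
}

selected = select_relevant_memories(
  memories,
  working_dir: "/home/me/shop",
  task: "add caching to the orders endpoint"
)
puts selected.inspect
```

Files with no overlap are dropped entirely, so an unrelated project injects nothing. A real implementation would also need to handle non-ASCII text, which this tokenizer ignores.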
@@ -47,8 +47,14 @@ module Clacky
47
47
  # Collect token usage data for this iteration (returned to caller for deferred display)
48
48
  token_data = collect_iteration_tokens(usage, iteration_cost)
49
49
 
50
- # Update session bar cost in real-time (don't wait for agent.run to finish)
51
- @ui&.update_sessionbar(cost: @total_cost, cost_source: @cost_source)
50
+ # Update session bar cost in real-time (don't wait for agent.run to finish).
51
+ # Subagents must NOT push their own (small, restarting-from-zero) cost into the
52
+ # shared UI — that would clobber the parent's accumulated total and cause the
53
+ # session bar to "jump back to ~$0" while a subagent is running, then snap back
54
+ # to the real total once the parent merges the subagent's cost. The parent agent
55
+ # is responsible for surfacing the merged cost after fork_subagent returns
56
+ # (see SkillManager#execute_skill_with_subagent and MemoryUpdater).
57
+ @ui&.update_sessionbar(cost: @total_cost, cost_source: @cost_source) unless @is_subagent
52
58
 
53
59
  # Track cache usage statistics (global)
54
60
  @cache_stats[:total_requests] += 1
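The guard added in this hunk is easier to follow with a toy model. Here is a minimal sketch, assuming a UI object that only remembers the last cost pushed to it. `FakeUI`, `CostTrackingAgent`, and `merge_subagent_cost` are hypothetical names for illustration and do not mirror the real Clacky classes.

```ruby
# Hypothetical UI stand-in that remembers the last cost shown in the session bar.
class FakeUI
  attr_reader :shown_cost

  def update_sessionbar(cost:, cost_source: nil)
    @shown_cost = cost
  end
end

# Hypothetical agent that tracks its own cost and shares one UI with its subagents.
class CostTrackingAgent
  attr_reader :total_cost

  def initialize(ui:, is_subagent: false)
    @ui = ui
    @is_subagent = is_subagent
    @total_cost = 0.0
  end

  # Subagents accumulate cost but never push it to the shared UI.
  def record(iteration_cost)
    @total_cost += iteration_cost
    @ui&.update_sessionbar(cost: @total_cost, cost_source: :api) unless @is_subagent
  end

  # The parent surfaces the merged total once the subagent has finished.
  def merge_subagent_cost(subagent)
    @total_cost += subagent.total_cost
    @ui&.update_sessionbar(cost: @total_cost, cost_source: :api)
  end
end

ui = FakeUI.new
parent = CostTrackingAgent.new(ui: ui)
parent.record(1.5)

sub = CostTrackingAgent.new(ui: ui, is_subagent: true)
sub.record(0.25)
cost_while_subagent_runs = ui.shown_cost # still the parent's total

parent.merge_subagent_cost(sub)
cost_after_merge = ui.shown_cost
```

Without the `unless @is_subagent` guard, the second `record` call would overwrite the bar with the subagent's own small total, which is the "jump back to ~$0" behaviour the comment describes.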
@@ -79,6 +79,14 @@ module Clacky
79
79
  # the error is something else and we let it propagate.
80
80
  force_reasoning_content_pad = false
81
81
  thinking_retry_attempted = false
82
+ # One-shot flag for context-overflow recovery. When the server complains
83
+ # the input exceeds the model's context window, we run a forced
84
+ # compression with pull_back_from_tail: 1 (preserves the model's
85
+ # two-checkpoint prompt cache) and retry the original request once.
86
+ # We retry at most once — if still overflowing afterward, the issue is
87
+ # something else (e.g. tool schemas alone exceed the window) and we let
88
+ # the error propagate.
89
+ context_overflow_retry_attempted = false
82
90
 
83
91
  begin
84
92
  begin
@@ -220,6 +228,55 @@ module Clacky
220
228
  end
221
229
 
222
230
  rescue Clacky::BadRequestError => e
231
+ # One-shot recovery for "context too long" errors. The model's
232
+ # context window is exceeded by the current history+tools+system
233
+ # prompt. We run a forced compression with pull_back_from_tail: 1
234
+ # (preserves the two-checkpoint prompt cache so the compression
235
+ # call itself still hits cache#A on the second-to-last position),
236
+ # then retry the original request once.
237
+ if !context_overflow_retry_attempted &&
238
+ !@compressing_for_overflow &&
239
+ context_too_long_error?(e) &&
240
+ respond_to?(:compress_messages_if_needed, true)
241
+ context_overflow_retry_attempted = true
242
+ Clacky::Logger.info(
243
+ "[context-overflow] caught BadRequestError, attempting forced compression with pull-back",
244
+ error_message: e.message[0, 200],
245
+ history_size: @history.size,
246
+ previous_total_tokens: @previous_total_tokens
247
+ )
248
+ # Layer 1: standard cache-preserving compression (pull_back: 1).
249
+ # Handles 99% of real overflow cases (newest message tipped the
250
+ # request just past the window).
251
+ if perform_context_overflow_compression(mode: :standard)
252
+ retry
253
+ end
254
+
255
+ # Layer 2: aggressive fallback. The Layer 1 compression call
256
+ # itself overflowed — happens when a single newly-appended
257
+ # message is enormous (huge tool_result, pasted file, etc.) so
258
+ # popping just K=1 didn't bring the request below the window.
259
+ # Pop ~half the history this time; sacrifices prompt cache to
260
+ # guarantee the compression call fits.
261
+ Clacky::Logger.warn(
262
+ "[context-overflow] standard compression failed, escalating to aggressive mode"
263
+ )
264
+ if perform_context_overflow_compression(mode: :aggressive)
265
+ retry
266
+ end
267
+
268
+ # Both layers exhausted. Let the original error propagate so the
269
+ # user sees the underlying provider message. This should be
270
+ # extremely rare — would require both halves of the history to
271
+ # individually exceed the window, which is essentially impossible
272
+ # under the "previous turn succeeded" invariant.
273
+ Clacky::Logger.error(
274
+ "[context-overflow] both standard and aggressive compression failed; " \
275
+ "propagating original error"
276
+ )
277
+ raise
278
+ end
279
+
223
280
  # One-shot recovery for thinking-mode providers (DeepSeek V4, Kimi K2)
224
281
  # that require every assistant message in the history to carry a
225
282
  # reasoning_content field. The history-evidence heuristic in
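The control flow of this rescue branch (a one-shot flag, two escalating layers, then re-raise) can be sketched in isolation. This is a simplified model, assuming a request "fits" when the history has at most `window` entries. `ContextTooLong`, `send_request`, and `compress!` are hypothetical stand-ins; unlike the real code, `compress!` here reports whether the shrunken history fits rather than whether a compression LLM call succeeded.

```ruby
class ContextTooLong < StandardError; end

# Hypothetical request: fails when the history is larger than the window.
def send_request(history, window)
  raise ContextTooLong, "prompt is too long" if history.size > window
  "ok (#{history.size} messages)"
end

# Toy "compression": keep only the newest `keep` entries and report whether
# the result now fits in the window.
def compress!(history, keep:, window:)
  history.replace(history.last(keep))
  history.size <= window
end

def call_with_overflow_recovery(history, window:, log: [])
  # Declared outside `begin` so that `retry` does not reset it.
  overflow_retry_attempted = false
  begin
    send_request(history, window)
  rescue ContextTooLong
    raise if overflow_retry_attempted
    overflow_retry_attempted = true

    log << :standard # Layer 1: give up as little history as possible
    retry if compress!(history, keep: history.size - 1, window: window)

    log << :aggressive # Layer 2: drop to roughly half
    retry if compress!(history, keep: history.size / 2, window: window)

    raise # both layers exhausted: propagate the original error
  end
end

log = []
result = call_with_overflow_recovery((1..10).to_a, window: 9, log: log)
puts result
puts log.inspect
```

Because the flag is set before either layer runs, a request that still overflows after a successful compression is not compressed again; the second failure goes straight to `raise`.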
@@ -342,6 +399,101 @@ module Clacky
342
399
  )
343
400
  end
344
401
 
402
+ # Run a forced compression to recover from a context-overflow error.
403
+ # Called by the BadRequestError rescue when context_too_long_error?
404
+ # returns true.
405
+ #
406
+ # Two-layer defence:
407
+ # ────────────────────────────────────────────────────────────────────
408
+ # Layer 1 (mode: :standard, default) — preserves prompt cache.
409
+ # Pop K=1 message from @history tail, then run compression. This
410
+ # frees just enough token budget for the compression LLM call
411
+ # itself to fit, while preserving the model's two-checkpoint prompt
412
+ # cache (cache#A at second-to-last position is still hit). The
413
+ # popped message is reattached to the rebuilt history's tail by
414
+ # handle_compression_response, so recent task progress is not lost.
415
+ # Handles 99% of real-world cases where overflow is caused by the
416
+ # newest message pushing total just past the window.
417
+ #
418
+ # Layer 2 (mode: :aggressive) — sacrifices prompt cache to survive.
419
+ # Pop ~half the history (capped) from the tail. This dramatically
420
+ # shrinks the compression call's input regardless of how big any
421
+ # single message is. Used as a fallback when Layer 1 itself raises
422
+ # context_too_long — i.e. a single newly-appended message is so
423
+ # large (e.g. >50K-token tool_result, pasted huge file) that even
424
+ # removing it didn't bring the request under the window, OR the
425
+ # popped message was small but earlier history grew past the limit.
426
+ # Pulled-back messages are still reattached after compression so no
427
+ # user content is silently dropped.
428
+ #
429
+ # @param mode [Symbol] :standard or :aggressive
430
+ # @return [Boolean] true if compression succeeded (caller should retry
431
+ # the original request), false if compression was unable to run
432
+ # (compression disabled, history too short, etc.) or itself failed
433
+ # — caller decides whether to escalate to the next layer or
434
+ # propagate the original error.
435
+ private def perform_context_overflow_compression(mode: :standard)
436
+ return false unless respond_to?(:compress_messages_if_needed, true)
437
+
438
+ # Compute pull-back count.
439
+ # Standard: K=1 (cache-preserving).
440
+ # Aggressive: pop ~half the history, but never less than 4 and never
441
+ # more than (history_size - 2) so we always keep system + at least
442
+ # one recent message. Capped at 64 to bound the worst case (an
443
+ # enormous history that should never realistically occur).
444
+ pull_back =
445
+ if mode == :aggressive
446
+ half = @history.size / 2
447
+ [[half, 4].max, [@history.size - 2, 64].min].min
448
+ else
449
+ 1
450
+ end
451
+
452
+ @compressing_for_overflow = true
453
+ compression_context = nil
454
+
455
+ begin
456
+ compression_context = compress_messages_if_needed(
457
+ force: true,
458
+ pull_back_from_tail: pull_back
459
+ )
460
+ return false if compression_context.nil?
461
+
462
+ compression_message = compression_context[:compression_message]
463
+ @history.append(compression_message)
464
+
465
+ response = call_llm # recursive — guarded by @compressing_for_overflow
466
+ handle_compression_response(response, compression_context)
467
+ Clacky::Logger.info(
468
+ "[context-overflow] compression succeeded",
469
+ mode: mode,
470
+ pull_back: pull_back
471
+ )
472
+ true
473
+ rescue => e
474
+ # Compression failed mid-flight. Restore @history to a sensible state:
475
+ # roll back the compression instruction we appended, and re-append the
476
+ # pulled-back messages so the user's recent work isn't silently lost.
477
+ if compression_context
478
+ cm = compression_context[:compression_message]
479
+ @history.rollback_before(cm) if cm
480
+ (compression_context[:pulled_back_messages] || []).each do |m|
481
+ @history.append(m)
482
+ end
483
+ end
484
+ Clacky::Logger.warn(
485
+ "[context-overflow] compression failed during overflow recovery",
486
+ mode: mode,
487
+ pull_back: pull_back,
488
+ error_class: e.class.name,
489
+ error_message: e.message[0, 200]
490
+ )
491
+ false
492
+ ensure
493
+ @compressing_for_overflow = false
494
+ end
495
+ end
496
+
345
497
  # True when a 400 BadRequestError is specifically about a missing
346
498
  # reasoning_content field in thinking mode (DeepSeek V4, Kimi K2 thinking).
347
499
  # We require TWO distinct substrings to avoid false positives — a generic
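To make the pull-back arithmetic in `perform_context_overflow_compression` concrete, here is the same expression extracted into a standalone function with a few worked values. The function name `pull_back_count` is hypothetical; the formula is copied from the hunk above.

```ruby
# Same formula as in the hunk above, lifted out so it can be run on its own.
def pull_back_count(history_size, mode)
  return 1 unless mode == :aggressive

  half = history_size / 2
  [[half, 4].max, [history_size - 2, 64].min].min
end

puts pull_back_count(40, :standard)    # 1: standard mode always pops one message
puts pull_back_count(40, :aggressive)  # 20: half of the history
puts pull_back_count(6, :aggressive)   # 4: the floor of 4 applies
puts pull_back_count(5, :aggressive)   # 3: the (size - 2) ceiling beats the floor
puts pull_back_count(200, :aggressive) # 64: the hard cap applies
```

Note that the ceiling takes precedence over the floor, so for very short histories the result drops below 4. For a history of two or fewer messages the expression yields zero or a negative number; how `compress_messages_if_needed` treats that case is not visible in this diff.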
@@ -358,6 +510,72 @@ module Clacky
358
510
  msg.include?("must be provided"))
359
511
  end
360
512
 
513
+ # True when a 400 BadRequestError indicates the request exceeded the
514
+ # model's context window (i.e. the conversation history is too long).
515
+ #
516
+ # We deliberately favour broad detection over narrow precision:
517
+ # - False positive cost: one extra (no-op) compression cycle.
518
+ # - False negative cost: user is stuck — every retry hits the same wall.
519
+ # So the matcher is intentionally permissive.
520
+ #
521
+ # Coverage (verified against real production error strings):
522
+ #
523
+ # OpenAI:
524
+ # "This model's maximum context length is 128000 tokens. However
525
+ # you requested ... Please reduce the length of the messages."
526
+ # error.code == "context_length_exceeded"
527
+ #
528
+ # Anthropic:
529
+ # "prompt is too long: 218849 tokens > 200000 maximum"
530
+ #
531
+ # Qwen / Alibaba (DashScope):
532
+ # "You passed 117345 input tokens and requested 8192 output tokens.
533
+ # However the model's context length is only 125536 tokens, resulting
534
+ # in a maximum input length of 117344 tokens. Please reduce the length
535
+ # of the input prompt. (parameter=input_tokens, value=117345)"
536
+ #
537
+ # Qwen / Alibaba (DashScope) — newer/terser format (qwen3.6 series):
538
+ # "InternalError.Algo.InvalidParameter: Range of input length should be [1, 229376]"
539
+ #
540
+ # DeepSeek / Kimi / MiniMax / most OpenAI-compatible relays:
541
+ # Variants of OpenAI-style "context length" / "tokens exceeds" wording.
542
+ #
543
+ # Generic gateways (Portkey, OpenRouter):
544
+ # "The total number of tokens exceeds the model's maximum context length"
545
+ private def context_too_long_error?(err)
546
+ return false unless err.is_a?(Clacky::BadRequestError)
547
+
548
+ msg = err.message.to_s.downcase
549
+
550
+ # Strong phrases — any one of these is conclusive on its own.
551
+ # Each phrase is two-or-more semantic words to avoid single-word noise.
552
+ strong_phrases = [
553
+ "context length", # OpenAI / Qwen / many compat APIs
554
+ "context_length_exceeded", # OpenAI error.code
555
+ "maximum context", # OpenAI variant
556
+ "maximum input length", # Qwen
557
+ "prompt is too long", # Anthropic
558
+ "input is too long", # Anthropic-compat relays
559
+ "exceeds the maximum context", # Portkey & generic gateways
560
+ "exceeds the model's context", # Generic
561
+ "exceeds the model's maximum", # Generic
562
+ "reduce the length of the input", # Qwen action hint
563
+ "reduce the length of the messages", # OpenAI action hint
564
+ "reduce the length of your", # Generic action hint
565
+ "reduce the length of the prompt", # Generic action hint
566
+ "range of input length" # Qwen DashScope qwen3.6+ terse format
567
+ ]
568
+ return true if strong_phrases.any? { |p| msg.include?(p) }
569
+
570
+ # Pattern 1: Anthropic-style "<N> tokens > <N> maximum"
571
+ return true if msg =~ /\d+\s*tokens?\s*>\s*\d+/
572
+
573
+ # Pattern 2: Qwen-style structured field "parameter=input_tokens"
574
+ return true if msg.include?("parameter=input_tokens")
575
+
576
+ false
577
+ end
578
+
361
579
  # Detect upstream tool-call truncation and raise UpstreamTruncatedError
362
580
  # so the standard RetryableError rescue (with fallback model support)
363
581
  # handles retry identically to 5xx/429.
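The permissive matching strategy in `context_too_long_error?` can be tried out with a condensed standalone version. This sketch assumes a plain string instead of a `Clacky::BadRequestError`, uses a hypothetical name (`context_too_long_message?`), and carries only a subset of the phrase list. The sample messages are the ones quoted in the method's own comment.

```ruby
# Subset of the strong phrases from the hunk above.
STRONG_PHRASES = [
  "context length",
  "context_length_exceeded",
  "maximum context",
  "prompt is too long",
  "reduce the length of your",
  "range of input length"
].freeze

# Hypothetical string-based variant of the matcher.
def context_too_long_message?(message)
  msg = message.to_s.downcase
  return true if STRONG_PHRASES.any? { |phrase| msg.include?(phrase) }
  return true if msg.match?(/\d+\s*tokens?\s*>\s*\d+/) # "<N> tokens > <N>"
  return true if msg.include?("parameter=input_tokens")

  false
end

samples = [
  "This model's maximum context length is 128000 tokens.",
  "prompt is too long: 218849 tokens > 200000 maximum",
  "InternalError.Algo.InvalidParameter: Range of input length should be [1, 229376]",
  "218849 tokens > 200000",
  "Invalid API key",
  "Please reduce the length of your username"
]

samples.each do |text|
  puts "#{context_too_long_message?(text)}  #{text}"
end
```

The last sample is a deliberate false positive: it has nothing to do with context windows but still matches a strong phrase. That is the trade-off the source comment accepts, since a false positive costs one unnecessary compression cycle while a false negative leaves the user stuck.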