RubyGems: openclacky 1.0.2 → 1.0.4

Files changed (47)

checksums.yaml +4 -4
data/CHANGELOG.md +31 -0
data/benchmark/fixtures/sample_project/Gemfile +3 -0
data/benchmark/fixtures/sample_project/lib/api_handler.rb +32 -0
data/benchmark/fixtures/sample_project/lib/order_calculator.rb +23 -0
data/benchmark/fixtures/sample_project/lib/user_renderer.rb +20 -0
data/benchmark/fixtures/sample_project/spec/order_calculator_spec.rb +20 -0
data/benchmark/results/EVALUATION_REPORT.md +165 -0
data/benchmark/results/baseline_20260511_174424.json +128 -0
data/benchmark/results/report_20260511_175256.json +271 -0
data/benchmark/results/report_20260511_175444.json +271 -0
data/benchmark/results/treatment_20260511_175103.json +130 -0
data/benchmark/runner.rb +441 -0
data/docs/proposals/2026-05-11-system-prompt-alignment.md +325 -0
data/docs/proposals/2026-05-12-memory-mechanism-optimization.md +89 -0
data/lib/clacky/agent/cost_tracker.rb +8 -2
data/lib/clacky/agent/llm_caller.rb +218 -0
data/lib/clacky/agent/memory_updater.rb +41 -30
data/lib/clacky/agent/message_compressor.rb +15 -4
data/lib/clacky/agent/message_compressor_helper.rb +41 -2
data/lib/clacky/agent/skill_manager.rb +5 -2
data/lib/clacky/agent/skill_reflector.rb +10 -1
data/lib/clacky/agent/tool_registry.rb +109 -0
data/lib/clacky/agent.rb +20 -0
data/lib/clacky/agent_config.rb +17 -0
data/lib/clacky/cli.rb +65 -0
data/lib/clacky/client.rb +15 -0
data/lib/clacky/default_agents/base_prompt.md +20 -20
data/lib/clacky/default_agents/coding/system_prompt.md +51 -1
data/lib/clacky/default_skills/channel-setup/SKILL.md +113 -5
data/lib/clacky/default_skills/channel-setup/import_lark_skills.rb +97 -0
data/lib/clacky/default_skills/onboard/SKILL.md +1 -1
data/lib/clacky/default_skills/persist-memory/SKILL.md +59 -0
data/lib/clacky/providers.rb +48 -6
data/lib/clacky/server/channel/adapters/weixin/adapter.rb +7 -0
data/lib/clacky/server/channel/channel_manager.rb +91 -0
data/lib/clacky/server/discover.rb +77 -0
data/lib/clacky/server/epipe_safe_io.rb +105 -0
data/lib/clacky/server/http_server.rb +121 -41
data/lib/clacky/server/server_master.rb +6 -0
data/lib/clacky/skill.rb +30 -0
data/lib/clacky/utils/file_processor.rb +71 -0
data/lib/clacky/version.rb +1 -1
data/lib/clacky/web/app.css +58 -22
data/lib/clacky/web/i18n.js +4 -2
data/lib/clacky/web/sessions.js +29 -17
metadata +33 -2

data/docs/proposals/2026-05-11-system-prompt-alignment.md ADDED

@@ -0,0 +1,325 @@
+# Proposal: System Prompt Alignment with Claude Code
+**Author:** Claude (assistant)
+**Date:** 2026-05-11
+**Branch:** `feat/system-prompt-alignment`
+**Status:** Proposal
+**Scope:** `lib/clacky/default_agents/base_prompt.md`, `lib/clacky/default_agents/coding/system_prompt.md`
+---
+## 1. Background & Motivation
+OpenClacky's positioning is **"the most token-efficient open-source AI Agent, with capabilities aligned to Claude Code"**. While the Harness layer (cache, compression, tool registry) achieves parity or better on cost metrics, the **system prompt layer** remains a significant gap.
+The system prompt is the behavioral contract between the Agent and the LLM. A weak system prompt causes:
+- **Suboptimal tool selection** (e.g., using `Write` for a 2-line change instead of `Edit`)
+- **Token waste** (verbose explanations, unnecessary comments, redundant narration)
+- **Safety issues** (destructive git operations, overly broad file staging)
+- **Lower task completion rate** on complex multi-step tasks
+This proposal targets the system prompt as a **high-leverage, low-risk improvement** that directly impacts both cost (fewer tokens per task) and capability (higher task completion rate).
+---
+## 2. Current State Analysis
+### 2.1 `base_prompt.md` (Universal behavioral rules)
+```
+Lines: 36
+Coverage: General behavior, Tool usage rules, TODO manager rules, Long-term memory
+```
+**What it does well:**
+- TODO manager workflow is explicit and actionable
+- "USE TOOLS to create/modify files" is correctly emphasized
+- "glob > find" rule is present
+**Critical gaps:**
+| Gap | Impact | Evidence |
+|-----|--------|----------|
+| No `Edit > Write` priority rule | Agent rewrites entire files for small changes, wasting tokens | Common user complaint in complex refactoring tasks |
+| No comment/response style rules | Verbose responses, unnecessary explanations, emoji usage | Inflates token count on every turn |
+| No Git safety protocol | `git add -A`, `git commit --amend`, force push risks | Potential data loss, security issues |
+| No code style guidelines | Multi-line docstrings, "added for X flow" comments | Code quality degradation over time |
+| No error handling philosophy | Validates impossible scenarios, overly defensive code | Unnecessary complexity, more tokens |
+| No response structure rules | "Let me..." prefixes, trailing summaries, diff narration | Poor UX, token waste |
+| No task tracking discipline | Multiple in-progress tasks, missing TodoWrite updates | Task state confusion |
+### 2.2 `coding/system_prompt.md` (Role definition)
+```
+Lines: 18
+Coverage: Role description, working process
+```
+**What it does well:**
+- Clear role definition ("AI coding assistant and technical co-founder")
+- "Read existing code before making changes" is correct
+**Critical gaps:**
+| Gap | Impact |
+|-----|--------|
+| No explicit "Claude Code alignment" goal | Agent doesn't know it's competing with Claude Code on behavior |
+| No file modification priorities | Same as base_prompt gap |
+| No security awareness | Agent unaware of OWASP risks, injection vulnerabilities |
+| No testing expectation | Agent often skips running tests after changes |
+| No UI/frontend-specific rules | For fullstack tasks, lacks guidance on testing UI changes |
+### 2.3 `general/system_prompt.md` (Non-coding agent)
+```
+Lines: 17
+Coverage: General digital employee role
+```
+**Gaps:** Similar to coding agent but for general tasks — lacks tool usage priorities, response style, and safety guidelines.
+---
+## 3. Target State (Claude Code Reference)
+Claude Code's system prompt is approximately **800-1200 lines** of dense behavioral rules, covering:
+1. **Doing tasks** — How to interpret instructions, when to ask questions
+2. **Code style** — Comment rules, naming, error handling, no emoji
+3. **Tool usage** — Priorities, fallbacks, when to use which tool
+4. **Git safety** — Explicit do's and don'ts
+5. **Response style** — Conciseness rules, formatting, no trailing summaries
+6. **Task tracking** — TodoWrite discipline, ONE in_progress rule
+7. **Security** — XSS, SQL injection, command injection prevention
+8. **UI/frontend** — Test before claiming success
+**Key insight:** Claude Code's system prompt is not "more verbose" — it's **more precise**. Every rule is designed to reduce token waste and improve task completion rate.
+---
+## 4. Proposed Changes
+### 4.1 `base_prompt.md` — Major Rewrite
+**Keep (existing good rules):**
+- "USE TOOLS to create/modify files"
+- "ALWAYS use `glob` tool — NEVER use shell `find`"
+- "All operations default to working directory"
+- TODO manager workflow (with refinements)
+- Long-term memory rules
+**Add (new sections):**
+#### Section: Code Style
+```markdown
+## Code Style
+- **Default to writing no comments.** Only add one when the WHY is non-obvious: a hidden constraint, a subtle invariant, a workaround for a specific bug, or behavior that would surprise a reader.
+- Don't explain WHAT the code does — well-named identifiers already do that.
+- Don't reference the current task, fix, or callers ("used by X", "added for Y flow", "handles case from issue #123"). These belong in the PR description and rot as the codebase evolves.
+- Never write multi-paragraph docstrings or multi-line comment blocks — one short line max.
+- Only use emojis if the user explicitly requests it. Avoid emojis in all communication unless asked.
+```
+#### Section: File Modification Rules
+```markdown
+## File Modification Rules
+- **ALWAYS prefer `edit` over `write`.** Use `write` only for creating entirely new files.
+- When editing text from `file_reader` output, preserve the exact indentation (tabs/spaces) as it appears AFTER the line number prefix.
+- Ensure `old_string` is unique in the file. If not, provide a larger string with more surrounding context.
+- Use `replace_all` only when you genuinely need to change every occurrence.
+```
+#### Section: Response Style
+```markdown
+## Response Style
+- Keep responses short and concise. One sentence per update is almost always enough.
+- When referencing specific functions or code, include `file_path:line_number`.
+- Do not use a colon before tool calls (e.g., "Let me read the file:" → "Let me read the file.")
+- Don't narrate your internal deliberation. User-facing text should be relevant communication, not a running commentary.
+- Don't summarize what you just did at the end of every response. The user can read the diff.
+- End-of-turn summary: one or two sentences. What changed and what's next. Nothing else.
+```
+#### Section: Git Safety Protocol
+```markdown
+## Git Safety Protocol
+- NEVER update git config (user.name, user.email, etc.)
+- NEVER run destructive commands: `git push --force`, `git reset --hard`, `git checkout .`, `git clean -f`
+- NEVER skip hooks (`--no-verify`, `--no-gpg-sign`)
+- When staging files, prefer `git add <specific-file>` over `git add -A` or `git add .`
+- Always create NEW commits rather than amending existing ones
+- Never amend published commits
+```
+#### Section: Error Handling Philosophy
+```markdown
+## Error Handling
+- Don't add error handling, fallbacks, or validation for scenarios that can't happen. Trust internal code and framework guarantees.
+- Only validate at system boundaries (user input, external APIs).
+- Don't use feature flags or backwards-compatibility shims when you can just change the code.
+```
+#### Section: Task Tracking Discipline
+```markdown
+## Task Tracking
+- Use `todo_manager` to plan and track work on complex tasks (3+ steps).
+- Exactly ONE task must be `in_progress` at any time.
+- Mark tasks complete IMMEDIATELY after finishing — don't batch completions.
+- Complete current tasks before starting new ones.
+```
+### 4.2 `coding/system_prompt.md` — Enhancements
+**Keep:** Role definition, "read existing code before making changes"
+**Add:**
+```markdown
+## Security
+- Be careful not to introduce security vulnerabilities such as command injection, XSS, SQL injection, and other OWASP top 10 vulnerabilities.
+- If you notice insecure code, immediately fix it.
+- Prioritize writing safe, secure, and correct code.
+## Testing
+- For UI or frontend changes, start the dev server and verify in a browser before reporting the task as complete.
+- Type checking and test suites verify code correctness, not feature correctness — if you can't test the UI, say so explicitly rather than claiming success.
+## Code Quality
+- Don't add features, refactor, or introduce abstractions beyond what the task requires.
+- A bug fix doesn't need surrounding cleanup; a one-shot operation doesn't need a helper.
+- Three similar lines is better than a premature abstraction.
+- No half-finished implementations either.
+```
+### 4.3 `general/system_prompt.md` — Add Tool Priorities
+The general agent also needs tool usage priorities and response style rules, as it handles file operations too.
+---
+## 5. Evaluation Framework
+This is critical: **we must prove the changes work**. The evaluation has two dimensions:
+### 5.1 Quantitative Metrics
+| Metric | Baseline (Current) | Target | Measurement Method |
+|--------|-------------------|--------|-------------------|
+| **Avg tokens per task** | TBD (measure on benchmark tasks) | -10% to -20% | Run identical prompts before/after, compare OpenRouter bill CSV |
+| **Task completion rate** | TBD | +5% to +10% | Manual evaluation on the 5-task benchmark suite (see 5.3) |
+| **Avg tool calls per task** | TBD | -5% to -15% | Fewer unnecessary tool calls (e.g., Write→Edit optimization) |
+| **Response verbosity** | TBD | -20% to -30% | Character count of assistant messages per task |
+### 5.2 Qualitative Checklist
+For each benchmark task, evaluate:
+- [ ] **Tool choice correctness**: Did it use Edit for small changes, Write only for new files?
+- [ ] **No unnecessary comments**: Did it add explanatory comments only when WHY is non-obvious?
+- [ ] **Concise responses**: Are assistant messages short and to-the-point?
+- [ ] **Git safety**: Did it use `git add <file>` instead of `git add -A`?
+- [ ] **No trailing summaries**: Does it avoid "In summary, I did X, Y, Z"?
+- [ ] **Security awareness**: Did it catch/fix potential injection vulnerabilities?
+- [ ] **Task tracking**: For complex tasks, did it use todo_manager correctly with ONE in_progress?
+### 5.3 Evaluation Tasks
+We will use **5 benchmark tasks** spanning different scenarios:
+1. **Simple edit**: Rename a method across 3 files (tests Edit vs Write preference)
+2. **Feature addition**: Add a new API endpoint with tests (tests code style, error handling philosophy)
+3. **Refactoring**: Extract a helper method (tests abstraction judgment)
+4. **Bug fix**: Fix an XSS vulnerability in a template (tests security awareness)
+5. **Git workflow**: Make changes and prepare for commit (tests git safety)
+### 5.4 A/B Test Protocol
+```
+For each task:
+  1. Run with CURRENT system prompt (baseline)
+  2. Run with NEW system prompt (treatment)
+  3. Record: tokens, tool calls, completion status, qualitative score
+  4. Compare metrics
+Control variables:
+  - Same model (claude-opus-4-7)
+  - Same temperature (default)
+  - Same working directory
+  - Fresh session for each run
+```
+---
+## 6. Implementation Plan
+### Phase 1: Write Proposal (this document)
+- [x] Analyze current system prompts
+- [x] Identify gaps against Claude Code
+- [x] Draft new content
+- [x] Design evaluation framework
+### Phase 2: Implement Changes
+- [ ] Update `base_prompt.md`
+- [ ] Update `coding/system_prompt.md`
+- [ ] Update `general/system_prompt.md`
+- [ ] Review and refine wording
+- [ ] Ensure no contradictions with existing rules
+### Phase 3: Evaluate
+- [ ] Run 5 benchmark tasks with current prompt (baseline)
+- [ ] Run 5 benchmark tasks with new prompt (treatment)
+- [ ] Compile metrics comparison
+- [ ] Document qualitative findings
+- [ ] Decide: merge or iterate
+### Phase 4: Merge or Iterate
+- [ ] If metrics improve: merge to main
+- [ ] If metrics don't improve: analyze why, revise, re-test
+---
+## 7. Risks & Mitigation
+| Risk | Likelihood | Impact | Mitigation |
+|------|-----------|--------|-----------|
+| **Over-constrained prompt** | Medium | High (Agent becomes rigid) | Review by multiple humans; test on diverse tasks |
+| **Conflict with existing rules** | Low | Medium | Full text search for overlapping concepts before merge |
+| **Non-English user confusion** | Medium | Low | Keep rules simple; test with Chinese prompts |
+| **Token savings < expected** | Medium | Low | Evaluate anyway; even small savings compound |
+| **Breaking change for existing users** | Low | Medium | System prompt updates transparently; no user action needed |
+---
+## 8. Success Criteria
+The changes in this proposal are **approved for merge** if:
+1. At least **3 out of 5 benchmark tasks** show improved qualitative scores
+2. **Average tokens per task** decreases by ≥ 5%
+3. No **regressions** in task completion rate
+4. Code review approval from at least one maintainer
+---
+## 9. Appendix: Full Proposed `base_prompt.md`
+See attached file in PR.
+---
+*End of Proposal*
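The success criteria in section 8 of the proposal above can be checked mechanically once the A/B runs from section 5.4 are recorded. The following is a minimal sketch, not code from the gem: the `RunResult` struct, its field names, and the `success?` helper are all hypothetical, and the maintainer-approval criterion is left out because it cannot be computed.

```ruby
# Hypothetical per-task record; field names are illustrative only.
RunResult = Struct.new(:task, :tokens, :completed, :quality_score, keyword_init: true)

# Percent change from baseline to treatment (negative means a reduction).
def percent_change(baseline, treatment)
  ((treatment - baseline) / baseline.to_f * 100).round(1)
end

def average(values)
  values.sum.to_f / values.size
end

# Applies criteria 1-3 from section 8. Both arrays must list the same
# tasks in the same order, one baseline run and one treatment run each.
def success?(baseline_runs, treatment_runs)
  pairs = baseline_runs.zip(treatment_runs)

  # Criterion 1: at least 3 of 5 tasks show an improved qualitative score.
  improved = pairs.count { |b, t| t.quality_score > b.quality_score }

  # Criterion 2: average tokens per task decreases by at least 5%.
  token_delta = percent_change(
    average(baseline_runs.map(&:tokens)),
    average(treatment_runs.map(&:tokens))
  )

  # Criterion 3: no task that completed under baseline fails under treatment.
  regressed = pairs.any? { |b, t| b.completed && !t.completed }

  improved >= 3 && token_delta <= -5.0 && !regressed
end

# Made-up sample numbers, for illustration only.
tasks = %w[simple_edit feature refactor bug_fix git_workflow]
baseline = tasks.zip([1000, 2000, 1500, 1200, 1300]).map do |task, tokens|
  RunResult.new(task: task, tokens: tokens, completed: true, quality_score: 3)
end
treatment = tasks.zip([900, 1800, 1400, 1100, 1200], [4, 4, 4, 3, 3]).map do |task, tokens, score|
  RunResult.new(task: task, tokens: tokens, completed: true, quality_score: score)
end

puts success?(baseline, treatment)
```

Pairing runs by task rather than comparing only averages makes a completion regression on a single task visible even when the totals improve.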

data/docs/proposals/2026-05-12-memory-mechanism-optimization.md ADDED

@@ -0,0 +1,89 @@
+# Proposal: Memory Mechanism Optimization
+**Author:** Claude (assistant)
+**Date:** 2026-05-12
+**Branch:** `feat/memory-optimization` (to be created)
+**Status:** Proposal
+---
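+## 1. Problem
+OpenClacky's memory system has three layers, but only the first two work properly.
+The `~/.clacky/memories/` directory is completely empty. Long-term memory has never been written, because the trigger condition is too strict (iterations >= 10) and the subagent's whitelist check is overly conservative, concluding "no update needed" almost every time.
+Even when memory has content, the agent does not know when to use it. base_prompt says "Do NOT recall proactively", but the agent cannot judge what counts as "genuinely needed". As a result, memory exists but is never used.
+Compare Claude Code's approach:
+- CLAUDE.md is loaded into the system prompt automatically
+- Project-level memory is bound to the current working directory
+- The agent does not need to call a tool proactively; the relevant content is already in the prompt
+What we lack is an "automatic injection" mechanism.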
+---
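+## 2. What to Do
+Give the agent the context it needs **automatically**, rather than **passively waiting** for it to recall.
+Specifically, three things:
+### 2.1 Automatic Memory Injection
+When the system prompt is built, automatically select relevant files from `~/.clacky/memories/` and inject them. The agent does not need to recall proactively; memory is "pushed" to it.
+Matching logic: simple keyword matching based on the working directory name plus keywords from the current task. Inject the 1-3 most relevant files.
+Injection position: after Project rules, before SOUL.md.
+### 2.2 Project-Level Dynamic Memory
+Maintain a `.clacky/CLAUDE.md` file under the working directory to record project-specific knowledge.
+SystemPromptBuilder detects and loads this file automatically. MemoryUpdater updates it automatically when a task ends. Users can also edit it manually.
+This file can be version-controlled with git and is loaded automatically when switching projects, so it stays closer to the actual work than `~/.clacky/memories/`.
+### 2.3 Lower the Memory Update Threshold
+- Lower the iteration threshold from 10 to 5
+- Simplify the whitelist check in the memory update subagent
+- Add a `/remember` user command to trigger a memory save manually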
+---
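+## 3. Why
+The current memory system exists in name only:
+- `~/.clacky/memories/` is empty; no knowledge has accumulated
+- The agent "loses its memory" across tasks and has to relearn user preferences and project conventions every time
+- Decisions the user stated explicitly (e.g. "don't use Redis") are forgotten by the next task
+- Compared with Claude Code, a layer of automatic context loading is missing
+Benefits of automatic injection:
+- Zero extra LLM calls; it leverages the existing prompt caching
+- The agent does not need to learn "when to recall"; the relevant content is already in the prompt
+- Project-level memory keeps context from getting mixed up when switching between projects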
+---
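+## 4. How
+Changes are concentrated in three modules:
+1. **SystemPromptBuilder**: add a `load_relevant_memories` method that injects relevant memory content automatically when the prompt is built
+2. **MemoryUpdater**: lower the iteration threshold, simplify the whitelist, and add logic for writing `.clacky/CLAUDE.md`
+3. **base_prompt.md**: update the memory-related rules (from "do not recall proactively" to "relevant memory is already injected automatically")
+Files in scope: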
+- `lib/clacky/agent/system_prompt_builder.rb`
+- `lib/clacky/agent/memory_updater.rb`
+- `lib/clacky/agent/skill_manager.rb`
+- `lib/clacky/agent.rb` (add the `/remember` command)
+- `lib/clacky/default_agents/base_prompt.md`
+---
+*End of Proposal*
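Section 2.1 of the proposal above describes the matching logic only in words: keywords from the working directory name and the current task, with the 1-3 most relevant files injected. One way this could look is sketched below. This is an illustration under stated assumptions, not the gem's `load_relevant_memories`: the helper name `select_relevant_memories`, the in-memory hash standing in for `~/.clacky/memories/`, and the scoring rule (count of keywords found as substrings) are all hypothetical, and only ASCII keywords are handled.

```ruby
# Rank memory entries by how many query keywords appear in their name or
# body, and return at most `limit` names that have a non-zero score.
# `memories` is a Hash of file name => file content (an assumption made
# to keep the sketch self-contained; a real version would read from disk).
def select_relevant_memories(memories, working_dir:, task:, limit: 3)
  splitter = /[^a-z0-9]+/i
  keywords = (File.basename(working_dir).split(splitter) + task.split(splitter))
             .map(&:downcase)
             .reject { |word| word.length < 3 } # drop empty and very short tokens
             .uniq

  scored = memories.map do |name, body|
    haystack = "#{name} #{body}".downcase
    [name, keywords.count { |keyword| haystack.include?(keyword) }]
  end

  scored.select { |_, score| score.positive? }
        .sort_by { |name, score| [-score, name] } # best score first, then by name
        .first(limit)
        .map(&:first)
end

# Made-up memory contents, for illustration only.
memories = {
  "redis-decision.md" => "User said: do not use Redis, use Postgres.",
  "shop-api.md"       => "shop-api uses Rails 7 and RSpec.",
  "cooking.md"        => "Favorite recipes."
}

puts select_relevant_memories(
  memories,
  working_dir: "/home/u/shop-api",
  task: "Add caching without Redis"
)
```

Because the selection is plain string matching, it adds no LLM call, which matches the "zero extra LLM calls" benefit listed in section 3.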

data/lib/clacky/agent/cost_tracker.rb CHANGED

@@ -47,8 +47,14 @@ module Clacky
         # Collect token usage data for this iteration (returned to caller for deferred display)
         token_data = collect_iteration_tokens(usage, iteration_cost)
-        # Update session bar cost in real-time (don't wait for agent.run to finish)
-        @ui&.update_sessionbar(cost: @total_cost, cost_source: @cost_source)
+        # Update session bar cost in real-time (don't wait for agent.run to finish).
+        # Subagents must NOT push their own (small, restarting-from-zero) cost into the
+        # shared UI — that would clobber the parent's accumulated total and cause the
+        # session bar to "jump back to ~$0" while a subagent is running, then snap back
+        # to the real total once the parent merges the subagent's cost. The parent agent
+        # is responsible for surfacing the merged cost after fork_subagent returns
+        # (see SkillManager#execute_skill_with_subagent and MemoryUpdater).
+        @ui&.update_sessionbar(cost: @total_cost, cost_source: @cost_source) unless @is_subagent
         # Track cache usage statistics (global)
         @cache_stats[:total_requests] += 1
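
The comment in the hunk above explains why a subagent must not push its own cost into the shared session bar. The effect of the `unless @is_subagent` guard can be reproduced with a small stand-alone sketch. The classes below (`SessionBarStub`, `CostTrackerSketch`) and the `merge_subagent` method are hypothetical stand-ins, not the gem's classes. Only the `update_sessionbar(cost:, cost_source:)` call shape and the guard are taken from the diff.

```ruby
# Records every cost value pushed to the (stubbed) session bar.
class SessionBarStub
  attr_reader :costs

  def initialize
    @costs = []
  end

  def update_sessionbar(cost:, cost_source: nil)
    @costs << cost
  end
end

# Simplified tracker: accumulates cost and updates the UI unless it
# belongs to a subagent, mirroring the guard added in the diff.
class CostTrackerSketch
  attr_reader :total_cost

  def initialize(ui:, is_subagent: false)
    @ui = ui
    @is_subagent = is_subagent
    @total_cost = 0.0
  end

  def record(iteration_cost)
    @total_cost += iteration_cost
    @ui&.update_sessionbar(cost: @total_cost, cost_source: :sketch) unless @is_subagent
  end

  # Hypothetical merge step: the parent adds the child's total and is the
  # only one that surfaces the merged number in the UI.
  def merge_subagent(child)
    @total_cost += child.total_cost
    @ui&.update_sessionbar(cost: @total_cost, cost_source: :sketch)
  end
end

ui = SessionBarStub.new
parent = CostTrackerSketch.new(ui: ui)
child = CostTrackerSketch.new(ui: ui, is_subagent: true)

parent.record(1.0)           # UI shows the parent's running total
child.record(0.25)           # guarded: the UI is not touched
parent.merge_subagent(child) # UI shows the merged total
p ui.costs
```

Without the guard, the second call would push the child's small total to the UI, which is the "jump back to ~$0" behaviour the comment describes.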

data/lib/clacky/agent/llm_caller.rb CHANGED

@@ -79,6 +79,14 @@ module Clacky
         # the error is something else and we let it propagate.
         force_reasoning_content_pad = false
         thinking_retry_attempted = false
+        # One-shot flag for context-overflow recovery. When the server complains
+        # the input exceeds the model's context window, we run a forced
+        # compression with pull_back_from_tail: 1 (preserves the model's
+        # two-checkpoint prompt cache) and retry the original request once.
+        # We retry at most once — if still overflowing afterward, the issue is
+        # something else (e.g. tool schemas alone exceed the window) and we let
+        # the error propagate.
+        context_overflow_retry_attempted = false
         begin
           begin
@@ -220,6 +228,55 @@ module Clacky
         end
         rescue Clacky::BadRequestError => e
+          # One-shot recovery for "context too long" errors. The model's
+          # context window is exceeded by the current history+tools+system
+          # prompt. We run a forced compression with pull_back_from_tail: 1
+          # (preserves the two-checkpoint prompt cache so the compression
+          # call itself still hits cache#A on the second-to-last position),
+          # then retry the original request once.
+          if !context_overflow_retry_attempted &&
+              !@compressing_for_overflow &&
+              context_too_long_error?(e) &&
+              respond_to?(:compress_messages_if_needed, true)
+            context_overflow_retry_attempted = true
+            Clacky::Logger.info(
+              "[context-overflow] caught BadRequestError, attempting forced compression with pull-back",
+              error_message: e.message[0, 200],
+              history_size: @history.size,
+              previous_total_tokens: @previous_total_tokens
+            )
+            # Layer 1: standard cache-preserving compression (pull_back: 1).
+            # Handles 99% of real overflow cases (newest message tipped the
+            # request just past the window).
+            if perform_context_overflow_compression(mode: :standard)
+              retry
+            end
+            # Layer 2: aggressive fallback. The Layer 1 compression call
+            # itself overflowed — happens when a single newly-appended
+            # message is enormous (huge tool_result, pasted file, etc.) so
+            # popping just K=1 didn't bring the request below the window.
+            # Pop ~half the history this time; sacrifices prompt cache to
+            # guarantee the compression call fits.
+            Clacky::Logger.warn(
+              "[context-overflow] standard compression failed, escalating to aggressive mode"
+            )
+            if perform_context_overflow_compression(mode: :aggressive)
+              retry
+            end
+            # Both layers exhausted. Let the original error propagate so the
+            # user sees the underlying provider message. This should be
+            # extremely rare — would require both halves of the history to
+            # individually exceed the window, which is essentially impossible
+            # under the "previous turn succeeded" invariant.
+            Clacky::Logger.error(
+              "[context-overflow] both standard and aggressive compression failed; " \
+              "propagating original error"
+            )
+            raise
+          end
           # One-shot recovery for thinking-mode providers (DeepSeek V4, Kimi K2)
           # that require every assistant message in the history to carry a
           # reasoning_content field. The history-evidence heuristic in
@@ -342,6 +399,101 @@ module Clacky
         )
       end
+      # Run a forced compression to recover from a context-overflow error.
+      # Called by the BadRequestError rescue when context_too_long_error?
+      # returns true.
+      #
+      # Two-layer defence:
+      # ────────────────────────────────────────────────────────────────────
+      # Layer 1 (mode: :standard, default) — preserves prompt cache.
+      #   Pop K=1 message from @history tail, then run compression. This
+      #   frees just enough token budget for the compression LLM call
+      #   itself to fit, while preserving the model's two-checkpoint prompt
+      #   cache (cache#A at second-to-last position is still hit). The
+      #   popped message is reattached to the rebuilt history's tail by
+      #   handle_compression_response, so recent task progress is not lost.
+      #   Handles 99% of real-world cases where overflow is caused by the
+      #   newest message pushing total just past the window.
+      #
+      # Layer 2 (mode: :aggressive) — sacrifices prompt cache to survive.
+      #   Pop ~half the history (capped) from the tail. This dramatically
+      #   shrinks the compression call's input regardless of how big any
+      #   single message is. Used as a fallback when Layer 1 itself raises
+      #   context_too_long — i.e. a single newly-appended message is so
+      #   large (e.g. >50K-token tool_result, pasted huge file) that even
+      #   removing it didn't bring the request under the window, OR the
+      #   popped message was small but earlier history grew past the limit.
+      #   Pulled-back messages are still reattached after compression so no
+      #   user content is silently dropped.
+      #
+      # @param mode [Symbol] :standard or :aggressive
+      # @return [Boolean] true if compression succeeded (caller should retry
+      #   the original request), false if compression was unable to run
+      #   (compression disabled, history too short, etc.) or itself failed
+      #   — caller decides whether to escalate to the next layer or
+      #   propagate the original error.
+      private def perform_context_overflow_compression(mode: :standard)
+        return false unless respond_to?(:compress_messages_if_needed, true)
+        # Compute pull-back count.
+        # Standard: K=1 (cache-preserving).
+        # Aggressive: pop ~half the history, but never less than 4 and never
+        #   more than (history_size - 2) so we always keep system + at least
+        #   one recent message. Capped at 64 to bound the worst case (an
+        #   enormous history that should never realistically occur).
+        pull_back =
+          if mode == :aggressive
+            half = @history.size / 2
+            [[half, 4].max, [@history.size - 2, 64].min].min
+          else
+            1
+          end
+        @compressing_for_overflow = true
+        compression_context = nil
+        begin
+          compression_context = compress_messages_if_needed(
+            force: true,
+            pull_back_from_tail: pull_back
+          )
+          return false if compression_context.nil?
+          compression_message = compression_context[:compression_message]
+          @history.append(compression_message)
+          response = call_llm  # recursive — guarded by @compressing_for_overflow
+          handle_compression_response(response, compression_context)
+          Clacky::Logger.info(
+            "[context-overflow] compression succeeded",
+            mode: mode,
+            pull_back: pull_back
+          )
+          true
+        rescue => e
+          # Compression failed mid-flight. Restore @history to a sensible state:
+          # roll back the compression instruction we appended, and re-append the
+          # pulled-back messages so the user's recent work isn't silently lost.
+          if compression_context
+            cm = compression_context[:compression_message]
+            @history.rollback_before(cm) if cm
+            (compression_context[:pulled_back_messages] || []).each do |m|
+              @history.append(m)
+            end
+          end
+          Clacky::Logger.warn(
+            "[context-overflow] compression failed during overflow recovery",
+            mode: mode,
+            pull_back: pull_back,
+            error_class: e.class.name,
+            error_message: e.message[0, 200]
+          )
+          false
+        ensure
+          @compressing_for_overflow = false
+        end
+      end
       # True when a 400 BadRequestError is specifically about a missing
       # reasoning_content field in thinking mode (DeepSeek V4, Kimi K2 thinking).
       # We require TWO distinct substrings to avoid false positives — a generic
@@ -358,6 +510,72 @@ module Clacky
            msg.include?("must be provided"))
       end
+      # True when a 400 BadRequestError indicates the request exceeded the
+      # model's context window (i.e. the conversation history is too long).
+      #
+      # We deliberately favour broad detection over narrow precision:
+      #   - False positive cost: one extra (no-op) compression cycle.
+      #   - False negative cost: user is stuck — every retry hits the same wall.
+      # So the matcher is intentionally permissive.
+      #
+      # Coverage (verified against real production error strings):
+      #
+      #   OpenAI:
+      #     "This model's maximum context length is 128000 tokens. However
+      #      you requested ... Please reduce the length of the messages."
+      #     error.code == "context_length_exceeded"
+      #
+      #   Anthropic:
+      #     "prompt is too long: 218849 tokens > 200000 maximum"
+      #
+      #   Qwen / Alibaba (DashScope):
+      #     "You passed 117345 input tokens and requested 8192 output tokens.
+      #      However the model's context length is only 125536 tokens, resulting
+      #      in a maximum input length of 117344 tokens. Please reduce the length
+      #      of the input prompt. (parameter=input_tokens, value=117345)"
+      #
+      #   Qwen / Alibaba (DashScope) — newer/terser format (qwen3.6 series):
+      #     "InternalError.Algo.InvalidParameter: Range of input length should be [1, 229376]"
+      #
+      #   DeepSeek / Kimi / MiniMax / most OpenAI-compatible relays:
+      #     Variants of OpenAI-style "context length" / "tokens exceeds" wording.
+      #
+      #   Generic gateways (Portkey, OpenRouter):
+      #     "The total number of tokens exceeds the model's maximum context length"
+      private def context_too_long_error?(err)
+        return false unless err.is_a?(Clacky::BadRequestError)
+        msg = err.message.to_s.downcase
+        # Strong phrases — any one of these is conclusive on its own.
+        # Each phrase is two-or-more semantic words to avoid single-word noise.
+        strong_phrases = [
+          "context length",                 # OpenAI / Qwen / many compat APIs
+          "context_length_exceeded",        # OpenAI error.code
+          "maximum context",                # OpenAI variant
+          "maximum input length",           # Qwen
+          "prompt is too long",             # Anthropic
+          "input is too long",              # Anthropic-compat relays
+          "exceeds the maximum context",    # Portkey & generic gateways
+          "exceeds the model's context",    # Generic
+          "exceeds the model's maximum",    # Generic
+          "reduce the length of the input", # Qwen action hint
+          "reduce the length of the messages", # OpenAI action hint
+          "reduce the length of your",      # Generic action hint
+          "reduce the length of the prompt", # Generic action hint
+          "range of input length"           # Qwen DashScope qwen3.6+ terse format
+        ]
+        return true if strong_phrases.any? { |p| msg.include?(p) }
+        # Pattern 1: Anthropic-style "<N> tokens > <N> maximum"
+        return true if msg =~ /\d+\s*tokens?\s*>\s*\d+/
+        # Pattern 2: Qwen-style structured field "parameter=input_tokens"
+        return true if msg.include?("parameter=input_tokens")
+        false
+      end
       # Detect upstream tool-call truncation and raise UpstreamTruncatedError
       # so the standard RetryableError rescue (with fallback model support)
       # handles retry identically to 5xx/429.
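
The pull-back count used by `perform_context_overflow_compression` is easier to follow with concrete numbers. The sketch below lifts only the arithmetic from the hunk above into a stand-alone function. The name `pull_back_count` and the `history_size` parameter are hypothetical (the real method reads `@history.size`). How `compress_messages_if_needed` treats the returned value is not shown in this diff.

```ruby
# Number of messages to pop from the history tail before compressing.
# Standard mode always pops 1; aggressive mode pops about half the
# history, with a floor of 4 and a ceiling of min(history_size - 2, 64).
def pull_back_count(history_size, mode: :standard)
  return 1 unless mode == :aggressive

  half = history_size / 2 # integer division
  [[half, 4].max, [history_size - 2, 64].min].min
end

[5, 6, 10, 200, 1000].each do |size|
  puts "history=#{size} standard=#{pull_back_count(size)} " \
       "aggressive=#{pull_back_count(size, mode: :aggressive)}"
end
```

For a 10-message history the aggressive count is 5, and for 200 or more messages it is capped at 64. The ceiling takes precedence over the floor, so a 5-message history gives 3, not 4. For histories of 2 messages or fewer the same formula gives 0 or a negative number, so the caller presumably relies on compression declining to run on such short histories.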