octo-agent 0.11.2
- checksums.yaml +7 -0
- data/.clacky/skills/commit/SKILL.md +423 -0
- data/.clacky/skills/gem-release/SKILL.md +199 -0
- data/.clacky/skills/gem-release/scripts/release.sh +304 -0
- data/.clacky/skills/oss-upload/SKILL.md +47 -0
- data/.octorules +106 -0
- data/.rspec +3 -0
- data/.rubocop.yml +8 -0
- data/CHANGELOG.md +76 -0
- data/CODE_OF_CONDUCT.md +132 -0
- data/CONTRIBUTING.md +92 -0
- data/Dockerfile +28 -0
- data/LICENSE.txt +22 -0
- data/POSITIONING.md +46 -0
- data/README.md +134 -0
- data/README_CN.md +134 -0
- data/Rakefile +34 -0
- data/benchmark/fixtures/sample_project/Gemfile +3 -0
- data/benchmark/fixtures/sample_project/lib/api_handler.rb +32 -0
- data/benchmark/fixtures/sample_project/lib/order_calculator.rb +23 -0
- data/benchmark/fixtures/sample_project/lib/user_renderer.rb +20 -0
- data/benchmark/fixtures/sample_project/spec/order_calculator_spec.rb +20 -0
- data/benchmark/results/EVALUATION_REPORT.md +165 -0
- data/benchmark/results/baseline_20260511_174424.json +128 -0
- data/benchmark/results/report_20260511_175256.json +271 -0
- data/benchmark/results/report_20260511_175444.json +271 -0
- data/benchmark/results/treatment_20260511_175103.json +130 -0
- data/benchmark/runner.rb +441 -0
- data/bin/octo +7 -0
- data/docs/agent-first-ui-design.md +77 -0
- data/docs/billing-system.md +318 -0
- data/docs/channel-architecture.md +235 -0
- data/docs/engineering-article.md +343 -0
- data/docs/session-skill-invocation.md +69 -0
- data/docs/time_machine_design.md +247 -0
- data/docs/ui2-architecture.md +124 -0
- data/homebrew/README.md +96 -0
- data/homebrew/openocto.rb +24 -0
- data/lib/octo/agent/hook_manager.rb +61 -0
- data/lib/octo/agent/llm_caller.rb +800 -0
- data/lib/octo/agent/memory_updater.rb +246 -0
- data/lib/octo/agent/message_compressor.rb +225 -0
- data/lib/octo/agent/message_compressor_helper.rb +869 -0
- data/lib/octo/agent/next_message_suggester.rb +215 -0
- data/lib/octo/agent/session_serializer.rb +685 -0
- data/lib/octo/agent/skill_auto_creator.rb +114 -0
- data/lib/octo/agent/skill_evolution.rb +61 -0
- data/lib/octo/agent/skill_manager.rb +466 -0
- data/lib/octo/agent/skill_reflector.rb +89 -0
- data/lib/octo/agent/system_prompt_builder.rb +101 -0
- data/lib/octo/agent/time_machine.rb +214 -0
- data/lib/octo/agent/tool_executor.rb +454 -0
- data/lib/octo/agent/tool_registry.rb +150 -0
- data/lib/octo/agent.rb +2180 -0
- data/lib/octo/agent_config.rb +989 -0
- data/lib/octo/agent_profile.rb +112 -0
- data/lib/octo/anthropic_stream_aggregator.rb +137 -0
- data/lib/octo/background_task_registry.rb +324 -0
- data/lib/octo/banner.rb +34 -0
- data/lib/octo/bedrock_stream_aggregator.rb +137 -0
- data/lib/octo/block_font.rb +331 -0
- data/lib/octo/cli.rb +968 -0
- data/lib/octo/client.rb +623 -0
- data/lib/octo/default_agents/SOUL.md +3 -0
- data/lib/octo/default_agents/USER.md +1 -0
- data/lib/octo/default_agents/base_prompt.md +66 -0
- data/lib/octo/default_agents/coding/profile.yml +2 -0
- data/lib/octo/default_agents/coding/system_prompt.md +67 -0
- data/lib/octo/default_agents/general/profile.yml +2 -0
- data/lib/octo/default_agents/general/system_prompt.md +16 -0
- data/lib/octo/default_parsers/doc_parser.rb +69 -0
- data/lib/octo/default_parsers/docx_parser.rb +188 -0
- data/lib/octo/default_parsers/pdf_parser.rb +120 -0
- data/lib/octo/default_parsers/pdf_parser_ocr.py +103 -0
- data/lib/octo/default_parsers/pdf_parser_plumber.py +62 -0
- data/lib/octo/default_parsers/pptx_parser.rb +140 -0
- data/lib/octo/default_parsers/xlsx_parser.rb +121 -0
- data/lib/octo/default_skills/browser-setup/SKILL.md +426 -0
- data/lib/octo/default_skills/channel-manager/SKILL.md +623 -0
- data/lib/octo/default_skills/channel-manager/dingtalk_setup.rb +191 -0
- data/lib/octo/default_skills/channel-manager/discord_setup.rb +199 -0
- data/lib/octo/default_skills/channel-manager/feishu_setup.rb +574 -0
- data/lib/octo/default_skills/channel-manager/import_lark_skills.rb +97 -0
- data/lib/octo/default_skills/channel-manager/install_feishu_skills.rb +105 -0
- data/lib/octo/default_skills/channel-manager/weixin_setup.rb +274 -0
- data/lib/octo/default_skills/code-explorer/SKILL.md +36 -0
- data/lib/octo/default_skills/cron-task-creator/SKILL.md +257 -0
- data/lib/octo/default_skills/cron-task-creator/evals/evals.json +38 -0
- data/lib/octo/default_skills/onboard/SKILL.md +578 -0
- data/lib/octo/default_skills/onboard/scripts/import_external_skills.rb +413 -0
- data/lib/octo/default_skills/onboard/scripts/install_builtin_skills.rb +97 -0
- data/lib/octo/default_skills/persist-memory/SKILL.md +59 -0
- data/lib/octo/default_skills/personal-website/SKILL.md +113 -0
- data/lib/octo/default_skills/personal-website/publish.rb +235 -0
- data/lib/octo/default_skills/product-help/SKILL.md +123 -0
- data/lib/octo/default_skills/product-help/docs/agent-config.md +74 -0
- data/lib/octo/default_skills/product-help/docs/best-practices.md +49 -0
- data/lib/octo/default_skills/product-help/docs/browser-tool.md +53 -0
- data/lib/octo/default_skills/product-help/docs/built-in-skills.md +43 -0
- data/lib/octo/default_skills/product-help/docs/cli-reference.md +82 -0
- data/lib/octo/default_skills/product-help/docs/create-your-first-skill.md +47 -0
- data/lib/octo/default_skills/product-help/docs/faq.md +98 -0
- data/lib/octo/default_skills/product-help/docs/how-to-use-a-skill.md +58 -0
- data/lib/octo/default_skills/product-help/docs/installation.md +59 -0
- data/lib/octo/default_skills/product-help/docs/memory-system.md +61 -0
- data/lib/octo/default_skills/product-help/docs/octorules.md +62 -0
- data/lib/octo/default_skills/product-help/docs/session-management.md +63 -0
- data/lib/octo/default_skills/product-help/docs/skill-basics.md +55 -0
- data/lib/octo/default_skills/product-help/docs/skill-frontmatter.md +61 -0
- data/lib/octo/default_skills/product-help/docs/web-server.md +49 -0
- data/lib/octo/default_skills/product-help/docs/what-is-octo.md +37 -0
- data/lib/octo/default_skills/product-help/docs/windows-installation.md +36 -0
- data/lib/octo/default_skills/product-help/docs/writing-tips.md +53 -0
- data/lib/octo/default_skills/recall-memory/SKILL.md +65 -0
- data/lib/octo/default_skills/skill-add/SKILL.md +59 -0
- data/lib/octo/default_skills/skill-add/scripts/install_from_zip.rb +295 -0
- data/lib/octo/default_skills/skill-creator/SKILL.md +602 -0
- data/lib/octo/default_skills/skill-creator/agents/analyzer.md +274 -0
- data/lib/octo/default_skills/skill-creator/agents/comparator.md +202 -0
- data/lib/octo/default_skills/skill-creator/agents/grader.md +223 -0
- data/lib/octo/default_skills/skill-creator/eval-viewer/generate_review.py +471 -0
- data/lib/octo/default_skills/skill-creator/eval-viewer/viewer.html +1325 -0
- data/lib/octo/default_skills/skill-creator/references/schemas.md +430 -0
- data/lib/octo/default_skills/skill-creator/scripts/__init__.py +0 -0
- data/lib/octo/default_skills/skill-creator/scripts/aggregate_benchmark.py +401 -0
- data/lib/octo/default_skills/skill-creator/scripts/generate_report.py +326 -0
- data/lib/octo/default_skills/skill-creator/scripts/improve_description.py +310 -0
- data/lib/octo/default_skills/skill-creator/scripts/quick_validate.py +103 -0
- data/lib/octo/default_skills/skill-creator/scripts/run_eval.py +317 -0
- data/lib/octo/default_skills/skill-creator/scripts/run_loop.py +331 -0
- data/lib/octo/default_skills/skill-creator/scripts/utils.py +47 -0
- data/lib/octo/default_skills/skill-creator/scripts/validate_skill_frontmatter.rb +143 -0
- data/lib/octo/idle_compression_timer.rb +115 -0
- data/lib/octo/json_ui_controller.rb +204 -0
- data/lib/octo/message_format/anthropic.rb +409 -0
- data/lib/octo/message_format/bedrock.rb +361 -0
- data/lib/octo/message_format/open_ai.rb +222 -0
- data/lib/octo/message_history.rb +373 -0
- data/lib/octo/openai_stream_aggregator.rb +130 -0
- data/lib/octo/plain_ui_controller.rb +166 -0
- data/lib/octo/providers.rb +534 -0
- data/lib/octo/server/browser_manager.rb +397 -0
- data/lib/octo/server/channel/adapters/base.rb +82 -0
- data/lib/octo/server/channel/adapters/dingtalk/adapter.rb +314 -0
- data/lib/octo/server/channel/adapters/dingtalk/api_client.rb +391 -0
- data/lib/octo/server/channel/adapters/dingtalk/stream_client.rb +203 -0
- data/lib/octo/server/channel/adapters/discord/adapter.rb +229 -0
- data/lib/octo/server/channel/adapters/discord/api_client.rb +107 -0
- data/lib/octo/server/channel/adapters/discord/gateway_client.rb +270 -0
- data/lib/octo/server/channel/adapters/feishu/adapter.rb +320 -0
- data/lib/octo/server/channel/adapters/feishu/bot.rb +478 -0
- data/lib/octo/server/channel/adapters/feishu/file_processor.rb +36 -0
- data/lib/octo/server/channel/adapters/feishu/message_parser.rb +129 -0
- data/lib/octo/server/channel/adapters/feishu/ws_client.rb +423 -0
- data/lib/octo/server/channel/adapters/telegram/adapter.rb +375 -0
- data/lib/octo/server/channel/adapters/telegram/api_client.rb +205 -0
- data/lib/octo/server/channel/adapters/wecom/adapter.rb +148 -0
- data/lib/octo/server/channel/adapters/wecom/media_downloader.rb +115 -0
- data/lib/octo/server/channel/adapters/wecom/ws_client.rb +395 -0
- data/lib/octo/server/channel/adapters/weixin/adapter.rb +692 -0
- data/lib/octo/server/channel/adapters/weixin/api_client.rb +402 -0
- data/lib/octo/server/channel/channel_config.rb +178 -0
- data/lib/octo/server/channel/channel_manager.rb +468 -0
- data/lib/octo/server/channel/channel_ui_controller.rb +224 -0
- data/lib/octo/server/channel.rb +33 -0
- data/lib/octo/server/discover.rb +77 -0
- data/lib/octo/server/epipe_safe_io.rb +105 -0
- data/lib/octo/server/http_server.rb +3554 -0
- data/lib/octo/server/scheduler.rb +317 -0
- data/lib/octo/server/server_master.rb +325 -0
- data/lib/octo/server/session_registry.rb +431 -0
- data/lib/octo/server/web_ui_controller.rb +487 -0
- data/lib/octo/session_manager.rb +385 -0
- data/lib/octo/skill.rb +466 -0
- data/lib/octo/skill_loader.rb +328 -0
- data/lib/octo/tools/base.rb +118 -0
- data/lib/octo/tools/browser.rb +625 -0
- data/lib/octo/tools/edit.rb +165 -0
- data/lib/octo/tools/file_reader.rb +549 -0
- data/lib/octo/tools/glob.rb +162 -0
- data/lib/octo/tools/grep.rb +356 -0
- data/lib/octo/tools/invoke_skill.rb +96 -0
- data/lib/octo/tools/list_tasks.rb +54 -0
- data/lib/octo/tools/redo_task.rb +41 -0
- data/lib/octo/tools/request_user_feedback.rb +84 -0
- data/lib/octo/tools/security.rb +333 -0
- data/lib/octo/tools/terminal/output_cleaner.rb +63 -0
- data/lib/octo/tools/terminal/persistent_session.rb +268 -0
- data/lib/octo/tools/terminal/safe_rm.sh +106 -0
- data/lib/octo/tools/terminal/session_manager.rb +213 -0
- data/lib/octo/tools/terminal.rb +1828 -0
- data/lib/octo/tools/todo_manager.rb +374 -0
- data/lib/octo/tools/trash_manager.rb +388 -0
- data/lib/octo/tools/undo_task.rb +35 -0
- data/lib/octo/tools/web_fetch.rb +242 -0
- data/lib/octo/tools/web_search.rb +260 -0
- data/lib/octo/tools/write.rb +77 -0
- data/lib/octo/ui2/block_font.rb +10 -0
- data/lib/octo/ui2/components/base_component.rb +163 -0
- data/lib/octo/ui2/components/command_suggestions.rb +290 -0
- data/lib/octo/ui2/components/common_component.rb +96 -0
- data/lib/octo/ui2/components/inline_input.rb +226 -0
- data/lib/octo/ui2/components/input_area.rb +1338 -0
- data/lib/octo/ui2/components/message_component.rb +99 -0
- data/lib/octo/ui2/components/modal_component.rb +419 -0
- data/lib/octo/ui2/components/todo_area.rb +149 -0
- data/lib/octo/ui2/components/tool_component.rb +107 -0
- data/lib/octo/ui2/components/welcome_banner.rb +139 -0
- data/lib/octo/ui2/layout_manager.rb +807 -0
- data/lib/octo/ui2/line_editor.rb +363 -0
- data/lib/octo/ui2/markdown_renderer.rb +100 -0
- data/lib/octo/ui2/output_buffer.rb +370 -0
- data/lib/octo/ui2/progress_handle.rb +362 -0
- data/lib/octo/ui2/progress_indicator.rb +55 -0
- data/lib/octo/ui2/screen_buffer.rb +273 -0
- data/lib/octo/ui2/terminal_detector.rb +119 -0
- data/lib/octo/ui2/theme_manager.rb +85 -0
- data/lib/octo/ui2/themes/base_theme.rb +105 -0
- data/lib/octo/ui2/themes/hacker_theme.rb +62 -0
- data/lib/octo/ui2/themes/minimal_theme.rb +56 -0
- data/lib/octo/ui2/thinking_verbs.rb +26 -0
- data/lib/octo/ui2/ui_controller.rb +1625 -0
- data/lib/octo/ui2/view_renderer.rb +177 -0
- data/lib/octo/ui2.rb +40 -0
- data/lib/octo/ui_interface.rb +154 -0
- data/lib/octo/utils/arguments_parser.rb +191 -0
- data/lib/octo/utils/browser_detector.rb +195 -0
- data/lib/octo/utils/encoding.rb +92 -0
- data/lib/octo/utils/environment_detector.rb +140 -0
- data/lib/octo/utils/file_ignore_helper.rb +170 -0
- data/lib/octo/utils/file_processor.rb +601 -0
- data/lib/octo/utils/gitignore_parser.rb +154 -0
- data/lib/octo/utils/limit_stack.rb +152 -0
- data/lib/octo/utils/logger.rb +124 -0
- data/lib/octo/utils/login_shell.rb +72 -0
- data/lib/octo/utils/model_pricing.rb +646 -0
- data/lib/octo/utils/parser_manager.rb +165 -0
- data/lib/octo/utils/path_helper.rb +15 -0
- data/lib/octo/utils/scripts_manager.rb +59 -0
- data/lib/octo/utils/string_matcher.rb +158 -0
- data/lib/octo/utils/trash_directory.rb +112 -0
- data/lib/octo/utils/workspace_rules.rb +46 -0
- data/lib/octo/version.rb +5 -0
- data/lib/octo/web/app.css +7141 -0
- data/lib/octo/web/app.js +543 -0
- data/lib/octo/web/apple-touch-icon.png +0 -0
- data/lib/octo/web/auth.js +150 -0
- data/lib/octo/web/channels.js +276 -0
- data/lib/octo/web/datepicker.js +205 -0
- data/lib/octo/web/favicon.png +0 -0
- data/lib/octo/web/i18n.js +1073 -0
- data/lib/octo/web/icon-512.png +0 -0
- data/lib/octo/web/icon-dark.svg +25 -0
- data/lib/octo/web/icon.svg +29 -0
- data/lib/octo/web/index.html +871 -0
- data/lib/octo/web/marked.min.js +69 -0
- data/lib/octo/web/onboard.js +491 -0
- data/lib/octo/web/profile.js +442 -0
- data/lib/octo/web/sessions.js +4421 -0
- data/lib/octo/web/settings.js +913 -0
- data/lib/octo/web/sidebar.js +32 -0
- data/lib/octo/web/skills.js +885 -0
- data/lib/octo/web/tasks.js +297 -0
- data/lib/octo/web/theme.js +105 -0
- data/lib/octo/web/trash.js +343 -0
- data/lib/octo/web/vendor/hljs/highlight.min.js +1244 -0
- data/lib/octo/web/vendor/hljs/hljs-theme.css +95 -0
- data/lib/octo/web/vendor/katex/auto-render.min.js +1 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_AMS-Regular.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Caligraphic-Bold.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Caligraphic-Regular.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Fraktur-Bold.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Fraktur-Regular.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Main-Bold.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Main-BoldItalic.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Main-Italic.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Main-Regular.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Math-BoldItalic.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Math-Italic.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_SansSerif-Bold.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_SansSerif-Italic.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_SansSerif-Regular.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Script-Regular.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Size1-Regular.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Size2-Regular.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Size3-Regular.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Size4-Regular.woff2 +0 -0
- data/lib/octo/web/vendor/katex/fonts/KaTeX_Typewriter-Regular.woff2 +0 -0
- data/lib/octo/web/vendor/katex/katex.min.css +1 -0
- data/lib/octo/web/vendor/katex/katex.min.js +1 -0
- data/lib/octo/web/version.js +449 -0
- data/lib/octo/web/weixin-qr.html +209 -0
- data/lib/octo/web/ws-dispatcher.js +357 -0
- data/lib/octo/web/ws.js +128 -0
- data/lib/octo.rb +145 -0
- data/scripts/build/build.sh +329 -0
- data/scripts/build/lib/apt.sh +56 -0
- data/scripts/build/lib/brew.sh +89 -0
- data/scripts/build/lib/colors.sh +17 -0
- data/scripts/build/lib/gem.sh +95 -0
- data/scripts/build/lib/mise.sh +125 -0
- data/scripts/build/lib/network.sh +157 -0
- data/scripts/build/lib/os.sh +57 -0
- data/scripts/build/lib/shell.sh +37 -0
- data/scripts/build/src/install.sh.cc +174 -0
- data/scripts/build/src/install_browser.sh.cc +101 -0
- data/scripts/build/src/install_full.sh.cc +290 -0
- data/scripts/build/src/install_rails_deps.sh.cc +145 -0
- data/scripts/build/src/install_system_deps.sh.cc +123 -0
- data/scripts/build/src/uninstall.sh.cc +101 -0
- data/scripts/install.ps1 +532 -0
- data/scripts/install.sh +567 -0
- data/scripts/install_browser.sh +479 -0
- data/scripts/install_full.sh +838 -0
- data/scripts/install_rails_deps.sh +746 -0
- data/scripts/install_system_deps.sh +518 -0
- data/scripts/uninstall.sh +287 -0
- data/sig/octo.rbs +4 -0
- metadata +614 -0
|
@@ -0,0 +1,343 @@
|
|
|
1
|
+
# Every AI Agent Feature Is a Cache Invalidation Surface
|
|
2
|
+
|
|
3
|
+
*May 19, 2026 · Yafei Lee / Founder of Octo*
|
|
4
|
+
|
|
5
|
+
---
|
|
6
|
+
|
|
7
|
+
I'm Yafei Lee, founder of [Octo](https://github.com/octo-ai/octo), an open-source AI Agent written in Ruby. We wanted an agent with skills, memory, sub-agents, browser automation, dynamic model switching, and long-running sessions. Each of those features made prompt caching worse in a different way.
|
|
8
|
+
|
|
9
|
+
That was the real architecture problem. Not how to call an LLM, not how to add another tool, not how to orchestrate more agents — how to keep the cache prefix stable while the product keeps changing.
|
|
10
|
+
|
|
11
|
+
**Every agent feature is also a cache invalidation surface.** Skills load new system context. Peer-agent workflows fork the prefix. Browser automation adds volatile tool output. Compression rewrites history. Model switching can fragment the cache namespace unless model-specific state stays out of the system prompt. If you're building a capable agent and your cache hit rate is much lower than expected, this is probably why.
|
|
12
|
+
|
|
13
|
+
Over two years and three architecture generations (the first two failed), we converged on seven engineering decisions that let us reach cache hit rates of 90% or more across real tasks while keeping all those features intact. What follows is the complete story: what broke, what we tried, and what actually worked.
|
|
14
|
+
|
|
15
|
+
---
|
|
16
|
+
|
|
17
|
+
## Generation 1: RAG Everything (2024 – early 2025)
|
|
18
|
+
|
|
19
|
+
Our first agent was a textbook RAG system. We embedded the user's codebase, docs, and conversation history into a vector store. Every query went through hybrid retrieval, re-ranking, and query rewriting before the LLM saw anything.
|
|
20
|
+
|
|
21
|
+
It sounded right. It wasn't.
|
|
22
|
+
|
|
23
|
+
The index was always behind the repo. Every codebase update required re-embedding, and real-time sync was unreliable enough that we kept paying to search context that was sometimes stale.
|
|
24
|
+
|
|
25
|
+
The bigger problem was recall. 90% sounds high until an agent chains multiple steps. A wrong file in step 2 becomes a wrong edit in step 3 and a wasted retry in step 4. We guessed that something closer to 97% recall might be the minimum for an agent to be net-positive, and we were not close.
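
To see why 90% stops sounding high, it helps to compound it. The sketch below assumes each step depends on one retrieval that succeeds independently with probability equal to recall, which is a simplification; `chain_success` is a hypothetical helper, not part of Octo.

```ruby
# Probability that every step of an n-step chain gets correct context,
# assuming each step's retrieval succeeds independently (a simplification).
def chain_success(recall, steps)
  recall**steps
end

[0.90, 0.97].each do |recall|
  [1, 5, 10].each do |steps|
    pct = (chain_success(recall, steps) * 100).round(1)
    puts "recall=#{recall} steps=#{steps} -> #{pct}% of chains stay on track"
  end
end
```

Under that assumption, a ten-step task at 90% recall stays on track about a third of the time, against roughly three quarters at 97%.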
|
|
26
|
+
|
|
27
|
+
For coding agents working over local repos, we killed RAG entirely. No embeddings, no vector store, no retrieval pipeline. If the agent needs context, it reads files directly or searches with `grep`. If your documentation needs to be accessible to an agent, make it readable on a website. Don't shred it into embeddings.
|
|
28
|
+
|
|
29
|
+
---
|
|
30
|
+
|
|
31
|
+
## Generation 2: Multi-Agent Orchestration (mid-2025)
|
|
32
|
+
|
|
33
|
+
The next idea came from the SWE-bench leaderboard playbook: a Planner agent, a Coder agent, a Reviewer agent, and a Tester agent, coordinated through a message bus with role-specific prompts.
|
|
34
|
+
|
|
35
|
+
We got decent SWE-bench scores. The product was terrible.
|
|
36
|
+
|
|
37
|
+
Every handoff was a cache miss. Each agent had its own system prompt and cache namespace, and passing context between agents meant serializing rich state into a smaller message. Useful context was lost at the boundary, and the receiving agent had to rebuild its own prefix.
|
|
38
|
+
|
|
39
|
+
The overhead was not subtle. A task that one agent could finish in 4 minutes took 14 minutes with four. Cost was roughly 6× higher. Agents waited for each other, re-read context the previous agent had already processed, and sometimes contradicted each other. When the final output was wrong, tracing the failure through Planner → Coder → Reviewer took longer than debugging a single conversation.
|
|
40
|
+
|
|
41
|
+
SWE-bench scores didn't predict user satisfaction. The failures that annoyed real users (slow iteration, lost context across handoffs, inconsistent code style) were not what the benchmark measured.
|
|
42
|
+
|
|
43
|
+
We killed role-based multi-agent orchestration. One main agent, one conversation, one cache namespace. Sub-agents survived only as isolated skill execution contexts, invoked through a single stable tool.
|
|
44
|
+
|
|
45
|
+
Two generations, same conclusion: the model is already smart enough. What it needs isn't more models, it's a better harness.
|
|
46
|
+
|
|
47
|
+
---
|
|
48
|
+
|
|
49
|
+
## The Seven Decisions
|
|
50
|
+
|
|
51
|
+
Generation 3 started from a question: *what if we optimized everything around a single agent's cache hit rate?* Not as a cost hack, but as an architectural principle. High cache hits mean the model sees consistent context, responds faster, and costs less. Every decision below serves that goal.
|
|
52
|
+
|
|
53
|
+
(The code is open source. Links to the exact files implementing each decision are at the end of this post.)
|
|
54
|
+
|
|
55
|
+
---
|
|
56
|
+
|
|
57
|
+
### Decision 1: History Growth Breaks Prefix Matching → Double Cache Markers
|
|
58
|
+
|
|
59
|
+
Prompt caching works by prefix matching. The LLM provider stores a hash of the message prefix; if your next request shares that prefix, you get the cached rate (depending on the provider, cached tokens are priced at a fraction of normal input tokens). The way you tell the provider where to cache is by placing `cache_control` markers on specific messages.
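
For readers who have not placed one of these markers, here is roughly what it looks like in a request. The shape follows Anthropic's Messages API, where a content block carries `cache_control` with type `ephemeral`; other providers differ. `mark_for_cache` is a hypothetical helper and the message hash layout is an assumption.

```ruby
require "json"

# Return a copy of a message with a cache breakpoint on its last content block.
# The cache_control shape follows Anthropic's Messages API; other providers differ.
def mark_for_cache(message)
  content = message[:content]
  blocks = content.is_a?(String) ? [{ type: "text", text: content }] : content.map(&:dup)
  blocks[-1] = blocks[-1].merge(cache_control: { type: "ephemeral" })
  message.merge(content: blocks)
end

msg = { role: "user", content: "Refactor order_calculator.rb" }
puts JSON.pretty_generate(mark_for_cache(msg))
```

The helper returns a copy, so the stored history stays unmarked and markers can be recomputed from scratch on every request.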
|
|
60
|
+
|
|
61
|
+
The naive approach is one marker on the last message. It breaks in three ways:
|
|
62
|
+
|
|
63
|
+
1. **History grows monotonically.** You mark message N. Next turn, message N+1 is appended and the marker moves to it. The request no longer carries a breakpoint where the previous turn wrote one, so it's a cache miss on the entire history.
|
|
64
|
+
2. **Tool call retries.** The model's last tool call errors out, or the user hits Ctrl-C. The "last message" gets discarded, and your marker vanishes with it.
|
|
65
|
+
3. **Mid-session model switches.** The user switches from Sonnet to Opus. You want to share as much prefix as possible across models. Any unnecessary marker movement becomes a cache miss event.
|
|
66
|
+
|
|
67
|
+
We hit problem (1) first. The fix progression is visible in our git log:
|
|
68
|
+
|
|
69
|
+
```
|
|
70
|
+
8ff66cc fix: cache
|
|
71
|
+
6ea99fe fix: prompt cache
|
|
72
|
+
e9a3602 feat: prompt cache works fine
|
|
73
|
+
7734c97 feat: try 2 point cache
|
|
74
|
+
```
|
|
75
|
+
|
|
76
|
+
The first three commits were incremental patches. The last one was the structural fix: **two markers instead of one.**
|
|
77
|
+
|
|
78
|
+
#### How double markers work
|
|
79
|
+
|
|
80
|
+
Every turn, we mark **two** consecutive messages, not one:
|
|
81
|
+
|
|
82
|
+
```
|
|
83
|
+
Turn N:    [..., msg_A, msg_B(*), msg_C(*)]
                        ↑         ↑
                        marker 1  marker 2

Turn N+1:  [..., msg_A, msg_B(*),      msg_C(*),      msg_D(*)]
                        ↑              ↑              ↑
                        (still there)  (still there)  new marker
|
|
90
|
+
```
|
|
91
|
+
|
|
92
|
+
On turn N+1, the provider tries to match the marker on `msg_C` and hits everything up to and including it (system prompt + tools + full history minus the last message). We place a new marker on `msg_D` for the next turn.
|
|
93
|
+
|
|
94
|
+
This is a **rolling double buffer**: at any moment we hold two breakpoints — one being "read" (from the previous turn) and one being "written" (at the current tail). Next turn, the old "write" becomes the new "read," and we write a fresh one at the new tail. There's never a moment where both buffers are invalid simultaneously.
|
|
95
|
+
|
|
96
|
+
#### Why exactly 2, not 3 or 4
|
|
97
|
+
|
|
98
|
+
Each additional marker costs a cache write at write-tier pricing. The only failure boundary we need to cover is the "old tail / new tail" edge, and two markers is exactly the minimum for that. A third marker lands further back in the prefix, writing a segment that will never be read independently. 2 covers the boundary. 3 is redundant.
|
|
99
|
+
|
|
100
|
+
#### Surviving tool call retries
|
|
101
|
+
|
|
102
|
+
This is the second benefit, and the actual motivation behind commit `7734c97`. When the model retries a tool call (error, Ctrl-C, broken stream), the last message gets discarded. With a single marker, that's an immediate cache miss. With double markers, the second-to-last marker usually survives, so single-step rollback still hits cache. Three markers would survive two-step rollbacks, but the cost doesn't justify the edge case.
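
A toy model makes the rollback case concrete. It assumes the provider caches the prefix ending at each marked message and, on the next request, only looks up prefixes that end at a marked message; real providers differ in detail. `ToyPrefixCache` and `tail_markers` are hypothetical names, not Octo code.

```ruby
require "digest"
require "set"

# Toy provider: remembers a digest for every prefix that ended at a marker.
class ToyPrefixCache
  def initialize
    @stored = Set.new
  end

  # messages: array of strings; marked: indexes carrying a cache marker.
  # Returns how many messages were served from cache (longest marked prefix
  # already stored), then stores every marked prefix for later requests.
  def request(messages, marked)
    digests = marked.to_h { |i| [i, Digest::SHA256.hexdigest(messages[0..i].join("\u0000"))] }
    hit = marked.select { |i| @stored.include?(digests[i]) }.max
    @stored.merge(digests.values)
    hit ? hit + 1 : 0
  end
end

# Indexes of the last `count` messages of the history.
def tail_markers(messages, count)
  ((messages.size - count).clamp(0, messages.size)...messages.size).to_a
end

[1, 2].each do |count|
  cache = ToyPrefixCache.new
  history = %w[system user_1 assistant_1 tool_call_1]
  cache.request(history, tail_markers(history, count))
  # The tool call fails and is discarded; the model retries with a new message.
  retried = history[0..-2] + ["tool_call_1_retry"]
  hits = cache.request(retried, tail_markers(retried, count))
  puts "#{count} marker(s): #{hits} of #{retried.size} messages served from cache"
end
```

In this model the single-marker run serves nothing from cache after the rollback, while the two-marker run still serves the three messages before the discarded one.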
|
|
103
|
+
|
|
104
|
+
#### Messages that must never be marked
|
|
105
|
+
|
|
106
|
+
Our marker selection logic has one hard rule: skip any message tagged `system_injected: true`. These are ephemeral messages (session context blocks, compression instructions) that won't exist in the same form next turn. A marker on them is a write that will never be read back. The selector walks backward from the tail, skips `system_injected` messages, and stops when it has two real conversation messages.
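
The selection rule above can be sketched in a few lines. This is a minimal illustration, assuming history is an array of hashes with an optional `system_injected` key; `cache_marker_indexes` is a hypothetical name, not the function in the Octo codebase.

```ruby
# Walk backward from the tail and pick the last two messages that are real
# conversation turns, skipping anything tagged system_injected.
def cache_marker_indexes(history, count: 2)
  picked = []
  (history.size - 1).downto(0) do |i|
    next if history[i][:system_injected]

    picked << i
    break if picked.size == count
  end
  picked.reverse
end

history = [
  { role: "user", content: "Fix the failing spec" },
  { role: "assistant", content: "Reading spec/order_calculator_spec.rb" },
  { role: "user", content: "[Session context: ...]", system_injected: true },
  { role: "user", content: "Also update the README" }
]
p cache_marker_indexes(history)
```

Here the session context block at index 2 is skipped, so the markers land on indexes 1 and 3.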
|
|
107
|
+
|
|
108
|
+
---
|
|
109
|
+
|
|
110
|
+
### Decision 2: Dynamic Session State Breaks System Prompts → Frozen System Prompt
|
|
111
|
+
|
|
112
|
+
Engineering discipline: our agent's system prompt is built once at session start, then byte-frozen. Any requirement to put dynamic information in the system prompt gets redirected elsewhere.
|
|
113
|
+
|
|
114
|
+
This is the foundation of the entire cache strategy. If the system prompt changes, every subsequent cache entry is invalidated. There is no partial fix.
|
|
115
|
+
|
|
116
|
+
But at least four kinds of information naturally "want" to live in the system prompt:
|
|
117
|
+
|
|
118
|
+
1. **Current date, working directory, OS** — the model needs these for correct commands.
|
|
119
|
+
2. **Current model ID** — helpful for self-adaptive behavior.
|
|
120
|
+
3. **Newly installed skills** — the model needs to see skill names to invoke them.
|
|
121
|
+
4. **Updated user preferences** (USER.md / SOUL.md) — the agent's personality and user context.
|
|
122
|
+
|
|
123
|
+
All four can change mid-session. If any of them is in the system prompt, a single change invalidates everything.
|
|
124
|
+
|
|
125
|
+
#### The [session context] block
|
|
126
|
+
|
|
127
|
+
Instead of the system prompt, we inject this information as a regular `user` message in the conversation history:
|
|
128
|
+
|
|
129
|
+
```
|
|
130
|
+
[Session context: Today is 2026-05-13, Wednesday. Current model: claude-sonnet-4-6.
|
|
131
|
+
OS: macOS. Working directory: /Users/.../project]
|
|
132
|
+
```
|
|
133
|
+
|
|
134
|
+
This message is tagged `system_injected: true`. It won't be selected by cache markers (Decision 1), won't count as a real user turn, and gets discarded during compression. Injection is date-gated: one per day, plus one on model switch. Most sessions see exactly one.
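
The date gate is simple enough to sketch. The version below assumes the gate only needs to remember the last injected date and model; `SessionContextGate` and its keyword arguments are hypothetical, and the model names are placeholders.

```ruby
require "date"

# Hypothetical tracker: decides when a fresh [Session context] block is due.
class SessionContextGate
  def initialize
    @last_date = nil
    @last_model = nil
  end

  # Returns the message to append, or nil when nothing changed since last time.
  def next_message(today:, model:, os:, cwd:)
    return nil if today == @last_date && model == @last_model

    @last_date = today
    @last_model = model
    text = "[Session context: Today is #{today.iso8601}, #{today.strftime('%A')}. " \
           "Current model: #{model}. OS: #{os}. Working directory: #{cwd}]"
    { role: "user", content: text, system_injected: true }
  end
end

gate = SessionContextGate.new
day = Date.new(2026, 5, 19)
p gate.next_message(today: day, model: "model-a", os: "macOS", cwd: "/tmp/project")
p gate.next_message(today: day, model: "model-a", os: "macOS", cwd: "/tmp/project") # nil: same day, same model
p gate.next_message(today: day, model: "model-b", os: "macOS", cwd: "/tmp/project") # new block after a model switch
```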
|
|
135
|
+
|
|
136
|
+
#### A bug that took a day to find
|
|
137
|
+
|
|
138
|
+
Our first implementation of `inject_session_context` was eager. It fired during agent construction, before the system prompt was built. This meant `@history.empty?` returned `false`, so `run()` skipped system prompt construction entirely. The agent sent its first request with a "today is Tuesday" message but no system prompt. Behavior was subtly broken for a day before we traced it.
|
|
139
|
+
|
|
140
|
+
The fix was one line: inject after the system prompt is built. The code comment that survived:
|
|
141
|
+
|
|
142
|
+
```ruby
|
|
143
|
+
# IMPORTANT: Skip injection when the system prompt hasn't been built yet.
|
|
144
|
+
# Otherwise, appending a user message to an empty history makes
|
|
145
|
+
# @history.empty? false, which causes run() to skip building the
|
|
146
|
+
# system prompt entirely.
|
|
147
|
+
```
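
A stripped-down stand-in shows the guard in context. It assumes, as the comment describes, that `run` only builds the system prompt when the history is empty; `TinyAgent` is hypothetical and omits everything else the real agent does.

```ruby
# Minimal stand-in for the agent, showing only the ordering guard.
class TinyAgent
  attr_reader :history

  def initialize
    @history = []
  end

  def inject_session_context(text)
    # Guard: an empty history means the system prompt has not been built yet.
    return if @history.empty?

    @history << { role: "user", content: text, system_injected: true }
  end

  def run(user_input)
    # The system prompt is only built on the first run, when history is empty.
    @history << { role: "system", content: "You are a coding agent." } if @history.empty?
    inject_session_context("[Session context: ...]")
    @history << { role: "user", content: user_input }
  end
end

agent = TinyAgent.new
agent.inject_session_context("[Session context: ...]") # too early, ignored
agent.run("List the files")
p agent.history.map { |m| m[:role] }
```

Without the guard, the early call would leave one message in the history and `run` would never add the system prompt.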
|
|
148
|
+
|
|
149
|
+
Assembly order matters more than content. You can spend weeks designing each piece of the prefix, but if the assembly sequence is wrong by one step, the entire cache strategy is void.
|
|
150
|
+
|
|
151
|
+
#### How skill discovery works without touching the system prompt
|
|
152
|
+
|
|
153
|
+
Skills are rendered into the system prompt at session start, then frozen. A skill installed mid-session won't appear until the next session. We accept this friction. Re-rendering the system prompt on every skill install would invalidate the whole cached prefix of every session that picked up the change. Skill installation is low-frequency; cache hits are per-turn. The tradeoff is clear.
|
|
154
|
+
|
|
155
|
+
That said, `invoke_skill` reads each SKILL.md at call time, not at session start. So if a user explicitly asks for a newly installed skill, the system can still find and execute it, though it won't auto-discover it from the skill listing.
|
|
156
|
+
|
|
157
|
+
---
|
|
158
|
+
|
|
159
|
+
### Decision 3: Skills and Sub-Agents Bloat History → One Meta-Tool
|
|
160
|
+
|
|
161
|
+
`invoke_skill` is one of our 16 tools and does more work than any other. It provides skill hot-loading, sub-agent architecture, memory recall, and skill self-evolution, all in under 200 tokens of system prompt.
|
|
162
|
+
|
|
163
|
+
It spawns a sub-agent with its own conversation history but the same 16 tools. When the sub-agent finishes, the main agent only sees `invoke_skill → result`. All intermediate steps stay in the sub-agent's isolated session.
|
|
164
|
+
|
|
165
|
+
This matters for caching: a code review skill might read dozens of files and produce a long analysis. Without isolation, all that intermediate work would inflate the main agent's history, triggering compression earlier and costing more. With `invoke_skill`, the main agent's history stays clean.
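
The isolation can be pictured with a small sketch. It fakes the sub-agent's work with a loop and assumes plain hashes for messages; this `invoke_skill` is a hypothetical stand-in for the real tool, and the result text is made up for illustration.

```ruby
# Hypothetical sketch: the sub-agent works in its own history, and the
# main history only records the tool call and its final result.
def invoke_skill(main_history, skill_name, task)
  sub_history = [{ role: "user", content: "Run skill #{skill_name}: #{task}" }]
  # Stand-in for the sub-agent's real work (file reads, analysis, and so on).
  12.times { |i| sub_history << { role: "tool", content: "read file_#{i}.rb" } }
  files_read = sub_history.count { |m| m[:role] == "tool" }
  result = "Reviewed #{files_read} files"
  sub_history << { role: "assistant", content: result }

  main_history << { role: "assistant", content: "invoke_skill(#{skill_name})" }
  main_history << { role: "tool", content: result }
  sub_history.size
end

main = [{ role: "user", content: "Review my last commit" }]
sub_size = invoke_skill(main, "code-review", "review HEAD")
puts "main history: #{main.size} messages, sub-agent history: #{sub_size} messages"
```

The main history grows by two messages no matter how many steps the sub-agent takes, which is what keeps the cached prefix short and compression far away.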
|
|
166
|
+
|
|
167
|
+
And for extensibility: need a new capability? Drop a SKILL.md in `~/.octo/skills/`. The `invoke_skill` tool is always present in the schema; it doesn't need to know about specific skills ahead of time. The SKILL.md is read at invocation time. This one tool replaces what would otherwise be ~20 specialized tools, each bloating the schema and increasing the cache invalidation surface.
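
Reading at invocation time can be as small as a file lookup. The sketch below assumes a `<skills_dir>/<name>/SKILL.md` layout and a conservative name pattern; `load_skill` is a hypothetical helper, and the demo uses a temporary directory instead of `~/.octo/skills/`.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical loader: resolves a skill by name only when it is invoked,
# so skills added after session start are still found.
def load_skill(skills_dir, name)
  raise ArgumentError, "invalid skill name: #{name}" unless name.match?(/\A[a-z0-9][a-z0-9_-]*\z/)

  path = File.join(skills_dir, name, "SKILL.md")
  File.exist?(path) ? File.read(path) : nil
end

Dir.mktmpdir do |dir|
  p load_skill(dir, "code-review") # nil: not installed yet
  FileUtils.mkdir_p(File.join(dir, "code-review"))
  File.write(File.join(dir, "code-review", "SKILL.md"), "# Code review\nRead the diff first.")
  puts load_skill(dir, "code-review") # found without rebuilding any prompt or schema
end
```

The name check is there because the skill name comes from model output and ends up in a file path.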
|
|
168
|
+
|
|
169
|
+
---
|
|
170
|
+
|
|
171
|
+
### Decision 4: Tool Growth Destabilizes Schema → Exactly 16 Tools
|
|
172
|
+
|
|
173
|
+
Tool schemas sit right after the system prompt in the cache prefix. If the schema changes, everything after it is invalidated. Every additional tool isn't just extra schema tokens; it's extra risk surface for cache invalidation the next time you change any tool.
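
One way to keep an eye on this, which is an addition for illustration and not something Octo is known to do, is to fingerprint the serialized tool schema and compare it across builds. The sketch assumes tools are plain hashes with a `name` key; both helper names are hypothetical.

```ruby
require "json"
require "digest"

# Serialize tool definitions in a fixed order so that the bytes sent to the
# provider do not depend on registration order.
def canonical_tool_schema(tools)
  JSON.generate(tools.sort_by { |t| t[:name] })
end

def schema_fingerprint(tools)
  Digest::SHA256.hexdigest(canonical_tool_schema(tools))[0, 12]
end

tools = [
  { name: "grep", description: "Search file contents", parameters: { pattern: "string" } },
  { name: "edit", description: "Edit a file", parameters: { path: "string" } }
]
puts schema_fingerprint(tools)
puts schema_fingerprint(tools.reverse) == schema_fingerprint(tools) # true: order no longer matters
changed = tools + [{ name: "lint", description: "Run a linter", parameters: {} }]
puts schema_fingerprint(changed) == schema_fingerprint(tools)       # false: the prefix would change
```

A changed fingerprint in a release is a signal that every cached prefix after the system prompt will be rewritten once.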
|
|
174
|
+
|
|
175
|
+
But too few tools also cost money. If the model has to take three steps for something that one well-designed tool could handle in one step, you're paying for extra turns.
|
|
176
|
+
|
|
177
|
+
Our answer after months of iteration: 16 tools. File I/O (3), search (2), execution (1), browser (1), web (2), task management (4), interaction (1), extension (1), safety (1).
|
|
178
|
+
|
|
179
|
+
The design principles are simple: minimize parameters per tool (fewer ways for the model to get it wrong), no overlap between tools, and heavy RSpec coverage on every tool. A tool bug cascades: wrong observation → wrong decision → wasted retries.
|
|
180
|
+
|
|
181
|
+
If we ever need a 17th tool, we'll add it. Four months in, we haven't. The capabilities that didn't become tools became skills instead: code analysis, memory, scheduling, sub-agent orchestration. Each routed through `invoke_skill`, invisible to the tool schema.
|
|
182
|
+
|
|
183
|
+
---
|
|
184
|
+
|
|
185
|
+
### Decision 5: Long Sessions Exceed Context Limits → Insert-Then-Compress
|
|
186
|
+
|
|
187
|
+
Context windows are finite. Long tasks will fill them. Compression is the single biggest threat to cache hit rates: replacing old messages with a summary changes the prefix, guaranteeing a cache miss. So the question is how to minimize the damage.
|
|
188
|
+
|
|
189
|
+
#### Don't use a separate model for compression
|
|
190
|
+
|
|
191
|
+
Many agents compress by spawning an independent LLM call with a cheap/fast model and a "you are a summarization assistant" system prompt.
|
|
192
|
+
|
|
193
|
+
The problems:
|
|
194
|
+
|
|
195
|
+
- The compression call's system prompt doesn't match the main session. It has zero shared prefix with the main cache, so it's a 100% miss on the compression call itself.
|
|
196
|
+
- After compression, the main session's history has changed (old messages replaced by summary), so the main session's cache is also invalidated. You're running cold for the next 4–5 turns.
|
|
197
|
+
|
|
198
|
+
You pay twice for every compression event: once for the compression call's miss, and once for the main session's cold-to-warm recovery.
|
|
199
|
+
|
|
200
|
+
Our approach: **Insert-then-Compress.** Instead of a separate call, we insert the compression instruction as a `system_injected` message at the end of the current conversation, then send a normal request.
|
|
201
|
+
|
|
202
|
+
The effect:
|
|
203
|
+
|
|
204
|
+
- The compression call hits the existing cache. Same system prompt, same tools, same history prefix. Only the tail instruction (~500 tokens) is cold.
|
|
205
|
+
- After compression, we rebuild history as `[system_prompt, summary, last_N_messages]`. This does miss once, but only once. From the second turn onward, double markers take over again.
|
|
206
|
+
|
|
207
|
+
| | Separate model | Insert-then-Compress |
|
|
208
|
+
|---|---|---|
|
|
209
|
+
| Compression call cache hit | 0% | **~95%** |
|
|
210
|
+
| Cold tokens during compression | ~50,000 | **~500** |
|
|
211
|
+
| Main session cold turns after | 4–5 | **1** |
|
|
212
|
+
|
|
213
|
+
*Comparison for a 50K-token session compression event.*
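The two steps of Insert-then-Compress can be sketched as follows. This is an illustration under assumptions, not the code in `message_compressor.rb`: the method names, the instruction text, the way the `system_injected` marker is represented (a flag on a hash), and the `keep_last` default are all hypothetical.

```ruby
# Hypothetical sketch of Insert-then-Compress; names and message shape are assumptions.
COMPRESSION_INSTRUCTION =
  "Summarize the conversation so far, keeping decisions, open tasks, and file paths."

# Step 1: append the instruction to the existing conversation. The system prompt,
# tools, and history prefix stay untouched, so only the tail is cold.
def build_compression_request(messages)
  messages + [{ role: "user", content: COMPRESSION_INSTRUCTION, system_injected: true }]
end

# Step 2: once the model returns a summary, rebuild a short history as
# [system_prompt, summary, last_N_messages].
def rebuild_history(messages, summary, keep_last: 4)
  system_prompt, *rest = messages
  [system_prompt, { role: "user", content: summary, compressed_summary: true }] +
    rest.last(keep_last)
end
```

`build_compression_request` returns a new array and leaves the original untouched, which keeps the prefix sent to the provider identical to the previous turn.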
|
|
214
|
+
|
|
215
|
+
#### The sweet spot: 200K–300K tokens
|
|
216
|
+
|
|
217
|
+
We tested multiple thresholds. 200K–300K tokens is where quality and cost balance. The model still effectively uses the context, with enough headroom to complete compression itself. After compression, history is always reduced to under 10K tokens, controlling the baseline cost of every subsequent turn.
|
|
218
|
+
|
|
219
|
+
#### Compress at idle, not at the next message
|
|
220
|
+
|
|
221
|
+
LLM providers typically expire prompt caches after ~5 minutes of inactivity (the exact TTL varies by provider). Once expired, the next turn is fully cold: 10× the cached price.
|
|
222
|
+
|
|
223
|
+
We run an idle timer (`idle_compression_timer.rb`): when the user stops typing for 90 seconds and history is approaching the threshold, we compress immediately, while the cache is still warm. The new short history establishes a fresh cache breakpoint before TTL expiry.
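The trigger condition can be sketched as a single predicate. Only the 90-second idle window comes from the text; the threshold value (taken here as the lower end of the 200K–300K range) and the 0.8 ratio used to model "approaching the threshold" are assumptions for illustration, and the method name is hypothetical.

```ruby
# Hypothetical sketch; only the 90-second idle window comes from the text.
IDLE_SECONDS = 90                 # user has stopped typing for this long
COMPRESSION_THRESHOLD = 200_000   # assumed: lower end of the 200K-300K range
APPROACH_RATIO = 0.8              # assumed meaning of "approaching the threshold"

# True when the session should be compressed now, while the cache is still warm.
def compress_on_idle?(idle_seconds:, history_tokens:)
  idle_seconds >= IDLE_SECONDS &&
    history_tokens >= COMPRESSION_THRESHOLD * APPROACH_RATIO
end
```

Because 90 seconds is well inside a ~5-minute cache TTL, a timer built on this check fires while the existing prefix can still be reused for the compression call.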
|
|
224
|
+
|
|
225
|
+
When the user comes back after a few minutes of thinking, the session is already compressed and warm. Without this, they'd face a cache-expired 300K-token history at full price. This single behavior saves roughly 10× on long-pause sessions.
|
|
226
|
+
|
|
227
|
+
#### The million-token context trap
|
|
228
|
+
|
|
229
|
+
"Million-token context" sounds impressive, but the model re-reads the entire context every turn. 1M tokens of input, even at 100% cache hit (0.1× price), costs the equivalent of 100K full-price tokens per turn. One cache miss and you pay for 1M tokens at full rate. Add the well-documented attention degradation in ultra-long contexts, and the math is clear.
|
|
230
|
+
|
|
231
|
+
Our strategy is the opposite of "fill up the context window": compress aggressively, keep history short. 10K tokens of compressed history at 95% cache hit is cheaper and more effective than 1M tokens of raw history at 99% cache hit.
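The arithmetic behind that comparison, as a small sketch. It assumes the 0.1× cached-token price quoted above and counts input tokens only; the helper name `effective_tokens` is ours.

```ruby
# Full-price-equivalent input tokens for one turn.
# cached_multiplier 0.1 is the cached-token price ratio quoted in the text.
def effective_tokens(context_tokens, cache_hit_rate, cached_multiplier: 0.1)
  blended = cache_hit_rate * cached_multiplier + (1.0 - cache_hit_rate)
  (context_tokens * blended).round
end

puts effective_tokens(1_000_000, 1.0)   # 1M context, perfect cache
puts effective_tokens(1_000_000, 0.99)  # 1M context, 99% hit
puts effective_tokens(10_000, 0.95)     # 10K compressed history, 95% hit
```

Under these assumptions the three cases come to 100,000, 109,000, and 1,450 full-price-equivalent tokens per turn, so the compressed history is about 75 times cheaper per turn than the 1M-token history even though its hit rate is lower.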
|
|
232
|
+
|
|
233
|
+
---
|
|
234
|
+
|
|
235
|
+
### Decision 6: File Parsing Wants More Tools → Self-Maintained Scripts
|
|
236
|
+
|
|
237
|
+
PDF, Excel, Word, and PowerPoint parsing are common agent needs. Built-in tools would bloat the schema (violating Decision 4) and require C extensions (breaking the zero-dependency install). Requiring users to install skills first is bad UX.
|
|
238
|
+
|
|
239
|
+
Our third path: on first install, copy a set of Python parsing scripts to `~/.octo/scripts/`, then let the agent maintain them.
|
|
240
|
+
|
|
241
|
+
When the agent needs to read a PDF, it runs `python3 ~/.octo/scripts/read_pdf.py <file>` via the `terminal` tool. The tool list doesn't grow. If a script fails (missing dependency, format edge case), the agent can fix the script and `pip install` whatever's needed. The capability isn't hard-coded in the gem. It lives in user-space scripts that the agent itself maintains and improves over time.
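From the Ruby side, that invocation might look roughly like the sketch below. Only `read_pdf.py` and the `~/.octo/scripts/` location come from the text; `parser_command` and `run_parser` are hypothetical helper names, and the real agent goes through its `terminal` tool rather than calling `Open3` directly.

```ruby
require "open3"

# Hypothetical sketch: helper names are assumptions, not Octo's API.

# Build the argv for a user-space parsing script.
def parser_command(script_name, file, scripts_dir: File.join(Dir.home, ".octo", "scripts"))
  ["python3", File.join(scripts_dir, script_name), file]
end

# Run the script and return its text output, or the error text the agent
# would read before fixing the script or installing a missing dependency.
def run_parser(script_name, file, scripts_dir: File.join(Dir.home, ".octo", "scripts"))
  command = parser_command(script_name, file, scripts_dir: scripts_dir)
  stdout, stderr, status = Open3.capture3(*command)
  status.success? ? { ok: true, text: stdout } : { ok: false, error: stderr }
end
```

Passing the command as an argv array rather than a shell string avoids quoting problems with file names that contain spaces.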
|
|
242
|
+
|
|
243
|
+
Why Python for scripts when the agent is Ruby? Pragmatism. Python's document processing ecosystem (`pdfplumber`, `openpyxl`, `python-docx`) is the most mature. We use the best tool for each layer.
|
|
244
|
+
|
|
245
|
+
---
|
|
246
|
+
|
|
247
|
+
### Decision 7: Browser Automation Wants Many MCP Tools → One Stable Browser Tool
|
|
248
|
+
|
|
249
|
+
Browser automation matters for agents, but the mainstream approaches have problems. Headless browsers (Puppeteer/Playwright) are invisible to the user, frequently blocked by anti-bot detection, and can't access existing login sessions. External MCP services require separate installation and may expose dozens of fine-grained tools that bloat the schema.
|
|
250
|
+
|
|
251
|
+
We take over the user's actual Chrome/Edge instead. The user enables Remote Debugging once (guided by a setup skill), and our built-in MCP client connects via stdio JSON-RPC. The agent operates on the browser the user can see — same cookies, same login sessions, same page state. When the agent clicks a button, the user watches it happen.
|
|
252
|
+
|
|
253
|
+
To the model, `browser` is one tool out of 16 with a stable schema. The complexity of daemon lifecycle management (startup, heartbeat, crash recovery) lives in `browser_manager.rb`, invisible to the cache layer.
|
|
254
|
+
|
|
255
|
+
This comes with obvious safety concerns. We keep the browser visible at all times, require explicit user-initiated setup, and treat browser automation as a high-trust local capability rather than a background cloud service. It is powerful precisely because it runs in the user's real session, so it should be used with the same caution as giving an assistant access to your logged-in browser.
|
|
256
|
+
|
|
257
|
+
---
|
|
258
|
+
|
|
259
|
+
## Why Ruby? (Yes, Really)
|
|
260
|
+
|
|
261
|
+
If you've read this far you might have noticed: this entire agent is written in Ruby. Not Python. Not TypeScript. Ruby.
|
|
262
|
+
|
|
263
|
+
On GitHub, there are about 4,700 repositories tagged "ai-agent" in Python, 2,800 in TypeScript, and **5 in Ruby.** Ruby is almost absent from the current AI agent ecosystem, which made this choice worth explaining.
|
|
264
|
+
|
|
265
|
+
We didn't choose Ruby to be contrarian. We chose it because the things an agent harness actually does — orchestrating API calls, managing cache boundaries, dynamically loading skills, maintaining tool registries — are things Ruby happens to be very good at.
|
|
266
|
+
|
|
267
|
+
Metaprogramming is a genuine advantage here. `method_missing`, `define_method`, `class_eval` — when your agent modifies its own helper scripts at runtime, when skills load dynamically without restart, when tool registration happens through introspection rather than config files, Ruby's metaprogramming pays real dividends.
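As an illustration of registration through introspection, here is a minimal sketch using Ruby's `inherited` hook. The `Tool` base class, its `registry`, and the name-derivation rule are hypothetical and are not Octo's actual tool registry.

```ruby
# Hypothetical sketch: class and method names are illustrative, not Octo's API.
class Tool
  @registry = {}

  class << self
    attr_reader :registry

    # Ruby calls this hook whenever a subclass is defined, so tools
    # register themselves without a config file.
    def inherited(subclass)
      super
      Tool.registry[subclass.tool_name] = subclass
    end

    # Derive "read_file" from a class named ReadFile.
    def tool_name
      name.gsub(/([a-z])([A-Z])/, '\1_\2').downcase
    end
  end
end

class ReadFile < Tool
  def call(path:)
    File.read(path)
  end
end

class WebSearch < Tool
end
```

Defining a new subclass is the whole registration step: after the two class definitions above, `Tool.registry` maps `"read_file"` and `"web_search"` to their classes.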
|
|
268
|
+
|
|
269
|
+
Distribution is frictionless. `gem install octo` — done. Version management, dependency resolution, executable registration (`octo` command), all out of the box. No virtual environments, no `node_modules`, no build step.
|
|
270
|
+
|
|
271
|
+
**Zero C extension dependencies.** This took significant engineering effort. Look at our gemspec:
|
|
272
|
+
|
|
273
|
+
```
|
|
274
|
+
faraday, thor, tty-prompt, tty-spinner, diffy, pastel,
|
|
275
|
+
tty-screen, tty-markdown, base64, logger, websocket,
|
|
276
|
+
webrick, artii, rubyzip, rouge, chunky_png
|
|
277
|
+
```
|
|
278
|
+
|
|
279
|
+
Every dependency is pure Ruby. No `brew install libxml2`, no `apt-get install libffi-dev`, no Xcode Command Line Tools.
|
|
280
|
+
|
|
281
|
+
To achieve this, we made unusual choices: pure-Ruby `websocket` gem instead of `websocket-driver` (which needs a C extension for UTF-8 validation); LLM streaming and tool_use protocol handling from scratch with raw `faraday` HTTP — because we needed direct control over `cache_control` field injection for Decision 1; terminal UI built with ANSI escape codes instead of `curses`.
|
|
282
|
+
|
|
283
|
+
These "build from scratch" decisions would have been impractical a few years ago. But the agent is itself an AI coding agent — we used it to write itself. A bootstrapping loop: the product made itself better.
|
|
284
|
+
|
|
285
|
+
---
|
|
286
|
+
|
|
287
|
+
## A Small Sanity Check, Not a Benchmark
|
|
288
|
+
|
|
289
|
+
A note on methodology: **this is not a rigorous benchmark.** We ran three real tasks (a slide deck, a marketing strategy, a social content pipeline) through four agents (ours, Claude Code, OpenClaw, Hermes) under controlled conditions — same prompt, same underlying model (claude-opus-4-7), same skills, same time window. All cost data comes from OpenRouter's per-request CSV billing, not estimates. Single run per agent, no cherry-picking.
|
|
290
|
+
|
|
291
|
+
We did this to get a feel for where we stand, not to make definitive claims. Take the numbers as directional.
|
|
292
|
+
|
|
293
|
+
| Agent | Cost | Requests | Cache Hit Rate |
|
|
294
|
+
|---|---|---|---|
|
|
295
|
+
| **Ours** | $5.10 | 51 | 90.6% |
|
|
296
|
+
| Claude Code | $5.49 | 70 | 95.2% |
|
|
297
|
+
| OpenClaw | $15.70 | 81 | 88.7% |
|
|
298
|
+
| Hermes | $30.14 | 218 | 60.3% |
|
|
299
|
+
|
|
300
|
+
*Total cost across 3 tasks. Data from OpenRouter per-request CSV billing.*
|
|
301
|
+
|
|
302
|
+
The cost difference isn't about unit price; prompt token pricing is roughly the same across agents using the same model. The difference is fewer requests × higher cache hit rate. 51 requests at 90.6% cache hit versus 218 requests at 60.3% cache hit — that's where the 6× gap comes from.
|
|
303
|
+
|
|
304
|
+
Claude Code's cache hit rate is actually higher than ours (95.2% vs 90.6%). They achieve this partly by having fewer features that conflict with caching. Our agent supports skills, sub-agents, browser automation, dynamic model switching, and idle compression — all things that structurally threaten cache coherence. Getting to 90.6% while supporting all of that is the engineering challenge this post describes.
|
|
305
|
+
|
|
306
|
+
Full results, per-task breakdowns, and the actual deliverables from each agent are at [octo.com/benchmark](https://www.octo.com/benchmark).
|
|
307
|
+
|
|
308
|
+
---
|
|
309
|
+
|
|
310
|
+
## Reproducibility
|
|
311
|
+
|
|
312
|
+
Everything needed to verify or re-run this comparison is public:
|
|
313
|
+
|
|
314
|
+
- **Runner script** — [`benchmark/runner.rb`](https://github.com/octo-ai/octo/blob/main/benchmark/runner.rb)
|
|
315
|
+
- **OpenRouter CSV billing data** — [`benchmark/results/`](https://github.com/octo-ai/octo/tree/main/benchmark/results) (per-request cost, cache hit/miss, token counts)
|
|
316
|
+
- **Task prompts and fixtures** — [`benchmark/fixtures/`](https://github.com/octo-ai/octo/tree/main/benchmark/fixtures)
|
|
317
|
+
- **Evaluation report** — [`benchmark/results/EVALUATION_REPORT.md`](https://github.com/octo-ai/octo/blob/main/benchmark/results/EVALUATION_REPORT.md)
|
|
318
|
+
|
|
319
|
+
We did not cherry-pick runs, post-process outputs, or re-run until numbers looked good. One run per agent, published as-is. This still does not make it a benchmark; it just makes the sanity check auditable. If you find errors in the data, open an issue.
|
|
320
|
+
|
|
321
|
+
---
|
|
322
|
+
|
|
323
|
+
## What We Actually Believe
|
|
324
|
+
|
|
325
|
+
These seven decisions share one conviction: spend your engineering budget on the harness, save your intelligence budget for the model.
|
|
326
|
+
|
|
327
|
+
We ripped out RAG because the model can read files directly. We killed multi-agent workflows because one main agent with good context management was faster, cheaper, and easier to debug. We still use sub-agents, but only behind `invoke_skill`, where they act as isolated execution sandboxes rather than peer collaborators. We kept the tool list small because the capabilities that didn't earn their place as tools became skills instead, routed through a single meta-tool.
|
|
328
|
+
|
|
329
|
+
These aren't universal truths. If you need real-time retrieval from a billion documents, or you're coordinating physical robots, your tradeoffs will differ. But for agents that help individual humans with coding and writing and automation, we think single-agent-with-great-caching has a lot of room to run.
|
|
330
|
+
|
|
331
|
+
Models get better fast. The things that *won't* be obsoleted by better models are the things we've invested in: cache geometry, tool stability, compression strategy, install experience. Harness-layer infrastructure that stays useful regardless of which model you plug in.
|
|
332
|
+
|
|
333
|
+
---
|
|
334
|
+
|
|
335
|
+
Octo is fully open-source under the MIT license. The code behind every decision in this post:
|
|
336
|
+
|
|
337
|
+
- Cache marker logic — [`lib/octo/client.rb`](https://github.com/octo-ai/octo/blob/main/lib/octo/client.rb)
|
|
338
|
+
- Insert-then-Compress — [`lib/octo/agent/message_compressor.rb`](https://github.com/octo-ai/octo/blob/main/lib/octo/agent/message_compressor.rb)
|
|
339
|
+
- Session context injection — [`lib/octo/agent.rb`](https://github.com/octo-ai/octo/blob/main/lib/octo/agent.rb)
|
|
340
|
+
- Idle compression timer — [`lib/octo/idle_compression_timer.rb`](https://github.com/octo-ai/octo/blob/main/lib/octo/idle_compression_timer.rb)
|
|
341
|
+
- Browser tool — [`lib/octo/tools/browser.rb`](https://github.com/octo-ai/octo/blob/main/lib/octo/tools/browser.rb)
|
|
342
|
+
|
|
343
|
+
→ [github.com/octo-ai/octo](https://github.com/octo-ai/octo)
|
|
@@ -0,0 +1,69 @@
|
|
|
1
|
+
# Session + Skill Invocation Pattern
|
|
2
|
+
|
|
3
|
+
> Design pattern for launching an Agent session that immediately runs a skill.
|
|
4
|
+
> Follow this whenever a UI action needs to "open a session and do something automatically."
|
|
5
|
+
|
|
6
|
+
---
|
|
7
|
+
|
|
8
|
+
## The Pattern
|
|
9
|
+
|
|
10
|
+
```
|
|
11
|
+
1. POST /api/sessions → create a named session
|
|
12
|
+
2. Sessions.add(session) → register locally
|
|
13
|
+
3. Sessions.renderList() → update sidebar
|
|
14
|
+
4. _bootUI() if needed → connect WS (only on first boot)
|
|
15
|
+
5. Sessions.select(session.id) → navigate to session (triggers WS subscribe)
|
|
16
|
+
6. WS.send({ type: "message", session_id, content: "/skill-name" })
|
|
17
|
+
→ agent runs the skill immediately
|
|
18
|
+
```
|
|
19
|
+
|
|
20
|
+
The slash command (`/skill-name`) is handled by `Agent#parse_skill_command` on the
|
|
21
|
+
server side — no special API endpoint or pending-state machinery needed.
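For orientation, server-side slash-command parsing could look something like the sketch below. The real `Agent#parse_skill_command` may differ; the regular expression, the allowed characters in a skill name, and the return shape are assumptions.

```ruby
# Hypothetical sketch of slash-command parsing; the real method may differ.
SKILL_COMMAND = %r{\A/([a-z0-9][a-z0-9_-]*)(?:\s+(.*))?\z}m

# Returns { skill:, args: } for "/skill-name optional args",
# or nil for ordinary chat messages.
def parse_skill_command(content)
  match = SKILL_COMMAND.match(content.strip)
  return nil unless match

  { skill: match[1], args: match[2].to_s }
end
```

Because the command travels as an ordinary `message`, the client needs nothing beyond the `WS.send` call in step 6.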
|
|
22
|
+
|
|
23
|
+
---
|
|
24
|
+
|
|
25
|
+
## Real Usages
|
|
26
|
+
|
|
27
|
+
### Create Task (`tasks.js → createInSession`)
|
|
28
|
+
```js
|
|
29
|
+
Sessions.select(session.id);
|
|
30
|
+
WS.send({ type: "message", session_id: session.id, content: "/create-task" });
|
|
31
|
+
```
|
|
32
|
+
|
|
33
|
+
### Onboard (`onboard.js → _startSoulSession`)
|
|
34
|
+
```js
|
|
35
|
+
_bootUI(); // WS.connect() + Tasks/Skills load
|
|
36
|
+
Sessions.add(session);
|
|
37
|
+
Sessions.renderList();
|
|
38
|
+
Sessions.select(session.id);
|
|
39
|
+
WS.send({ type: "message", session_id: session.id, content: "/onboard" });
|
|
40
|
+
```
|
|
41
|
+
|
|
42
|
+
---
|
|
43
|
+
|
|
44
|
+
## When to Use `pending_task` Instead
|
|
45
|
+
|
|
46
|
+
Use the `pending_task` registry field (and the `run_task` WS message) **only** when
|
|
47
|
+
the prompt is a large block of text read from a file (e.g. `POST /api/tasks/run`).
|
|
48
|
+
|
|
49
|
+
For slash commands, always prefer the direct `WS.send` approach above — simpler and
|
|
50
|
+
no server-side state to manage.
|
|
51
|
+
|
|
52
|
+
---
|
|
53
|
+
|
|
54
|
+
## Anti-patterns Avoided
|
|
55
|
+
|
|
56
|
+
| Anti-pattern | Why it was wrong |
|
|
57
|
+
|---|---|
|
|
58
|
+
| Store `_pendingSessionId` in module state, resolve on `session_list` | Race condition between WS connect and session_list arrival; unnecessary complexity |
|
|
59
|
+
| Custom `takePendingSession()` hook in app.js `session_list` handler | Spread logic across files; hard to trace |
|
|
60
|
+
| Send prompt via `setTimeout` after boot | Fragile timing; breaks if WS is slow |
|
|
61
|
+
|
|
62
|
+
---
|
|
63
|
+
|
|
64
|
+
## Key Insight
|
|
65
|
+
|
|
66
|
+
`Sessions.select(id)` triggers a WS `subscribe` message. Once the server confirms
|
|
67
|
+
with `subscribed`, the client is guaranteed to receive all subsequent broadcasts for
|
|
68
|
+
that session. Sending `WS.send({ type: "message" })` right after `select` is safe
|
|
69
|
+
because the WebSocket driver queues messages until the connection is open.
|
|
@@ -0,0 +1,247 @@
|
|
|
1
|
+
# Time Machine Design Documentation
|
|
2
|
+
|
|
3
|
+
## Overview
|
|
4
|
+
|
|
5
|
+
Time Machine lets users navigate the agent's task execution history, with undo/redo and branch exploration. Users open it with the ESC key or the `/undo` command, which shows an interactive menu of past tasks.
|
|
6
|
+
|
|
7
|
+
## Core Data Structure Design
|
|
8
|
+
|
|
9
|
+
### Task History Graph
|
|
10
|
+
|
|
11
|
+
The Time Machine uses a minimal tree-based data structure to track task relationships:
|
|
12
|
+
|
|
13
|
+
**Three Core State Variables:**
|
|
14
|
+
1. **task_parents** (Hash): Maps each task_id to its parent_id
|
|
15
|
+
- Forms a tree structure where each task points to its predecessor
|
|
16
|
+
- Root tasks have parent_id = 0
|
|
17
|
+
- Enables traversal in both directions (child→parent by direct lookup, parent→children by scanning the hash)
|
|
18
|
+
|
|
19
|
+
2. **current_task_id** (Integer): The latest created task ID
|
|
20
|
+
- Always increments when new tasks are created
|
|
21
|
+
- Never decreases, even during undo operations
|
|
22
|
+
- Represents the "tip" of the execution timeline
|
|
23
|
+
|
|
24
|
+
3. **active_task_id** (Integer): The current active position in history
|
|
25
|
+
- Can move backward/forward during undo/redo
|
|
26
|
+
- Determines which messages are visible to the LLM
|
|
27
|
+
- When active_task_id < current_task_id, we're viewing "past" state
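The three state variables can be sketched as a small class. This is an illustration under assumptions, not the actual module: the class and method names are hypothetical, and `redo_task` picking the newest child when several branches exist is our choice for the sketch.

```ruby
# Hypothetical sketch of the three state variables; not the actual Octo module.
class TaskHistory
  attr_reader :task_parents, :current_task_id, :active_task_id

  def initialize
    @task_parents = {}    # task_id => parent_id (0 for root tasks)
    @current_task_id = 0  # latest created ID; only ever increases
    @active_task_id = 0   # current position in history
  end

  # New tasks attach to the active position, so creating one after an undo starts a branch.
  def start_task
    @current_task_id += 1
    @task_parents[@current_task_id] = @active_task_id
    @active_task_id = @current_task_id
  end

  # Child -> parent is a direct lookup.
  def undo_task
    return false if @active_task_id.zero?

    @active_task_id = @task_parents.fetch(@active_task_id)
    true
  end

  # Assumption: redo follows the newest child of the active task.
  def redo_task
    next_id = children(@active_task_id).max
    return false if next_id.nil?

    @active_task_id = next_id
    true
  end

  # Parent -> children needs a scan of the hash.
  def children(task_id)
    @task_parents.select { |_, parent| parent == task_id }.keys
  end

  def branches?(task_id)
    children(task_id).length > 1
  end
end
```

Undoing never touches `current_task_id`, so a task started after an undo always receives a fresh ID and the earlier branch stays reachable.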
|
|
28
|
+
|
|
29
|
+
### Task Metadata Structure
|
|
30
|
+
|
|
31
|
+
Each task in the history contains:
|
|
32
|
+
- **task_id**: Unique identifier (auto-incrementing integer)
|
|
33
|
+
- **summary**: Brief description (first 80 chars of user's message)
|
|
34
|
+
- **status**: One of three states
|
|
35
|
+
- `:past` - Task is before the current active position
|
|
36
|
+
- `:current` - Task is the active position (marked with `→`)
|
|
37
|
+
- `:future` - Task exists but is after active position (marked with `↯`)
|
|
38
|
+
- **has_branches**: Boolean indicating if multiple children exist (marked with `⎇`)
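A sketch of how one menu entry could be derived from these fields. Comparing IDs to decide `:past` versus `:future` is an assumption, as are the function names and the rendered line format; only the field names, the 80-character summary, and the markers come from the list above.

```ruby
# Hypothetical sketch: builds one menu entry per task from the metadata fields above.
MARKERS = { current: "→", future: "↯", past: " " }.freeze

def task_entry(task_id, message, active_task_id:, task_parents:)
  status =
    if task_id == active_task_id
      :current
    elsif task_id > active_task_id
      :future
    else
      :past
    end
  # A task has branches when more than one task names it as parent.
  has_branches = task_parents.count { |_, parent| parent == task_id } > 1
  { task_id: task_id, summary: message[0, 80], status: status, has_branches: has_branches }
end

def render_entry(entry)
  branch = entry[:has_branches] ? " ⎇" : ""
  "#{MARKERS[entry[:status]]} #{entry[:task_id]}. #{entry[:summary]}#{branch}"
end
```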
|
|
39
|
+
|
|
40
|
+
## Snapshot Strategy
|
|
41
|
+
|
|
42
|
+
### File State Preservation
|
|
43
|
+
|
|
44
|
+
**Complete AFTER-State Snapshots:**
|
|
45
|
+
- After each successful task execution, all modified files are saved
|
|
46
|
+
- Storage location: `~/.octo/snapshots/{session_id}/task-{id}/`
|
|
47
|
+
- Each file is stored with its full relative path from working directory
|
|
48
|
+
- Only files modified during that task are snapshotted
|
|
49
|
+
|
|
50
|
+
**Why AFTER-state instead of BEFORE-state:**
|
|
51
|
+
- Simpler restoration logic (just copy files back)
|
|
52
|
+
- No need to track "what changed" - the snapshot IS the state
|
|
53
|
+
- Easier to verify correctness (snapshot = expected state)
|
|
54
|
+
|
|
55
|
+
**File Restoration Process:**
|
|
56
|
+
- When switching to a task, iterate through all its snapshotted files
|
|
57
|
+
- Copy each file from snapshot directory to working directory
|
|
58
|
+
- File permissions and timestamps are preserved
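The snapshot and restore steps can be sketched with `FileUtils`. The function names and argument order are assumptions; the directory layout follows the `{session_id}/task-{id}/` pattern above, with the snapshot root passed in so the sketch does not depend on `~/.octo`.

```ruby
require "fileutils"

# Hypothetical sketch of AFTER-state snapshots; function names are assumptions.
def snapshot_dir(root, session_id, task_id)
  File.join(root, session_id.to_s, "task-#{task_id}")
end

# Copy each modified file (given relative to working_dir) into the task's snapshot directory.
def snapshot_files(root, session_id, task_id, working_dir, relative_paths)
  dir = snapshot_dir(root, session_id, task_id)
  relative_paths.each do |rel|
    dest = File.join(dir, rel)
    FileUtils.mkdir_p(File.dirname(dest))
    # preserve: true keeps permissions and modification time.
    FileUtils.cp(File.join(working_dir, rel), dest, preserve: true)
  end
  dir
end

# Copy every snapshotted file back, recreating nested directories as needed.
def restore_files(root, session_id, task_id, working_dir)
  dir = snapshot_dir(root, session_id, task_id)
  Dir.glob("**/*", File::FNM_DOTMATCH, base: dir).each do |rel|
    src = File.join(dir, rel)
    next unless File.file?(src)

    dest = File.join(working_dir, rel)
    FileUtils.mkdir_p(File.dirname(dest))
    FileUtils.cp(src, dest, preserve: true)
  end
end
```

Restoring is a plain copy, so running it twice for the same task leaves the working directory in the same state.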
|
|
59
|
+
|
|
60
|
+
### Message Filtering
|
|
61
|
+
|
|
62
|
+
**Active Messages Concept:**
|
|
63
|
+
- Messages array contains ALL messages (past, current, future)
|
|
64
|
+
- `active_messages()` method filters out "future" messages
|
|
65
|
+
- LLM only sees messages with `task_id <= active_task_id`
|
|
66
|
+
- This creates the illusion of time travel without data deletion
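The filter itself is a one-liner. A minimal sketch, assuming each message is a Hash carrying a `:task_id` tag (the message shape is an assumption; the comparison is the rule stated above):

```ruby
# Minimal sketch, assuming each message is a Hash carrying a :task_id tag.
# Returns a filtered copy; the full messages array is left untouched.
def active_messages(messages, active_task_id)
  messages.select { |message| message[:task_id] <= active_task_id }
end
```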
|
|
67
|
+
|
|
68
|
+
**Why Keep All Messages:**
|
|
69
|
+
- Enables redo operations (future messages preserved)
|
|
70
|
+
- Allows branch switching (alternative futures available)
|
|
71
|
+
- Simplifies session serialization (single source of truth)
|
|
72
|
+
|
|
73
|
+
## Session Persistence
|
|
74
|
+
|
|
75
|
+
### State Serialization
|
|
76
|
+
|
|
77
|
+
Time Machine state is saved under the `:time_machine` key in session data:
|
|
78
|
+
- task_parents hash (complete tree structure)
|
|
79
|
+
- current_task_id (latest task number)
|
|
80
|
+
- active_task_id (current viewing position)
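A sketch of a save/load round trip for this state, assuming the session is stored as JSON (the storage format and both function names are assumptions). The detail worth noting is that JSON object keys are strings, so the integer task IDs in `task_parents` have to be converted back on load.

```ruby
require "json"

# Hypothetical sketch; the real session format may differ.
def dump_time_machine(task_parents:, current_task_id:, active_task_id:)
  JSON.generate(
    time_machine: {
      task_parents: task_parents,
      current_task_id: current_task_id,
      active_task_id: active_task_id
    }
  )
end

def load_time_machine(json)
  state = JSON.parse(json, symbolize_names: true).fetch(:time_machine)
  {
    # JSON object keys are strings, so integer task IDs must be converted back.
    task_parents: state[:task_parents].to_h { |id, parent| [id.to_s.to_i, parent] },
    current_task_id: state[:current_task_id],
    active_task_id: state[:active_task_id]
  }
end
```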
|
|
81
|
+
|
|
82
|
+
**Restoration Guarantees:**
|
|
83
|
+
- Complete task tree is rebuilt
|
|
84
|
+
- Active position is restored
|
|
85
|
+
- Snapshot files remain available across sessions
|
|
86
|
+
- User can continue undo/redo from where they left off
|
|
87
|
+
|
|
88
|
+
## Critical Test Scenarios
|
|
89
|
+
|
|
90
|
+
### 1. Basic Undo/Redo Flow
|
|
91
|
+
|
|
92
|
+
**Test Focus:**
|
|
93
|
+
- Sequential task creation increments task IDs correctly
|
|
94
|
+
- Undo moves active_task_id backward (current_task_id unchanged)
|
|
95
|
+
- Redo moves active_task_id forward
|
|
96
|
+
- File snapshots are correctly restored at each step
|
|
97
|
+
- Cannot undo beyond root task (task_id = 0)
|
|
98
|
+
- Cannot redo beyond current_task_id
|
|
99
|
+
|
|
100
|
+
**Edge Cases:**
|
|
101
|
+
- Undoing at root task should fail gracefully
|
|
102
|
+
- Redoing when already at tip should fail gracefully
|
|
103
|
+
- Multiple consecutive undos should work correctly
|
|
104
|
+
|
|
105
|
+
### 2. Branching Scenarios
|
|
106
|
+
|
|
107
|
+
**Test Focus:**
|
|
108
|
+
- After undo, creating new task creates a branch
|
|
109
|
+
- New branch starts from active_task_id, not current_task_id
|
|
110
|
+
- Original future branch is preserved (for potential redo)
|
|
111
|
+
- Parent task is marked with `has_branches: true`
|
|
112
|
+
- Child tasks list should include both branches
|
|
113
|
+
|
|
114
|
+
**Branch Navigation:**
|
|
115
|
+
- Switching between branches restores correct file states
|
|
116
|
+
- Each branch maintains independent history
|
|
117
|
+
- Message filtering correctly shows only relevant messages
|
|
118
|
+
|
|
119
|
+
### 3. Message Filtering and Task IDs
|
|
120
|
+
|
|
121
|
+
**Test Focus:**
|
|
122
|
+
- Every message is tagged with task_id (user, assistant, tool results)
|
|
123
|
+
- Active messages only include those with task_id <= active_task_id
|
|
124
|
+
- LLM never sees "future" messages during undo state
|
|
125
|
+
- After redo, future messages become visible again
|
|
126
|
+
- New tasks created after undo get fresh task IDs (not reused)
|
|
127
|
+
|
|
128
|
+
**Message Consistency:**
|
|
129
|
+
- Tool results are associated with correct task
|
|
130
|
+
- Multi-turn conversations maintain task association
|
|
131
|
+
- Error messages don't break task ID tagging
|
|
132
|
+
|
|
133
|
+
### 4. File Snapshot Integrity
|
|
134
|
+
|
|
135
|
+
**Test Focus:**
|
|
136
|
+
- Only modified files are snapshotted (not entire project)
|
|
137
|
+
- File content is exactly preserved (byte-for-byte)
|
|
138
|
+
- Nested directory structures are correctly recreated
|
|
139
|
+
- Multiple files in single task are all snapshotted
|
|
140
|
+
- Snapshot directory naming prevents collisions
|
|
141
|
+
|
|
142
|
+
**Restoration Accuracy:**
|
|
143
|
+
- After undo + file restore, file content matches expected state
|
|
144
|
+
- Subsequent task execution works with restored files
|
|
145
|
+
- Binary files are handled correctly (not corrupted)
|
|
146
|
+
|
|
147
|
+
### 5. Session Persistence and Recovery
|
|
148
|
+
|
|
149
|
+
**Test Focus:**
|
|
150
|
+
- Save session, restart, restore session preserves Time Machine state
|
|
151
|
+
- Task tree structure is fully rebuilt
|
|
152
|
+
- Active position is correctly restored
|
|
153
|
+
- Snapshot files are accessible after restart
|
|
154
|
+
- Undo/redo operations work identically after restore
|
|
155
|
+
|
|
156
|
+
**Persistence Edge Cases:**
|
|
157
|
+
- Empty task history (new session)
|
|
158
|
+
- Session with complex branching
|
|
159
|
+
- Session saved while in "undo" state (active_task_id < current_task_id)
|
|
160
|
+
|
|
161
|
+
### 6. AI Tool Integration
|
|
162
|
+
|
|
163
|
+
**Test Focus:**
|
|
164
|
+
- Tools are correctly registered in tool registry
|
|
165
|
+
- AI can invoke undo_task, redo_task, list_tasks
|
|
166
|
+
- Agent parameter is correctly injected (similar to TodoManager pattern)
|
|
167
|
+
- Tool execution returns success/failure messages
|
|
168
|
+
- Tools respect permission modes (confirm_all, auto_approve, etc.)
|
|
169
|
+
|
|
170
|
+
**Tool Interaction:**
|
|
171
|
+
- AI calling undo_task modifies agent state correctly
|
|
172
|
+
- Subsequent AI responses use filtered messages
|
|
173
|
+
- Tool results are included in task history
|
|
174
|
+
- Multiple tool calls in sequence work correctly
|
|
175
|
+
|
|
176
|
+
### 7. UI and User Interaction
|
|
177
|
+
|
|
178
|
+
**Test Focus:**
|
|
179
|
+
- ESC key triggers time machine menu
|
|
180
|
+
- `/undo` command works identically to ESC
|
|
181
|
+
- Menu displays correct task list with status indicators
|
|
182
|
+
- Visual markers: `→` current, `↯` future, `⎇` branches
|
|
183
|
+
- User selection triggers correct task switch
|
|
184
|
+
- Menu updates after undo/redo operations
|
|
185
|
+
|
|
186
|
+
**User Experience:**
|
|
187
|
+
- Task summaries are readable (truncated to 80 chars)
|
|
188
|
+
- Menu is responsive with large task histories
|
|
189
|
+
- Cancel/exit returns to normal operation
|
|
190
|
+
- Error messages are clear and actionable
|
|
191
|
+
|
|
192
|
+
### 8. Integration with Existing Features
|
|
193
|
+
|
|
194
|
+
**Test Focus:**
|
|
195
|
+
- Works with message compression (no dependency on tool_calls)
|
|
196
|
+
- Compatible with session serialization
|
|
197
|
+
- Doesn't interfere with cost tracking
|
|
198
|
+
- Works with both UI modes (UI1 and UI2)
|
|
199
|
+
- Subagent forking doesn't inherit Time Machine state
|
|
200
|
+
|
|
201
|
+
**Feature Compatibility:**
|
|
202
|
+
- Todo manager works normally during undo state
|
|
203
|
+
- Web search tools work correctly
|
|
204
|
+
- File tools (write, edit) trigger snapshots
|
|
205
|
+
- Shell commands can be undone via file snapshots
|
|
206
|
+
|
|
207
|
+
## Design Principles
|
|
208
|
+
|
|
209
|
+
### Minimal Invasiveness
|
|
210
|
+
- Only 3 new instance variables in Agent class
|
|
211
|
+
- No changes to core message structure (only adds task_id field)
|
|
212
|
+
- Existing tools unaware of Time Machine existence
|
|
213
|
+
- No performance impact when not in use
|
|
214
|
+
|
|
215
|
+
### Data Integrity
|
|
216
|
+
- Never delete messages or snapshots (immutable history)
|
|
217
|
+
- File restoration is idempotent (can redo multiple times)
|
|
218
|
+
- Task IDs never reused (prevents confusion)
|
|
219
|
+
- Snapshot isolation (each task has independent directory)
|
|
220
|
+
|
|
221
|
+
### User Control
|
|
222
|
+
- Explicit user action required (ESC or /undo)
|
|
223
|
+
- Clear visual feedback on current position
|
|
224
|
+
- Cannot accidentally lose work (future preserved)
|
|
225
|
+
- Can explore branches without commitment
|
|
226
|
+
|
|
227
|
+
### Developer Friendly
|
|
228
|
+
- Simple tree data structure (easy to reason about)
|
|
229
|
+
- Comprehensive test coverage (55 test cases)
|
|
230
|
+
- Clear separation of concerns (module-based design)
|
|
231
|
+
- Well-documented edge cases
|
|
232
|
+
|
|
233
|
+
## Future Enhancement Possibilities
|
|
234
|
+
|
|
235
|
+
### Potential Improvements
|
|
236
|
+
- Automatic snapshot garbage collection (old sessions)
|
|
237
|
+
- Diff view between task states
|
|
238
|
+
- Named checkpoints (user-defined bookmarks)
|
|
239
|
+
- Merge branches functionality
|
|
240
|
+
- Export task history as replay script
|
|
241
|
+
- Snapshot compression for large files
|
|
242
|
+
|
|
243
|
+
### Scalability Considerations
|
|
244
|
+
- Large file handling (incremental snapshots)
|
|
245
|
+
- Long session histories (pagination in UI)
|
|
246
|
+
- Multiple simultaneous branches (better visualization)
|
|
247
|
+
- Remote collaboration (shared task history)
|