RubyGems - octo-agent - Versions diffs - 0.11.2 - Mend

octo-agent 0.11.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (319)

checksums.yaml +7 -0
data/.clacky/skills/commit/SKILL.md +423 -0
data/.clacky/skills/gem-release/SKILL.md +199 -0
data/.clacky/skills/gem-release/scripts/release.sh +304 -0
data/.clacky/skills/oss-upload/SKILL.md +47 -0
data/.octorules +106 -0
data/.rspec +3 -0
data/.rubocop.yml +8 -0
data/CHANGELOG.md +76 -0
data/CODE_OF_CONDUCT.md +132 -0
data/CONTRIBUTING.md +92 -0
data/Dockerfile +28 -0
data/LICENSE.txt +22 -0
data/POSITIONING.md +46 -0
data/README.md +134 -0
data/README_CN.md +134 -0
data/Rakefile +34 -0
data/benchmark/fixtures/sample_project/Gemfile +3 -0
data/benchmark/fixtures/sample_project/lib/api_handler.rb +32 -0
data/benchmark/fixtures/sample_project/lib/order_calculator.rb +23 -0
data/benchmark/fixtures/sample_project/lib/user_renderer.rb +20 -0
data/benchmark/fixtures/sample_project/spec/order_calculator_spec.rb +20 -0
data/benchmark/results/EVALUATION_REPORT.md +165 -0
data/benchmark/results/baseline_20260511_174424.json +128 -0
data/benchmark/results/report_20260511_175256.json +271 -0
data/benchmark/results/report_20260511_175444.json +271 -0
data/benchmark/results/treatment_20260511_175103.json +130 -0
data/benchmark/runner.rb +441 -0
data/bin/octo +7 -0
data/docs/agent-first-ui-design.md +77 -0
data/docs/billing-system.md +318 -0
data/docs/channel-architecture.md +235 -0
data/docs/engineering-article.md +343 -0
data/docs/session-skill-invocation.md +69 -0
data/docs/time_machine_design.md +247 -0
data/docs/ui2-architecture.md +124 -0
data/homebrew/README.md +96 -0
data/homebrew/openocto.rb +24 -0
data/lib/octo/agent/hook_manager.rb +61 -0
data/lib/octo/agent/llm_caller.rb +800 -0
data/lib/octo/agent/memory_updater.rb +246 -0
data/lib/octo/agent/message_compressor.rb +225 -0
data/lib/octo/agent/message_compressor_helper.rb +869 -0
data/lib/octo/agent/next_message_suggester.rb +215 -0
data/lib/octo/agent/session_serializer.rb +685 -0
data/lib/octo/agent/skill_auto_creator.rb +114 -0
data/lib/octo/agent/skill_evolution.rb +61 -0
data/lib/octo/agent/skill_manager.rb +466 -0
data/lib/octo/agent/skill_reflector.rb +89 -0
data/lib/octo/agent/system_prompt_builder.rb +101 -0
data/lib/octo/agent/time_machine.rb +214 -0
data/lib/octo/agent/tool_executor.rb +454 -0
data/lib/octo/agent/tool_registry.rb +150 -0
data/lib/octo/agent.rb +2180 -0
data/lib/octo/agent_config.rb +989 -0
data/lib/octo/agent_profile.rb +112 -0
data/lib/octo/anthropic_stream_aggregator.rb +137 -0
data/lib/octo/background_task_registry.rb +324 -0
data/lib/octo/banner.rb +34 -0
data/lib/octo/bedrock_stream_aggregator.rb +137 -0
data/lib/octo/block_font.rb +331 -0
data/lib/octo/cli.rb +968 -0
data/lib/octo/client.rb +623 -0
data/lib/octo/default_agents/SOUL.md +3 -0
data/lib/octo/default_agents/USER.md +1 -0
data/lib/octo/default_agents/base_prompt.md +66 -0
data/lib/octo/default_agents/coding/profile.yml +2 -0
data/lib/octo/default_agents/coding/system_prompt.md +67 -0
data/lib/octo/default_agents/general/profile.yml +2 -0
data/lib/octo/default_agents/general/system_prompt.md +16 -0
data/lib/octo/default_parsers/doc_parser.rb +69 -0
data/lib/octo/default_parsers/docx_parser.rb +188 -0
data/lib/octo/default_parsers/pdf_parser.rb +120 -0
data/lib/octo/default_parsers/pdf_parser_ocr.py +103 -0
data/lib/octo/default_parsers/pdf_parser_plumber.py +62 -0
data/lib/octo/default_parsers/pptx_parser.rb +140 -0
data/lib/octo/default_parsers/xlsx_parser.rb +121 -0
data/lib/octo/default_skills/browser-setup/SKILL.md +426 -0
data/lib/octo/default_skills/channel-manager/SKILL.md +623 -0
data/lib/octo/default_skills/channel-manager/dingtalk_setup.rb +191 -0
data/lib/octo/default_skills/channel-manager/discord_setup.rb +199 -0
data/lib/octo/default_skills/channel-manager/feishu_setup.rb +574 -0
data/lib/octo/default_skills/channel-manager/import_lark_skills.rb +97 -0
data/lib/octo/default_skills/channel-manager/install_feishu_skills.rb +105 -0
data/lib/octo/default_skills/channel-manager/weixin_setup.rb +274 -0
data/lib/octo/default_skills/code-explorer/SKILL.md +36 -0
data/lib/octo/default_skills/cron-task-creator/SKILL.md +257 -0
data/lib/octo/default_skills/cron-task-creator/evals/evals.json +38 -0
data/lib/octo/default_skills/onboard/SKILL.md +578 -0
data/lib/octo/default_skills/onboard/scripts/import_external_skills.rb +413 -0
data/lib/octo/default_skills/onboard/scripts/install_builtin_skills.rb +97 -0
data/lib/octo/default_skills/persist-memory/SKILL.md +59 -0
data/lib/octo/default_skills/personal-website/SKILL.md +113 -0
data/lib/octo/default_skills/personal-website/publish.rb +235 -0
data/lib/octo/default_skills/product-help/SKILL.md +123 -0
data/lib/octo/default_skills/product-help/docs/agent-config.md +74 -0
data/lib/octo/default_skills/product-help/docs/best-practices.md +49 -0
data/lib/octo/default_skills/product-help/docs/browser-tool.md +53 -0
data/lib/octo/default_skills/product-help/docs/built-in-skills.md +43 -0
data/lib/octo/default_skills/product-help/docs/cli-reference.md +82 -0
data/lib/octo/default_skills/product-help/docs/create-your-first-skill.md +47 -0
data/lib/octo/default_skills/product-help/docs/faq.md +98 -0
data/lib/octo/default_skills/product-help/docs/how-to-use-a-skill.md +58 -0
data/lib/octo/default_skills/product-help/docs/installation.md +59 -0
data/lib/octo/default_skills/product-help/docs/memory-system.md +61 -0
data/lib/octo/default_skills/product-help/docs/octorules.md +62 -0
data/lib/octo/default_skills/product-help/docs/session-management.md +63 -0
data/lib/octo/default_skills/product-help/docs/skill-basics.md +55 -0
data/lib/octo/default_skills/product-help/docs/skill-frontmatter.md +61 -0
data/lib/octo/default_skills/product-help/docs/web-server.md +49 -0
data/lib/octo/default_skills/product-help/docs/what-is-octo.md +37 -0
data/lib/octo/default_skills/product-help/docs/windows-installation.md +36 -0
data/lib/octo/default_skills/product-help/docs/writing-tips.md +53 -0
data/lib/octo/default_skills/recall-memory/SKILL.md +65 -0
data/lib/octo/default_skills/skill-add/SKILL.md +59 -0
data/lib/octo/default_skills/skill-add/scripts/install_from_zip.rb +295 -0
data/lib/octo/default_skills/skill-creator/SKILL.md +602 -0
data/lib/octo/default_skills/skill-creator/agents/analyzer.md +274 -0
data/lib/octo/default_skills/skill-creator/agents/comparator.md +202 -0
data/lib/octo/default_skills/skill-creator/agents/grader.md +223 -0
data/lib/octo/default_skills/skill-creator/eval-viewer/generate_review.py +471 -0
data/lib/octo/default_skills/skill-creator/eval-viewer/viewer.html +1325 -0
data/lib/octo/default_skills/skill-creator/references/schemas.md +430 -0
data/lib/octo/default_skills/skill-creator/scripts/__init__.py +0 -0
data/lib/octo/default_skills/skill-creator/scripts/aggregate_benchmark.py +401 -0
data/lib/octo/default_skills/skill-creator/scripts/generate_report.py +326 -0
data/lib/octo/default_skills/skill-creator/scripts/improve_description.py +310 -0
data/lib/octo/default_skills/skill-creator/scripts/quick_validate.py +103 -0
data/lib/octo/default_skills/skill-creator/scripts/run_eval.py +317 -0
data/lib/octo/default_skills/skill-creator/scripts/run_loop.py +331 -0
data/lib/octo/default_skills/skill-creator/scripts/utils.py +47 -0
data/lib/octo/default_skills/skill-creator/scripts/validate_skill_frontmatter.rb +143 -0
data/lib/octo/idle_compression_timer.rb +115 -0
data/lib/octo/json_ui_controller.rb +204 -0
data/lib/octo/message_format/anthropic.rb +409 -0
data/lib/octo/message_format/bedrock.rb +361 -0
data/lib/octo/message_format/open_ai.rb +222 -0
data/lib/octo/message_history.rb +373 -0
data/lib/octo/openai_stream_aggregator.rb +130 -0
data/lib/octo/plain_ui_controller.rb +166 -0
data/lib/octo/providers.rb +534 -0
data/lib/octo/server/browser_manager.rb +397 -0
data/lib/octo/server/channel/adapters/base.rb +82 -0
data/lib/octo/server/channel/adapters/dingtalk/adapter.rb +314 -0
data/lib/octo/server/channel/adapters/dingtalk/api_client.rb +391 -0
data/lib/octo/server/channel/adapters/dingtalk/stream_client.rb +203 -0
data/lib/octo/server/channel/adapters/discord/adapter.rb +229 -0
data/lib/octo/server/channel/adapters/discord/api_client.rb +107 -0
data/lib/octo/server/channel/adapters/discord/gateway_client.rb +270 -0
data/lib/octo/server/channel/adapters/feishu/adapter.rb +320 -0
data/lib/octo/server/channel/adapters/feishu/bot.rb +478 -0
data/lib/octo/server/channel/adapters/feishu/file_processor.rb +36 -0
data/lib/octo/server/channel/adapters/feishu/message_parser.rb +129 -0
data/lib/octo/server/channel/adapters/feishu/ws_client.rb +423 -0
data/lib/octo/server/channel/adapters/telegram/adapter.rb +375 -0
data/lib/octo/server/channel/adapters/telegram/api_client.rb +205 -0
data/lib/octo/server/channel/adapters/wecom/adapter.rb +148 -0
data/lib/octo/server/channel/adapters/wecom/media_downloader.rb +115 -0
data/lib/octo/server/channel/adapters/wecom/ws_client.rb +395 -0
data/lib/octo/server/channel/adapters/weixin/adapter.rb +692 -0
data/lib/octo/server/channel/adapters/weixin/api_client.rb +402 -0
data/lib/octo/server/channel/channel_config.rb +178 -0
data/lib/octo/server/channel/channel_manager.rb +468 -0
data/lib/octo/server/channel/channel_ui_controller.rb +224 -0
data/lib/octo/server/channel.rb +33 -0
data/lib/octo/server/discover.rb +77 -0
data/lib/octo/server/epipe_safe_io.rb +105 -0
data/lib/octo/server/http_server.rb +3554 -0
data/lib/octo/server/scheduler.rb +317 -0
data/lib/octo/server/server_master.rb +325 -0
data/lib/octo/server/session_registry.rb +431 -0
data/lib/octo/server/web_ui_controller.rb +487 -0
data/lib/octo/session_manager.rb +385 -0
data/lib/octo/skill.rb +466 -0
data/lib/octo/skill_loader.rb +328 -0
data/lib/octo/tools/base.rb +118 -0
data/lib/octo/tools/browser.rb +625 -0
data/lib/octo/tools/edit.rb +165 -0
data/lib/octo/tools/file_reader.rb +549 -0
data/lib/octo/tools/glob.rb +162 -0
data/lib/octo/tools/grep.rb +356 -0
data/lib/octo/tools/invoke_skill.rb +96 -0
data/lib/octo/tools/list_tasks.rb +54 -0
data/lib/octo/tools/redo_task.rb +41 -0
data/lib/octo/tools/request_user_feedback.rb +84 -0
data/lib/octo/tools/security.rb +333 -0
data/lib/octo/tools/terminal/output_cleaner.rb +63 -0
data/lib/octo/tools/terminal/persistent_session.rb +268 -0
data/lib/octo/tools/terminal/safe_rm.sh +106 -0
data/lib/octo/tools/terminal/session_manager.rb +213 -0
data/lib/octo/tools/terminal.rb +1828 -0
data/lib/octo/tools/todo_manager.rb +374 -0
data/lib/octo/tools/trash_manager.rb +388 -0
data/lib/octo/tools/undo_task.rb +35 -0
data/lib/octo/tools/web_fetch.rb +242 -0
data/lib/octo/tools/web_search.rb +260 -0
data/lib/octo/tools/write.rb +77 -0
data/lib/octo/ui2/block_font.rb +10 -0
data/lib/octo/ui2/components/base_component.rb +163 -0
data/lib/octo/ui2/components/command_suggestions.rb +290 -0
data/lib/octo/ui2/components/common_component.rb +96 -0
data/lib/octo/ui2/components/inline_input.rb +226 -0
data/lib/octo/ui2/components/input_area.rb +1338 -0
data/lib/octo/ui2/components/message_component.rb +99 -0
data/lib/octo/ui2/components/modal_component.rb +419 -0
data/lib/octo/ui2/components/todo_area.rb +149 -0
data/lib/octo/ui2/components/tool_component.rb +107 -0
data/lib/octo/ui2/components/welcome_banner.rb +139 -0
data/lib/octo/ui2/layout_manager.rb +807 -0
data/lib/octo/ui2/line_editor.rb +363 -0
data/lib/octo/ui2/markdown_renderer.rb +100 -0
data/lib/octo/ui2/output_buffer.rb +370 -0
data/lib/octo/ui2/progress_handle.rb +362 -0
data/lib/octo/ui2/progress_indicator.rb +55 -0
data/lib/octo/ui2/screen_buffer.rb +273 -0
data/lib/octo/ui2/terminal_detector.rb +119 -0
data/lib/octo/ui2/theme_manager.rb +85 -0
data/lib/octo/ui2/themes/base_theme.rb +105 -0
data/lib/octo/ui2/themes/hacker_theme.rb +62 -0
data/lib/octo/ui2/themes/minimal_theme.rb +56 -0
data/lib/octo/ui2/thinking_verbs.rb +26 -0
data/lib/octo/ui2/ui_controller.rb +1625 -0
data/lib/octo/ui2/view_renderer.rb +177 -0
data/lib/octo/ui2.rb +40 -0
data/lib/octo/ui_interface.rb +154 -0
data/lib/octo/utils/arguments_parser.rb +191 -0
data/lib/octo/utils/browser_detector.rb +195 -0
data/lib/octo/utils/encoding.rb +92 -0
data/lib/octo/utils/environment_detector.rb +140 -0
data/lib/octo/utils/file_ignore_helper.rb +170 -0
data/lib/octo/utils/file_processor.rb +601 -0
data/lib/octo/utils/gitignore_parser.rb +154 -0
data/lib/octo/utils/limit_stack.rb +152 -0
data/lib/octo/utils/logger.rb +124 -0
data/lib/octo/utils/login_shell.rb +72 -0
data/lib/octo/utils/model_pricing.rb +646 -0
data/lib/octo/utils/parser_manager.rb +165 -0
data/lib/octo/utils/path_helper.rb +15 -0
data/lib/octo/utils/scripts_manager.rb +59 -0
data/lib/octo/utils/string_matcher.rb +158 -0
data/lib/octo/utils/trash_directory.rb +112 -0
data/lib/octo/utils/workspace_rules.rb +46 -0
data/lib/octo/version.rb +5 -0
data/lib/octo/web/app.css +7141 -0
data/lib/octo/web/app.js +543 -0
data/lib/octo/web/apple-touch-icon.png +0 -0
data/lib/octo/web/auth.js +150 -0
data/lib/octo/web/channels.js +276 -0
data/lib/octo/web/datepicker.js +205 -0
data/lib/octo/web/favicon.png +0 -0
data/lib/octo/web/i18n.js +1073 -0
data/lib/octo/web/icon-512.png +0 -0
data/lib/octo/web/icon-dark.svg +25 -0
data/lib/octo/web/icon.svg +29 -0
data/lib/octo/web/index.html +871 -0
data/lib/octo/web/marked.min.js +69 -0
data/lib/octo/web/onboard.js +491 -0
data/lib/octo/web/profile.js +442 -0
data/lib/octo/web/sessions.js +4421 -0
data/lib/octo/web/settings.js +913 -0
data/lib/octo/web/sidebar.js +32 -0
data/lib/octo/web/skills.js +885 -0
data/lib/octo/web/tasks.js +297 -0
data/lib/octo/web/theme.js +105 -0
data/lib/octo/web/trash.js +343 -0
data/lib/octo/web/vendor/hljs/highlight.min.js +1244 -0
data/lib/octo/web/vendor/hljs/hljs-theme.css +95 -0
data/lib/octo/web/vendor/katex/auto-render.min.js +1 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_AMS-Regular.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Caligraphic-Bold.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Caligraphic-Regular.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Fraktur-Bold.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Fraktur-Regular.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Main-Bold.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Main-BoldItalic.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Main-Italic.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Main-Regular.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Math-BoldItalic.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Math-Italic.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_SansSerif-Bold.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_SansSerif-Italic.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_SansSerif-Regular.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Script-Regular.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Size1-Regular.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Size2-Regular.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Size3-Regular.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Size4-Regular.woff2 +0 -0
data/lib/octo/web/vendor/katex/fonts/KaTeX_Typewriter-Regular.woff2 +0 -0
data/lib/octo/web/vendor/katex/katex.min.css +1 -0
data/lib/octo/web/vendor/katex/katex.min.js +1 -0
data/lib/octo/web/version.js +449 -0
data/lib/octo/web/weixin-qr.html +209 -0
data/lib/octo/web/ws-dispatcher.js +357 -0
data/lib/octo/web/ws.js +128 -0
data/lib/octo.rb +145 -0
data/scripts/build/build.sh +329 -0
data/scripts/build/lib/apt.sh +56 -0
data/scripts/build/lib/brew.sh +89 -0
data/scripts/build/lib/colors.sh +17 -0
data/scripts/build/lib/gem.sh +95 -0
data/scripts/build/lib/mise.sh +125 -0
data/scripts/build/lib/network.sh +157 -0
data/scripts/build/lib/os.sh +57 -0
data/scripts/build/lib/shell.sh +37 -0
data/scripts/build/src/install.sh.cc +174 -0
data/scripts/build/src/install_browser.sh.cc +101 -0
data/scripts/build/src/install_full.sh.cc +290 -0
data/scripts/build/src/install_rails_deps.sh.cc +145 -0
data/scripts/build/src/install_system_deps.sh.cc +123 -0
data/scripts/build/src/uninstall.sh.cc +101 -0
data/scripts/install.ps1 +532 -0
data/scripts/install.sh +567 -0
data/scripts/install_browser.sh +479 -0
data/scripts/install_full.sh +838 -0
data/scripts/install_rails_deps.sh +746 -0
data/scripts/install_system_deps.sh +518 -0
data/scripts/uninstall.sh +287 -0
data/sig/octo.rbs +4 -0
metadata +614 -0

data/lib/octo/default_skills/skill-creator/SKILL.md ADDED

@@ -0,0 +1,602 @@
+---
+name: skill-creator
+description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
+---
+# Skill Creator
+A skill for creating new skills and iteratively improving them.
+## Usage Modes
+This skill supports two modes:
+### 1. Interactive Mode (default)
+The full workflow with user interviews, test cases, and iteration cycles.
+Use when creating or refining skills manually.
+At a high level, the process of creating a skill goes like this:
+- Decide what you want the skill to do and roughly how it should do it
+- Write a draft of the skill
+- Create a few test prompts and simulate running them (with vs. without the skill instructions)
+- Help the user evaluate the results both qualitatively and quantitatively
+  - While reviewing, draft quantitative assertions if there aren't any
+  - Use `eval-viewer/generate_review.py` to generate a static HTML viewer for the user to review results and leave feedback
+- Rewrite the skill based on the user's feedback
+- Repeat until satisfied
+Your job is to figure out where the user is in this process and jump in to help them progress through these stages. Maybe they say "I want to make a skill for X" — help narrow down the intent, write a draft, write test cases, evaluate, and repeat. Or maybe they already have a draft — go straight to the eval/iterate part.
+Always be flexible. If the user says "skip the evals, just vibe with me", do that instead.
+### 2. Quick Mode (for agent self-evolution)
+**Trigger**: When invoked with `mode: "quick"` in the task arguments.
+Fast, opinionated skill creation without user interaction. This mode is used by the agent's self-evolution system to automatically create or improve skills.
+**Behavior**:
+- Skip user interviews and detailed requirements gathering
+- Extract workflow pattern from provided context
+- Write a minimal but functional SKILL.md
+- Save to `~/.octo/skills/auto-<name>-<timestamp>/` (or improve existing skill in place)
+- Skip test cases and evals (user can refine later if needed)
+- Always validate frontmatter with the validator script after creation
+- Focus on the happy path; edge cases can be added later
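As a rough illustration of the `auto-<name>-<timestamp>` save location listed above, here is a minimal Ruby sketch. The helper name `auto_skill_dir`, the slug rules, and the `YYYYMMDDHHMMSS` timestamp format are assumptions made for this example; the diff does not show how Octo actually formats the timestamp.

```ruby
# Hypothetical helper: builds a quick-mode skill directory path.
# The timestamp format (YYYYMMDDHHMMSS) is an assumption, not taken from Octo.
def auto_skill_dir(name, now: Time.now, root: File.join(Dir.home, ".octo", "skills"))
  # Lowercase, collapse anything that is not a letter or digit into hyphens,
  # then trim hyphens from both ends.
  slug = name.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
  raise ArgumentError, "name must contain letters or digits" if slug.empty?

  File.join(root, "auto-#{slug}-#{now.strftime('%Y%m%d%H%M%S')}")
end

puts auto_skill_dir("URL Summarizer", now: Time.utc(2026, 5, 11, 17, 44, 24), root: "/tmp/skills")
# prints /tmp/skills/auto-url-summarizer-20260511174424
```

Passing `now:` and `root:` explicitly keeps the helper deterministic, which makes it easy to check without touching the real `~/.octo/skills/` directory.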
+**Expected arguments when using quick mode**:
+- `task`: Clear description of what to automate and how (be specific about workflow steps)
+- `mode`: Must be set to `"quick"`
+- `suggested_name`: (optional) Proposed skill identifier (lowercase, hyphens OK)
+**Quick mode principles**:
+- **Be opinionated**: Make reasonable assumptions without asking
+- **Be concise**: Keep instructions simple and focused
+- **Be practical**: Focus on the core workflow that will save the most time
+- **Be correct**: Always set `disable-model-invocation: false` and `user-invocable: true`
+- **Be validating**: Run the frontmatter validator immediately after creation
+**Example invocation from the agent's self-evolution system**:
+```
+invoke_skill(
+  skill_name: "skill-creator",
+  task: "Create a skill to extract and summarize content from URLs. The skill should: 1) fetch the URL using terminal with curl, 2) parse the HTML to extract main text content, 3) generate a concise markdown summary. Expected input: URL string. Expected output: markdown summary with title and key points.",
+  mode: "quick",
+  suggested_name: "url-summarizer"
+)
+```
+---
+## Platform Context: Octo
+This skill runs inside **Octo** (octo). Key platform specifics:
+- **Skills** live at `~/.octo/skills/<skill-name>/` — **always create new skills here** (global user skills, visible to Web UI and all sessions). To locate an existing skill, check these paths in order using `glob` or `ls`: (1) `.octo/skills/` — project-level skills, (2) `~/.octo/skills/` — user-level skills. Built-in skills (shipped with the gem) are always available via `invoke_skill` by name — no file lookup needed. Never use `find /` or broad filesystem searches to locate skills.
+- **No parallel subagents** — Octo runs as a single agent; all test cases execute serially in the current session
+- **No external agent CLI** — for evals, just execute the task directly in-session (read the skill, follow instructions, save outputs)
+- **Scripts** — prefer **Ruby** (`.rb` files); Octo is Ruby-native. Run with `ruby path/to/script.rb`. Python is available but Ruby is the default choice
+- **`python3`** — if Python scripts are needed (e.g., `generate_review.py`), use `python3` explicitly
+- The description optimization scripts (`run_loop.py`, `run_eval.py`) work in Octo — they use `octo agent --json` to detect `invoke_skill` events. See the Description Optimization section for usage
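The lookup order for existing skills described above (project-level first, then user-level) could be sketched in Ruby roughly as follows. The method `locate_skill` and its keyword arguments are hypothetical names for this example and are not part of Octo.

```ruby
# Hypothetical lookup helper; only the search order comes from the text above.
def locate_skill(name, project_root: Dir.pwd, home: Dir.home)
  candidates = [
    File.join(project_root, ".octo", "skills", name), # (1) project-level skills
    File.join(home, ".octo", "skills", name)          # (2) user-level skills
  ]
  # Return the first directory that actually contains a SKILL.md, or nil.
  candidates.find { |dir| File.file?(File.join(dir, "SKILL.md")) }
end
```

A `nil` result here would mean the skill is either built in (reachable by name through `invoke_skill`) or not installed; the sketch deliberately avoids any broad filesystem search.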
+---
+## Communicating with the user
+Pay attention to context cues to understand how technical the user is. In general:
+- "evaluation" and "benchmark" are fine
+- For "JSON" and "assertion" — explain briefly if you're unsure the user knows these terms
+It's always OK to briefly explain a term if you're in doubt.
+---
+## Creating a skill
+### Capture Intent
+Start by understanding what the user wants. If the current conversation already shows a workflow they want to capture (tools used, sequence of steps, corrections made, input/output formats), extract answers from history first — the user may just need to fill gaps and confirm.
+1. What should this skill enable Octo to do?
+2. When should this skill trigger? (what phrases/contexts)
+3. What's the expected output format?
+4. Should we set up test cases? Skills with objectively verifiable outputs (file transforms, data extraction, code generation) benefit from test cases. Skills with subjective outputs (writing style, creative work) often don't need them.
+### Interview and Research
+Ask about edge cases, input/output formats, example files, success criteria, and dependencies before writing test prompts. Come prepared with context to reduce burden on the user.
+### Write the SKILL.md
+Components to fill in:
+- **name**: Skill identifier (lowercase, hyphens OK)
+- **description**: Primary triggering mechanism — include BOTH what the skill does AND specific contexts for when to use it. All "when to use" info goes here, not in the body. Make the description a little "pushy" — err toward over-triggering rather than under-triggering. Example: instead of "Helps with dashboard creation", write "Helps with dashboard creation. Use this skill whenever the user mentions dashboards, data visualization, or wants to display any kind of data, even if they don't explicitly say 'dashboard'."
+- **disable-model-invocation**: Set to `false` (always include this)
+- **user-invocable**: Set to `true` to make the skill appear in the Web UI chatbox `/` command list. **Always include this**: without it, users cannot manually invoke the skill from the Octo Web UI session chat.
+- **compatibility** (optional): Required tools or dependencies
+- **Body**: The actual instructions
+> **Octo-specific**: Every skill MUST include `disable-model-invocation: false` and `user-invocable: true` in the YAML frontmatter, or it will be invisible in the Web UI `/` command list. The minimal valid frontmatter is:
+> ```yaml
+> ---
+> name: my-skill
+> description: 'Your description here. Avoid colons followed by a space (like "wants to: do X") inside the description — they break YAML parsing and the skill will silently fail to load. Wrap the entire description in single quotes to be safe, or rephrase to avoid the colon pattern.'
+> disable-model-invocation: false
+> user-invocable: true
+> ---
+> ```
+>
+> **YAML description gotcha**: If the description contains `word: value` patterns (a colon followed by a space), YAML reads the colon as a key-value separator and the frontmatter fails to parse, so the skill silently fails to load. Always wrap description values in single quotes. Inside a single-quoted string, double quotes are literal and safe, but an embedded single quote must be doubled (`''`) or rephrased away.
+> **After writing SKILL.md — always validate and auto-fix**: Run this immediately after creating or updating any skill file:
+> ```bash
+> ruby SKILL_DIR/scripts/validate_skill_frontmatter.rb /path/to/new-skill/SKILL.md
+> ```
+> The script validates the YAML frontmatter and auto-fixes common issues (unquoted descriptions, multi-line block scalars with colons). If it prints `OK:` — you're done. If it prints `Auto-fixed and saved` — it repaired the file automatically. If it prints `ERROR` — manual fix required.
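To make the YAML colon gotcha concrete, the following Ruby sketch parses the same frontmatter with and without quoting. It uses only the standard `yaml` library; `parse_frontmatter` is a hypothetical helper written for this example and is not the bundled `validate_skill_frontmatter.rb`.

```ruby
require "yaml"

# Hypothetical helper: returns the frontmatter as a Hash, or nil when the
# block is missing or the YAML is invalid (mimicking a silent load failure).
def parse_frontmatter(text)
  match = text.match(/\A---\s*\n(.*?)\n---\s*$/m)
  return nil unless match

  YAML.safe_load(match[1])
rescue Psych::SyntaxError
  nil
end

unquoted = "---\nname: my-skill\ndescription: Use when the user wants to: do X\n---\n"
quoted   = "---\nname: my-skill\ndescription: 'Use when the user wants to: do X'\n---\n"

p parse_frontmatter(unquoted)              # nil: the second ": " breaks the plain scalar
p parse_frontmatter(quoted)["description"] # "Use when the user wants to: do X"
```

The only difference between the two inputs is the pair of single quotes around the description, which is why quoting is the safe default.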
+### Skill Writing Guide
+#### Anatomy of a Skill
+Skills are created at `~/.octo/skills/<skill-name>/`:
+```
+~/.octo/skills/skill-name/
+├── SKILL.md (required)
+│   ├── YAML frontmatter (name, description required)
+│   └── Markdown instructions
+└── Bundled Resources (optional)
+    ├── scripts/    - Executable code (prefer .rb Ruby scripts)
+    ├── references/ - Docs loaded into context as needed
+    └── assets/     - Files used in output (templates, icons, fonts)
+```
+#### Progressive Disclosure
+Skills use a three-level loading system:
+1. **Metadata** (name + description) — Always in context (~100 words)
+2. **SKILL.md body** — In context whenever skill triggers (<500 lines ideal)
+3. **Bundled resources** — Loaded as needed (unlimited)
+**Key patterns:**
+- Keep SKILL.md under 500 lines; if approaching the limit, extract content into `references/` files and add clear pointers
+- Reference files from SKILL.md with guidance on when to read them
+- For large reference files (>300 lines), include a table of contents
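The size guidance above (SKILL.md under 500 lines, a table of contents for reference files over 300 lines) can be checked mechanically. This is a small Ruby sketch under those two thresholds; `size_warnings` and the shape of its input (a hash of path to line count) are assumptions for this example.

```ruby
# Thresholds come from the guidance above; the helper itself is hypothetical.
SKILL_BODY_LIMIT = 500
REFERENCE_TOC_THRESHOLD = 300

# line_counts maps a relative file path to its number of lines.
def size_warnings(line_counts)
  line_counts.filter_map do |path, lines|
    if File.basename(path) == "SKILL.md" && lines >= SKILL_BODY_LIMIT
      "#{path}: #{lines} lines; move detail into references/"
    elsif path.include?("references/") && lines > REFERENCE_TOC_THRESHOLD
      "#{path}: #{lines} lines; add a table of contents"
    end
  end
end

puts size_warnings("SKILL.md" => 520, "references/rails.md" => 350, "references/django.md" => 120)
```

In practice the line counts could come from `File.foreach(path).count` for each file in the skill directory.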
+**Domain organization** — When a skill supports multiple frameworks/domains, organize by variant:
+```
+my-skill/
+├── SKILL.md (workflow + which reference to load)
+└── references/
+    ├── rails.md
+    ├── django.md
+    └── express.md
+```
+#### Bundled Scripts (Ruby preferred)
+When a skill needs to execute code — API calls, file processing, data transforms — bundle a Ruby script instead of writing inline shell commands. This is cleaner, reusable, and more maintainable.
+**Ruby script template:**
+```ruby
+#!/usr/bin/env ruby
+# skill-name/scripts/do_something.rb
+# Usage: ruby path/to/do_something.rb [args]
+require 'net/http'
+require 'json'
+require 'fileutils'
+# Read args
+input = ARGV[0]
+if input.nil? || input.strip.empty?
+  warn "Usage: ruby do_something.rb <input>"
+  exit 1
+end
+# ... logic ...
+result = input.strip  # placeholder so the template runs; replace with real logic
+puts result  # stdout is the output
+```
+Invoke from SKILL.md by referencing the script via the Supporting Files block — at runtime, the AI receives the full absolute path of every supporting file. Refer to it as `SKILL_DIR` in instructions so the AI substitutes the correct path from the Supporting Files list:
+```bash
+ruby "SKILL_DIR/scripts/do_something.rb" "argument"
+```
+Never hardcode paths like `~/.octo/skills/my-skill/scripts/...` — they break when the skill is installed at a different location. Never use `find` to locate scripts — the Supporting Files block always provides the correct absolute paths.
+Ruby standard library covers most needs (`net/http`, `json`, `fileutils`, `uri`, `time`). No gems needed for basic API calls.
+#### Principle of Least Surprise
+Skills must not contain malware, exploit code, or anything that could compromise security. A skill's contents should not surprise the user if described. Don't create misleading skills or skills designed for unauthorized access or data exfiltration.
+#### Writing Patterns
+Use the imperative form in instructions.
+**Defining output formats:**
+```markdown
+## Report structure
+Use this exact template:
+# [Title]
+## Executive summary
+## Key findings
+## Recommendations
+```
+**Examples pattern:**
+```markdown
+## Commit message format
+**Example 1:**
+Input: Added user authentication with JWT tokens
+Output: feat(auth): implement JWT-based authentication
+```
+### Writing Style
+Explain *why* things are important rather than just issuing commands. Use theory of mind — make the skill general, not over-fitted to specific examples. Write a draft, then look at it with fresh eyes and improve it. If you find yourself writing ALWAYS or NEVER in all caps, that's a yellow flag — try to reframe as an explanation of why, so the agent understands the reasoning rather than just following a rule.
+### Test Cases
+After writing the skill draft, come up with 2–3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user for review, then run them.
+Save test cases to `evals/evals.json`:
+```json
+{
+  "skill_name": "example-skill",
+  "evals": [
+    {
+      "id": 1,
+      "prompt": "User's task prompt",
+      "expected_output": "Description of expected result",
+      "files": []
+    }
+  ]
+}
+```
+Don't write assertions yet — just the prompts. Add assertions in the next step.
+See `references/schemas.md` for the full schema.
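A minimal Ruby sketch for producing `evals/evals.json` in the shape shown above. The helpers `build_evals` and `write_evals` are hypothetical; only the field names (`skill_name`, `evals`, `id`, `prompt`, `expected_output`, `files`) come from the example JSON.

```ruby
require "json"
require "fileutils"

# prompts is an array of [prompt, expected_output] pairs; ids start at 1.
def build_evals(skill_name, prompts)
  {
    "skill_name" => skill_name,
    "evals" => prompts.each_with_index.map { |(prompt, expected), index|
      { "id" => index + 1, "prompt" => prompt, "expected_output" => expected, "files" => [] }
    }
  }
end

# Writes the structure to <skill_dir>/evals/evals.json and returns the path.
def write_evals(skill_dir, evals)
  path = File.join(skill_dir, "evals", "evals.json")
  FileUtils.mkdir_p(File.dirname(path))
  File.write(path, JSON.pretty_generate(evals))
  path
end
```

Assertions are intentionally left out here, matching the instruction to add them only in the next step.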
+---
+## Running and Evaluating Test Cases
+This is one continuous sequence — don't stop partway through.
+Since Octo has no subagents, run test cases **serially** in the current session. For each test case, simulate two runs:
+- **with_skill**: Read the SKILL.md, then follow its instructions to complete the task
+- **without_skill**: Complete the same task using only general knowledge (no skill instructions)
+Put results in `<skill-name>-workspace/` as a sibling to the skill directory. Organize by iteration (`iteration-1/`, `iteration-2/`, etc.), and within that by test case (use descriptive names like `eval-create-report`, not `eval-0`).
+### Step 1: For each test case, create the eval directory and run both variants
+```
+<skill-name>-workspace/
+└── iteration-1/
+    ├── eval-<descriptive-name>/
+    │   ├── eval_metadata.json
+    │   ├── with_skill/
+    │   │   ├── outputs/        ← files produced
+    │   │   └── grading.json    ← filled in later
+    │   └── without_skill/
+    │       ├── outputs/
+    │       └── grading.json
+    └── benchmark.json          ← filled in after all evals
+```
+Write `eval_metadata.json` for each test case:
+```json
+{
+  "eval_id": 1,
+  "eval_name": "descriptive-name",
+  "prompt": "The task prompt",
+  "assertions": []
+}
+```
+**Running a with_skill eval**: Read the skill's SKILL.md fully, then execute the task as instructed by the skill — create files, run scripts, write outputs to `with_skill/outputs/`.
+**Running a without_skill eval**: Execute the same task using only general knowledge. Write outputs to `without_skill/outputs/`. This is the baseline.
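The directory layout and `eval_metadata.json` described in this step could be scaffolded with a short Ruby helper like the one below. `scaffold_eval` is a hypothetical name; the directory names and metadata fields follow the tree and JSON example above.

```ruby
require "json"
require "fileutils"

# Creates iteration-N/eval-<name>/{with_skill,without_skill}/outputs and
# writes eval_metadata.json with an empty assertions list.
def scaffold_eval(workspace, iteration, eval_id, eval_name, prompt)
  eval_dir = File.join(workspace, "iteration-#{iteration}", "eval-#{eval_name}")
  %w[with_skill without_skill].each do |configuration|
    FileUtils.mkdir_p(File.join(eval_dir, configuration, "outputs"))
  end

  metadata = {
    "eval_id" => eval_id,
    "eval_name" => eval_name,
    "prompt" => prompt,
    "assertions" => []
  }
  File.write(File.join(eval_dir, "eval_metadata.json"), JSON.pretty_generate(metadata))
  eval_dir
end
```

For example, `scaffold_eval("my-skill-workspace", 1, 1, "create-report", "Create a weekly report")` would prepare both run directories before either variant is executed.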
+### Step 2: Draft assertions while running
+Don't wait until all runs finish — draft quantitative assertions as you go and explain them to the user.
+Good assertions are **objectively verifiable** and **descriptively named** — someone glancing at the benchmark should immediately understand what each one checks. Subjective skills are better evaluated qualitatively; don't force assertions onto things that need human judgment.
+Update `eval_metadata.json` with assertions once drafted. Also update `evals/evals.json`.
+### Step 3: Grade each run
+For each run, evaluate assertions against the outputs. Save results to `grading.json` in each run directory.
+The `grading.json` format (exact field names matter for the viewer):
+```json
+{
+  "eval_id": 1,
+  "configuration": "with_skill",
+  "expectations": [
+    {
+      "text": "The script uses absolute paths",
+      "passed": true,
+      "evidence": "Script uses $HOME/... throughout"
+    }
+  ],
+  "pass_count": 1,
+  "total_count": 1,
+  "pass_rate": 1.0
+}
+```
+For assertions that can be checked programmatically, write and run a Ruby script — it's faster and more reliable than eyeballing:
+```ruby
+#!/usr/bin/env ruby
+# Check assertion: output file contains expected content
+output = File.read("with_skill/outputs/result.md")
+puts output.include?("expected phrase") ? "PASS" : "FAIL"
+```
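Once individual checks have run, the counts in `grading.json` can be derived rather than typed by hand. The sketch below assumes each expectation is already a hash with the `text`, `passed`, and `evidence` fields shown earlier; the helper name `build_grading` and the two-decimal rounding are assumptions, not something the viewer is known to require.

```ruby
require "json"

# Build the grading.json structure from a list of checked expectations.
# Each expectation is a hash with "text", "passed", and "evidence" keys.
def build_grading(eval_id, configuration, expectations)
  pass_count = expectations.count { |e| e["passed"] }
  total_count = expectations.size
  {
    "eval_id" => eval_id,
    "configuration" => configuration,
    "expectations" => expectations,
    "pass_count" => pass_count,
    "total_count" => total_count,
    # Avoid dividing by zero when no assertions have been drafted yet.
    "pass_rate" => total_count.zero? ? 0.0 : (pass_count.to_f / total_count).round(2)
  }
end

expectations = [
  { "text" => "The script uses absolute paths", "passed" => true,
    "evidence" => "Script uses $HOME/... throughout" }
]
puts JSON.pretty_generate(build_grading(1, "with_skill", expectations))
```

Deriving `pass_count`, `total_count`, and `pass_rate` from the expectation list keeps the three fields from drifting out of sync with each other.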
+### Step 4: Aggregate into benchmark
+Create `benchmark.json` in the iteration directory. List `with_skill` before `without_skill` for each eval:
+```json
+{
+  "skill_name": "my-skill",
+  "iteration": 1,
+  "configurations": [
+    {
+      "name": "with_skill",
+      "label": "With skill",
+      "evals": [
+        {"eval_id": 1, "eval_name": "eval-name", "pass_rate": 1.0, "pass_count": 3, "total_count": 3}
+      ],
+      "overall_pass_rate": 1.0,
+      "total_pass": 3,
+      "total_assertions": 3
+    },
+    {
+      "name": "without_skill",
+      "label": "Without skill (baseline)",
+      "evals": [
+        {"eval_id": 1, "eval_name": "eval-name", "pass_rate": 0.33, "pass_count": 1, "total_count": 3}
+      ],
+      "overall_pass_rate": 0.33,
+      "total_pass": 1,
+      "total_assertions": 3
+    }
+  ],
+  "delta": {
+    "pass_rate_improvement": 0.67,
+    "summary": "With skill: 100% | Without skill: 33% | Delta: +67pp"
+  },
+  "analyst_observations": [
+    "..."
+  ]
+}
+```
+Or run the aggregation script (from the skill-creator directory):
+```bash
+python3 -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
+```
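For reference, the arithmetic behind the `benchmark.json` example can be sketched in Ruby. This is not the `aggregate_benchmark` script; the helper names are hypothetical, each grading hash is assumed to carry an extra `eval_name` key, and the overall rate is assumed to be pooled (all passed assertions over all assertions), which matches the sample figures above.

```ruby
# Summarize one configuration from its graded runs.
# `gradings` is an array of hashes shaped like grading.json, each with an
# extra "eval_name" key (an assumption made for this sketch).
def summarize_configuration(name, label, gradings)
  total_pass = gradings.sum { |g| g["pass_count"] }
  total_assertions = gradings.sum { |g| g["total_count"] }
  {
    "name" => name,
    "label" => label,
    "evals" => gradings.map do |g|
      g.slice("eval_id", "eval_name", "pass_rate", "pass_count", "total_count")
    end,
    # Pooled rate: every passed assertion over every assertion.
    "overall_pass_rate" =>
      total_assertions.zero? ? 0.0 : (total_pass.to_f / total_assertions).round(2),
    "total_pass" => total_pass,
    "total_assertions" => total_assertions
  }
end

# Build the "delta" entry from the two configuration summaries.
def pass_rate_delta(with_skill, without_skill)
  with_rate = with_skill["overall_pass_rate"]
  without_rate = without_skill["overall_pass_rate"]
  improvement = (with_rate - without_rate).round(2)
  {
    "pass_rate_improvement" => improvement,
    "summary" => format(
      "With skill: %d%% | Without skill: %d%% | Delta: %+dpp",
      (with_rate * 100).round, (without_rate * 100).round, (improvement * 100).round
    )
  }
end
```

The delta is reported in percentage points (`pp`), so a move from 33% to 100% reads as `+67pp` rather than a relative change.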
+### Step 5: Do an analyst pass
+Read the benchmark data and surface patterns the aggregate stats might hide. See `agents/analyzer.md` for what to look for — things like assertions that always pass regardless of skill (non-discriminating), high-variance evals, and time/effort tradeoffs.
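One of those patterns, non-discriminating assertions, is easy to check mechanically. The sketch below is a hypothetical helper, not the logic in `agents/analyzer.md`; it assumes the same assertion text is used in both configurations and takes the parsed `grading.json` hashes for all runs.

```ruby
# Return the texts of assertions that passed in every graded run, across both
# configurations. Such assertions say nothing about whether the skill helps.
# `gradings` is an array of hashes shaped like grading.json.
def non_discriminating_assertions(gradings)
  results_by_text = Hash.new { |hash, key| hash[key] = [] }
  gradings.each do |grading|
    grading["expectations"].each do |expectation|
      results_by_text[expectation["text"]] << expectation["passed"]
    end
  end
  results_by_text.select { |_text, results| results.all? }.keys
end
```

An assertion flagged here is a candidate for tightening or removal, since it passes with and without the skill.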
+### Step 6: Generate the eval viewer — ALWAYS DO THIS BEFORE REVISING THE SKILL
+**Generate the viewer first. Get the outputs in front of the user before making any changes.**
+```bash
+python3 <skill-creator-path>/eval-viewer/generate_review.py \
+  <workspace>/iteration-N \
+  --skill-name "my-skill" \
+  --benchmark <workspace>/iteration-N/benchmark.json \
+  --static /tmp/<skill-name>-review.html
+open /tmp/<skill-name>-review.html
+```
+For iteration 2+, also pass `--previous-workspace <workspace>/iteration-<N-1>`.
+Tell the user: "I've opened the results in your browser. The 'Outputs' tab lets you click through each test case and leave feedback; the 'Benchmark' tab shows the quantitative comparison. When you're done, come back and let me know."
+### What the user sees in the viewer
+**Outputs tab**: One test case at a time.
+- Prompt, output files (rendered inline where possible)
+- Previous output (iteration 2+, collapsed)
+- Formal grades (collapsed)
+- Feedback textbox (auto-saves)
+- Previous feedback (iteration 2+)
+**Benchmark tab**: Pass rates, per-eval breakdowns, analyst observations.
+Navigation: prev/next buttons or arrow keys. "Submit All Reviews" saves to `feedback.json`.
+### Step 7: Read the feedback
+When the user says they're done, read `feedback.json`:
+```json
+{
+  "reviews": [
+    {"run_id": "eval-0-with_skill", "feedback": "missing axis labels on chart", "timestamp": "..."},
+    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
+  ],
+  "status": "complete"
+}
+```
+Empty feedback = user was happy with that test case. Focus on cases with specific complaints.
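Filtering out the empty entries can be sketched as follows. The helper name `actionable_reviews` is hypothetical, and the sample data is the `feedback.json` example from above.

```ruby
require "json"

# Return only the reviews that contain a complaint, skipping empty feedback.
def actionable_reviews(feedback)
  feedback.fetch("reviews", []).reject { |review| review["feedback"].to_s.strip.empty? }
end

feedback = JSON.parse(<<~FEEDBACK)
  {
    "reviews": [
      {"run_id": "eval-0-with_skill", "feedback": "missing axis labels on chart", "timestamp": "..."},
      {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
    ],
    "status": "complete"
  }
FEEDBACK

actionable_reviews(feedback).each do |review|
  puts "#{review["run_id"]}: #{review["feedback"]}"
end
```

Whitespace-only feedback is treated as empty here, which is an assumption about how the viewer saves untouched textboxes.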
+---
+## Improving the Skill
+This is the heart of the loop. You've run tests, the user reviewed results — now make the skill better.
+### How to think about improvements
+**Generalize from feedback.** You're iterating on a few examples, but the skill will be used across thousands of different prompts. Avoid overfitting to specific examples. If there's a stubborn issue, try different metaphors or different approaches rather than adding more rigid rules.
+**Keep it lean.** Remove things that aren't pulling their weight. Read the execution trace, not just the final output — if the skill is making the agent waste time on unproductive steps, cut those parts.
+**Explain the why.** Try hard to explain *why* each instruction matters. Agents are smart — they perform better when they understand the reasoning rather than following rules blindly. If you find yourself writing ALWAYS or NEVER in all caps, reframe it as an explanation.
+**Look for repeated work.** If every test case resulted in writing similar helper logic (e.g., an API call setup, a file parser), that's a signal to bundle a reusable Ruby script into `scripts/` and tell the skill to use it.
+### The iteration loop
+1. Apply improvements to the skill
+2. Re-run all test cases into a new `iteration-<N+1>/` directory (with_skill and without_skill)
+3. Generate the viewer with `--previous-workspace` pointing at the previous iteration
+4. Wait for the user to review and tell you they're done
+5. Read the new feedback, improve again, repeat
+Keep going until:
+- The user says they're happy
+- Feedback is all empty
+- You're not making meaningful progress
+---
+## Advanced: Blind Comparison
+For more rigorous comparison, read `agents/comparator.md` and `agents/analyzer.md`. Optional — the human review loop is usually sufficient.
+---
+## Description Optimization
+The `description` field in SKILL.md frontmatter is the primary triggering mechanism. After creating or improving a skill, offer to optimize it.
+> **Octo note**: `run_eval.py` and `run_loop.py` have been adapted for Octo. They use `octo agent --json` (NDJSON streaming) to detect `invoke_skill` tool calls targeting temp skills in `~/.octo/skills/`. Queries run **serially** (single agent). `improve_description.py` calls the LLM directly via OpenRouter using `~/.octo/config.yml` credentials.
+### Manual description optimization
+**Step 1: Generate trigger eval queries**
+Create roughly 20 eval queries, a mix of should-trigger and should-not-trigger. Save as JSON:
+```json
+[
+  {"query": "the user prompt", "should_trigger": true},
+  {"query": "another prompt", "should_trigger": false}
+]
+```
+Queries must be realistic — concrete, specific, with enough context that a real user would actually say them. Include file paths, personal context, column names, backstory. Use a mix of lengths and styles (casual, formal, typos, abbreviations). Focus on edge cases.
+Bad: `"Format this data"`, `"Extract text from PDF"`, `"Create a chart"`
+Good: `"ok so my boss just sent me this xlsx file (its in downloads, called Q4 sales final FINAL v2.xlsx) and she wants me to add a column showing profit margin. Revenue is column C, costs in column D i think"`
+**Should-trigger queries (8–10):** Different phrasings of the same intent, some formal, some casual. Include cases where the user doesn't explicitly name the skill but clearly needs it, uncommon use cases, and cases where this skill competes with another but should win.
+**Should-not-trigger queries (8–10):** Near-misses — queries that share keywords but actually need something different. The negative cases should be genuinely tricky, not obviously irrelevant ("write a fibonacci function" as a negative for a PDF skill is too easy).
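Before handing the set to the user, it may help to confirm the split between the two kinds of query. This sketch assumes the JSON array format shown above; the helper name `eval_set_balance` and the default `8..10` range (taken from the guidance above) are illustrative.

```ruby
# Count should-trigger and should-not-trigger queries and report whether both
# counts fall inside the expected range.
def eval_set_balance(queries, range = 8..10)
  positives = queries.count { |q| q["should_trigger"] }
  negatives = queries.size - positives
  {
    "should_trigger" => positives,
    "should_not_trigger" => negatives,
    "balanced" => range.cover?(positives) && range.cover?(negatives)
  }
end
```

This only checks the counts. Whether the negatives are genuinely tricky near-misses still needs a human read.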
+**Step 2: Review with user**
+Use the HTML template in `assets/eval_review.html`:
+1. Read the template
+2. Replace `__EVAL_DATA_PLACEHOLDER__` with the JSON array, `__SKILL_NAME_PLACEHOLDER__` with the skill name, `__SKILL_DESCRIPTION_PLACEHOLDER__` with the current description
+3. Write to `/tmp/eval_review_<skill-name>.html` and `open` it
+4. User edits queries, toggles should-trigger, clicks "Export Eval Set"
+5. File downloads to `~/Downloads/eval_set.json`
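The placeholder substitution in step 2 can be sketched in Ruby. The helper name `render_eval_review` is hypothetical, and how the real template consumes each value is not shown here, so any extra HTML or JavaScript escaping it may need is left out.

```ruby
require "json"

# Fill the three placeholders in the review template and return the HTML text.
def render_eval_review(template, queries, skill_name, description)
  replacements = {
    "__EVAL_DATA_PLACEHOLDER__" => JSON.generate(queries),
    "__SKILL_NAME_PLACEHOLDER__" => skill_name,
    "__SKILL_DESCRIPTION_PLACEHOLDER__" => description
  }
  # The block form of gsub inserts each value literally, so backslashes in the
  # description are not interpreted as backreferences.
  template.gsub(Regexp.union(replacements.keys)) { |match| replacements[match] }
end
```

The result would then be written to `/tmp/eval_review_<skill-name>.html` with `File.write` and opened as described in step 3 of the list.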
+**Step 3: Run automated optimization (recommended)**
+Use the scripts from the skill-creator `scripts/` directory. Run from the skill-creator root:
+```bash
+# Single eval run — check current description pass rate
+python3 -m scripts.run_eval \
+  --eval-set ~/Downloads/eval_set.json \
+  --skill-path ~/.octo/skills/my-skill \
+  --verbose
+# Full optimize loop — auto-improves description over N iterations
+python3 -m scripts.run_loop \
+  --eval-set ~/Downloads/eval_set.json \
+  --skill-path ~/.octo/skills/my-skill \
+  --max-iterations 5 \
+  --runs-per-query 1 \
+  --verbose
+  # Outputs: best description + HTML report (auto-opens in browser)
+```
+Notes:
+- **No `--num-workers`** needed (or it's ignored) — Octo runs queries serially
+- **No `--model`** needed — uses the model from `~/.octo/config.yml` automatically
+- Temp skills are written to `~/.octo/skills/` and cleaned up after each query
+- Each query spawns a fresh `octo agent --json` process to avoid session contamination
+**Step 3 (manual fallback)**
+If the scripts fail, iterate manually: for each query in the eval set, judge whether the description would trigger the skill. Tally passes and fails, write an improved description targeting the failures, and repeat 2–3 times.
+Focus on:
+- Failing should-trigger queries → description is too narrow; broaden the trigger language
+- Failing should-not-trigger queries → description is too broad; tighten specificity
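The tally can be split by failure type so it is clear which direction to adjust the description. This is a sketch under stated assumptions: the helper name `classify_failures` and the `triggered` field (your judgment for each query) are hypothetical and not part of the eval set format.

```ruby
# Group failed queries by the kind of fix they call for.
# `results` is an array of hashes with "query", "should_trigger", and
# "triggered" (the judged outcome, a field assumed for this sketch).
def classify_failures(results)
  failures = results.reject { |r| r["triggered"] == r["should_trigger"] }
  # Missed triggers mean the description is too narrow; false triggers mean too broad.
  too_narrow, too_broad = failures.partition { |r| r["should_trigger"] }
  {
    "passed" => results.size - failures.size,
    "failed" => failures.size,
    "too_narrow" => too_narrow.map { |r| r["query"] },
    "too_broad" => too_broad.map { |r| r["query"] }
  }
end
```

If both lists are non-empty, broadening and tightening pull against each other, so it is usually worth rewording the description rather than only adding or removing keywords.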
+**Step 4: Apply the result**
+Update the skill's SKILL.md frontmatter with the improved description. Show the user before/after.
+### How skill triggering works
+Skills appear in Octo's `available_skills` list. The agent consults a skill based on the description match — but only for tasks it can't handle alone. Simple, one-step queries often won't trigger even with a good description. Make eval queries substantive enough that the skill genuinely helps.
+---
+## Packaging
+New skills are created directly in `~/.octo/skills/<skill-name>/` — no packaging step needed. The skill is immediately available in all sessions and the Web UI.
+If distributing externally, you can package it:
+```bash
+python3 -m scripts.package_skill <path/to/skill-folder>
+```
+This creates a `.skill` file. Direct the user to the resulting file path.
+---
+## Reference files
+- `agents/grader.md` — How to evaluate assertions against outputs
+- `agents/comparator.md` — How to do blind A/B comparison between two outputs
+- `agents/analyzer.md` — How to analyze why one version beat another
+- `references/schemas.md` — JSON structures for evals.json, grading.json, benchmark.json
+---
+## The core loop (summary)
+1. Understand what the skill should do
+2. Draft or edit the SKILL.md
+3. Run test prompts — with and without the skill — and save outputs
+4. Grade assertions, aggregate benchmark
+5. **Generate the eval viewer with `generate_review.py`** so the user can review
+6. Get user feedback, improve the skill
+7. Repeat until satisfied
+8. Package and deliver
+Add these steps to your todo list. Specifically: **always generate the eval viewer before revising the skill** — the user's feedback is the primary signal, not your own judgment of the outputs.