octo-agent 0.11.2

Files changed (319)
  1. checksums.yaml +7 -0
  2. data/.clacky/skills/commit/SKILL.md +423 -0
  3. data/.clacky/skills/gem-release/SKILL.md +199 -0
  4. data/.clacky/skills/gem-release/scripts/release.sh +304 -0
  5. data/.clacky/skills/oss-upload/SKILL.md +47 -0
  6. data/.octorules +106 -0
  7. data/.rspec +3 -0
  8. data/.rubocop.yml +8 -0
  9. data/CHANGELOG.md +76 -0
  10. data/CODE_OF_CONDUCT.md +132 -0
  11. data/CONTRIBUTING.md +92 -0
  12. data/Dockerfile +28 -0
  13. data/LICENSE.txt +22 -0
  14. data/POSITIONING.md +46 -0
  15. data/README.md +134 -0
  16. data/README_CN.md +134 -0
  17. data/Rakefile +34 -0
  18. data/benchmark/fixtures/sample_project/Gemfile +3 -0
  19. data/benchmark/fixtures/sample_project/lib/api_handler.rb +32 -0
  20. data/benchmark/fixtures/sample_project/lib/order_calculator.rb +23 -0
  21. data/benchmark/fixtures/sample_project/lib/user_renderer.rb +20 -0
  22. data/benchmark/fixtures/sample_project/spec/order_calculator_spec.rb +20 -0
  23. data/benchmark/results/EVALUATION_REPORT.md +165 -0
  24. data/benchmark/results/baseline_20260511_174424.json +128 -0
  25. data/benchmark/results/report_20260511_175256.json +271 -0
  26. data/benchmark/results/report_20260511_175444.json +271 -0
  27. data/benchmark/results/treatment_20260511_175103.json +130 -0
  28. data/benchmark/runner.rb +441 -0
  29. data/bin/octo +7 -0
  30. data/docs/agent-first-ui-design.md +77 -0
  31. data/docs/billing-system.md +318 -0
  32. data/docs/channel-architecture.md +235 -0
  33. data/docs/engineering-article.md +343 -0
  34. data/docs/session-skill-invocation.md +69 -0
  35. data/docs/time_machine_design.md +247 -0
  36. data/docs/ui2-architecture.md +124 -0
  37. data/homebrew/README.md +96 -0
  38. data/homebrew/openocto.rb +24 -0
  39. data/lib/octo/agent/hook_manager.rb +61 -0
  40. data/lib/octo/agent/llm_caller.rb +800 -0
  41. data/lib/octo/agent/memory_updater.rb +246 -0
  42. data/lib/octo/agent/message_compressor.rb +225 -0
  43. data/lib/octo/agent/message_compressor_helper.rb +869 -0
  44. data/lib/octo/agent/next_message_suggester.rb +215 -0
  45. data/lib/octo/agent/session_serializer.rb +685 -0
  46. data/lib/octo/agent/skill_auto_creator.rb +114 -0
  47. data/lib/octo/agent/skill_evolution.rb +61 -0
  48. data/lib/octo/agent/skill_manager.rb +466 -0
  49. data/lib/octo/agent/skill_reflector.rb +89 -0
  50. data/lib/octo/agent/system_prompt_builder.rb +101 -0
  51. data/lib/octo/agent/time_machine.rb +214 -0
  52. data/lib/octo/agent/tool_executor.rb +454 -0
  53. data/lib/octo/agent/tool_registry.rb +150 -0
  54. data/lib/octo/agent.rb +2180 -0
  55. data/lib/octo/agent_config.rb +989 -0
  56. data/lib/octo/agent_profile.rb +112 -0
  57. data/lib/octo/anthropic_stream_aggregator.rb +137 -0
  58. data/lib/octo/background_task_registry.rb +324 -0
  59. data/lib/octo/banner.rb +34 -0
  60. data/lib/octo/bedrock_stream_aggregator.rb +137 -0
  61. data/lib/octo/block_font.rb +331 -0
  62. data/lib/octo/cli.rb +968 -0
  63. data/lib/octo/client.rb +623 -0
  64. data/lib/octo/default_agents/SOUL.md +3 -0
  65. data/lib/octo/default_agents/USER.md +1 -0
  66. data/lib/octo/default_agents/base_prompt.md +66 -0
  67. data/lib/octo/default_agents/coding/profile.yml +2 -0
  68. data/lib/octo/default_agents/coding/system_prompt.md +67 -0
  69. data/lib/octo/default_agents/general/profile.yml +2 -0
  70. data/lib/octo/default_agents/general/system_prompt.md +16 -0
  71. data/lib/octo/default_parsers/doc_parser.rb +69 -0
  72. data/lib/octo/default_parsers/docx_parser.rb +188 -0
  73. data/lib/octo/default_parsers/pdf_parser.rb +120 -0
  74. data/lib/octo/default_parsers/pdf_parser_ocr.py +103 -0
  75. data/lib/octo/default_parsers/pdf_parser_plumber.py +62 -0
  76. data/lib/octo/default_parsers/pptx_parser.rb +140 -0
  77. data/lib/octo/default_parsers/xlsx_parser.rb +121 -0
  78. data/lib/octo/default_skills/browser-setup/SKILL.md +426 -0
  79. data/lib/octo/default_skills/channel-manager/SKILL.md +623 -0
  80. data/lib/octo/default_skills/channel-manager/dingtalk_setup.rb +191 -0
  81. data/lib/octo/default_skills/channel-manager/discord_setup.rb +199 -0
  82. data/lib/octo/default_skills/channel-manager/feishu_setup.rb +574 -0
  83. data/lib/octo/default_skills/channel-manager/import_lark_skills.rb +97 -0
  84. data/lib/octo/default_skills/channel-manager/install_feishu_skills.rb +105 -0
  85. data/lib/octo/default_skills/channel-manager/weixin_setup.rb +274 -0
  86. data/lib/octo/default_skills/code-explorer/SKILL.md +36 -0
  87. data/lib/octo/default_skills/cron-task-creator/SKILL.md +257 -0
  88. data/lib/octo/default_skills/cron-task-creator/evals/evals.json +38 -0
  89. data/lib/octo/default_skills/onboard/SKILL.md +578 -0
  90. data/lib/octo/default_skills/onboard/scripts/import_external_skills.rb +413 -0
  91. data/lib/octo/default_skills/onboard/scripts/install_builtin_skills.rb +97 -0
  92. data/lib/octo/default_skills/persist-memory/SKILL.md +59 -0
  93. data/lib/octo/default_skills/personal-website/SKILL.md +113 -0
  94. data/lib/octo/default_skills/personal-website/publish.rb +235 -0
  95. data/lib/octo/default_skills/product-help/SKILL.md +123 -0
  96. data/lib/octo/default_skills/product-help/docs/agent-config.md +74 -0
  97. data/lib/octo/default_skills/product-help/docs/best-practices.md +49 -0
  98. data/lib/octo/default_skills/product-help/docs/browser-tool.md +53 -0
  99. data/lib/octo/default_skills/product-help/docs/built-in-skills.md +43 -0
  100. data/lib/octo/default_skills/product-help/docs/cli-reference.md +82 -0
  101. data/lib/octo/default_skills/product-help/docs/create-your-first-skill.md +47 -0
  102. data/lib/octo/default_skills/product-help/docs/faq.md +98 -0
  103. data/lib/octo/default_skills/product-help/docs/how-to-use-a-skill.md +58 -0
  104. data/lib/octo/default_skills/product-help/docs/installation.md +59 -0
  105. data/lib/octo/default_skills/product-help/docs/memory-system.md +61 -0
  106. data/lib/octo/default_skills/product-help/docs/octorules.md +62 -0
  107. data/lib/octo/default_skills/product-help/docs/session-management.md +63 -0
  108. data/lib/octo/default_skills/product-help/docs/skill-basics.md +55 -0
  109. data/lib/octo/default_skills/product-help/docs/skill-frontmatter.md +61 -0
  110. data/lib/octo/default_skills/product-help/docs/web-server.md +49 -0
  111. data/lib/octo/default_skills/product-help/docs/what-is-octo.md +37 -0
  112. data/lib/octo/default_skills/product-help/docs/windows-installation.md +36 -0
  113. data/lib/octo/default_skills/product-help/docs/writing-tips.md +53 -0
  114. data/lib/octo/default_skills/recall-memory/SKILL.md +65 -0
  115. data/lib/octo/default_skills/skill-add/SKILL.md +59 -0
  116. data/lib/octo/default_skills/skill-add/scripts/install_from_zip.rb +295 -0
  117. data/lib/octo/default_skills/skill-creator/SKILL.md +602 -0
  118. data/lib/octo/default_skills/skill-creator/agents/analyzer.md +274 -0
  119. data/lib/octo/default_skills/skill-creator/agents/comparator.md +202 -0
  120. data/lib/octo/default_skills/skill-creator/agents/grader.md +223 -0
  121. data/lib/octo/default_skills/skill-creator/eval-viewer/generate_review.py +471 -0
  122. data/lib/octo/default_skills/skill-creator/eval-viewer/viewer.html +1325 -0
  123. data/lib/octo/default_skills/skill-creator/references/schemas.md +430 -0
  124. data/lib/octo/default_skills/skill-creator/scripts/__init__.py +0 -0
  125. data/lib/octo/default_skills/skill-creator/scripts/aggregate_benchmark.py +401 -0
  126. data/lib/octo/default_skills/skill-creator/scripts/generate_report.py +326 -0
  127. data/lib/octo/default_skills/skill-creator/scripts/improve_description.py +310 -0
  128. data/lib/octo/default_skills/skill-creator/scripts/quick_validate.py +103 -0
  129. data/lib/octo/default_skills/skill-creator/scripts/run_eval.py +317 -0
  130. data/lib/octo/default_skills/skill-creator/scripts/run_loop.py +331 -0
  131. data/lib/octo/default_skills/skill-creator/scripts/utils.py +47 -0
  132. data/lib/octo/default_skills/skill-creator/scripts/validate_skill_frontmatter.rb +143 -0
  133. data/lib/octo/idle_compression_timer.rb +115 -0
  134. data/lib/octo/json_ui_controller.rb +204 -0
  135. data/lib/octo/message_format/anthropic.rb +409 -0
  136. data/lib/octo/message_format/bedrock.rb +361 -0
  137. data/lib/octo/message_format/open_ai.rb +222 -0
  138. data/lib/octo/message_history.rb +373 -0
  139. data/lib/octo/openai_stream_aggregator.rb +130 -0
  140. data/lib/octo/plain_ui_controller.rb +166 -0
  141. data/lib/octo/providers.rb +534 -0
  142. data/lib/octo/server/browser_manager.rb +397 -0
  143. data/lib/octo/server/channel/adapters/base.rb +82 -0
  144. data/lib/octo/server/channel/adapters/dingtalk/adapter.rb +314 -0
  145. data/lib/octo/server/channel/adapters/dingtalk/api_client.rb +391 -0
  146. data/lib/octo/server/channel/adapters/dingtalk/stream_client.rb +203 -0
  147. data/lib/octo/server/channel/adapters/discord/adapter.rb +229 -0
  148. data/lib/octo/server/channel/adapters/discord/api_client.rb +107 -0
  149. data/lib/octo/server/channel/adapters/discord/gateway_client.rb +270 -0
  150. data/lib/octo/server/channel/adapters/feishu/adapter.rb +320 -0
  151. data/lib/octo/server/channel/adapters/feishu/bot.rb +478 -0
  152. data/lib/octo/server/channel/adapters/feishu/file_processor.rb +36 -0
  153. data/lib/octo/server/channel/adapters/feishu/message_parser.rb +129 -0
  154. data/lib/octo/server/channel/adapters/feishu/ws_client.rb +423 -0
  155. data/lib/octo/server/channel/adapters/telegram/adapter.rb +375 -0
  156. data/lib/octo/server/channel/adapters/telegram/api_client.rb +205 -0
  157. data/lib/octo/server/channel/adapters/wecom/adapter.rb +148 -0
  158. data/lib/octo/server/channel/adapters/wecom/media_downloader.rb +115 -0
  159. data/lib/octo/server/channel/adapters/wecom/ws_client.rb +395 -0
  160. data/lib/octo/server/channel/adapters/weixin/adapter.rb +692 -0
  161. data/lib/octo/server/channel/adapters/weixin/api_client.rb +402 -0
  162. data/lib/octo/server/channel/channel_config.rb +178 -0
  163. data/lib/octo/server/channel/channel_manager.rb +468 -0
  164. data/lib/octo/server/channel/channel_ui_controller.rb +224 -0
  165. data/lib/octo/server/channel.rb +33 -0
  166. data/lib/octo/server/discover.rb +77 -0
  167. data/lib/octo/server/epipe_safe_io.rb +105 -0
  168. data/lib/octo/server/http_server.rb +3554 -0
  169. data/lib/octo/server/scheduler.rb +317 -0
  170. data/lib/octo/server/server_master.rb +325 -0
  171. data/lib/octo/server/session_registry.rb +431 -0
  172. data/lib/octo/server/web_ui_controller.rb +487 -0
  173. data/lib/octo/session_manager.rb +385 -0
  174. data/lib/octo/skill.rb +466 -0
  175. data/lib/octo/skill_loader.rb +328 -0
  176. data/lib/octo/tools/base.rb +118 -0
  177. data/lib/octo/tools/browser.rb +625 -0
  178. data/lib/octo/tools/edit.rb +165 -0
  179. data/lib/octo/tools/file_reader.rb +549 -0
  180. data/lib/octo/tools/glob.rb +162 -0
  181. data/lib/octo/tools/grep.rb +356 -0
  182. data/lib/octo/tools/invoke_skill.rb +96 -0
  183. data/lib/octo/tools/list_tasks.rb +54 -0
  184. data/lib/octo/tools/redo_task.rb +41 -0
  185. data/lib/octo/tools/request_user_feedback.rb +84 -0
  186. data/lib/octo/tools/security.rb +333 -0
  187. data/lib/octo/tools/terminal/output_cleaner.rb +63 -0
  188. data/lib/octo/tools/terminal/persistent_session.rb +268 -0
  189. data/lib/octo/tools/terminal/safe_rm.sh +106 -0
  190. data/lib/octo/tools/terminal/session_manager.rb +213 -0
  191. data/lib/octo/tools/terminal.rb +1828 -0
  192. data/lib/octo/tools/todo_manager.rb +374 -0
  193. data/lib/octo/tools/trash_manager.rb +388 -0
  194. data/lib/octo/tools/undo_task.rb +35 -0
  195. data/lib/octo/tools/web_fetch.rb +242 -0
  196. data/lib/octo/tools/web_search.rb +260 -0
  197. data/lib/octo/tools/write.rb +77 -0
  198. data/lib/octo/ui2/block_font.rb +10 -0
  199. data/lib/octo/ui2/components/base_component.rb +163 -0
  200. data/lib/octo/ui2/components/command_suggestions.rb +290 -0
  201. data/lib/octo/ui2/components/common_component.rb +96 -0
  202. data/lib/octo/ui2/components/inline_input.rb +226 -0
  203. data/lib/octo/ui2/components/input_area.rb +1338 -0
  204. data/lib/octo/ui2/components/message_component.rb +99 -0
  205. data/lib/octo/ui2/components/modal_component.rb +419 -0
  206. data/lib/octo/ui2/components/todo_area.rb +149 -0
  207. data/lib/octo/ui2/components/tool_component.rb +107 -0
  208. data/lib/octo/ui2/components/welcome_banner.rb +139 -0
  209. data/lib/octo/ui2/layout_manager.rb +807 -0
  210. data/lib/octo/ui2/line_editor.rb +363 -0
  211. data/lib/octo/ui2/markdown_renderer.rb +100 -0
  212. data/lib/octo/ui2/output_buffer.rb +370 -0
  213. data/lib/octo/ui2/progress_handle.rb +362 -0
  214. data/lib/octo/ui2/progress_indicator.rb +55 -0
  215. data/lib/octo/ui2/screen_buffer.rb +273 -0
  216. data/lib/octo/ui2/terminal_detector.rb +119 -0
  217. data/lib/octo/ui2/theme_manager.rb +85 -0
  218. data/lib/octo/ui2/themes/base_theme.rb +105 -0
  219. data/lib/octo/ui2/themes/hacker_theme.rb +62 -0
  220. data/lib/octo/ui2/themes/minimal_theme.rb +56 -0
  221. data/lib/octo/ui2/thinking_verbs.rb +26 -0
  222. data/lib/octo/ui2/ui_controller.rb +1625 -0
  223. data/lib/octo/ui2/view_renderer.rb +177 -0
  224. data/lib/octo/ui2.rb +40 -0
  225. data/lib/octo/ui_interface.rb +154 -0
  226. data/lib/octo/utils/arguments_parser.rb +191 -0
  227. data/lib/octo/utils/browser_detector.rb +195 -0
  228. data/lib/octo/utils/encoding.rb +92 -0
  229. data/lib/octo/utils/environment_detector.rb +140 -0
  230. data/lib/octo/utils/file_ignore_helper.rb +170 -0
  231. data/lib/octo/utils/file_processor.rb +601 -0
  232. data/lib/octo/utils/gitignore_parser.rb +154 -0
  233. data/lib/octo/utils/limit_stack.rb +152 -0
  234. data/lib/octo/utils/logger.rb +124 -0
  235. data/lib/octo/utils/login_shell.rb +72 -0
  236. data/lib/octo/utils/model_pricing.rb +646 -0
  237. data/lib/octo/utils/parser_manager.rb +165 -0
  238. data/lib/octo/utils/path_helper.rb +15 -0
  239. data/lib/octo/utils/scripts_manager.rb +59 -0
  240. data/lib/octo/utils/string_matcher.rb +158 -0
  241. data/lib/octo/utils/trash_directory.rb +112 -0
  242. data/lib/octo/utils/workspace_rules.rb +46 -0
  243. data/lib/octo/version.rb +5 -0
  244. data/lib/octo/web/app.css +7141 -0
  245. data/lib/octo/web/app.js +543 -0
  246. data/lib/octo/web/apple-touch-icon.png +0 -0
  247. data/lib/octo/web/auth.js +150 -0
  248. data/lib/octo/web/channels.js +276 -0
  249. data/lib/octo/web/datepicker.js +205 -0
  250. data/lib/octo/web/favicon.png +0 -0
  251. data/lib/octo/web/i18n.js +1073 -0
  252. data/lib/octo/web/icon-512.png +0 -0
  253. data/lib/octo/web/icon-dark.svg +25 -0
  254. data/lib/octo/web/icon.svg +29 -0
  255. data/lib/octo/web/index.html +871 -0
  256. data/lib/octo/web/marked.min.js +69 -0
  257. data/lib/octo/web/onboard.js +491 -0
  258. data/lib/octo/web/profile.js +442 -0
  259. data/lib/octo/web/sessions.js +4421 -0
  260. data/lib/octo/web/settings.js +913 -0
  261. data/lib/octo/web/sidebar.js +32 -0
  262. data/lib/octo/web/skills.js +885 -0
  263. data/lib/octo/web/tasks.js +297 -0
  264. data/lib/octo/web/theme.js +105 -0
  265. data/lib/octo/web/trash.js +343 -0
  266. data/lib/octo/web/vendor/hljs/highlight.min.js +1244 -0
  267. data/lib/octo/web/vendor/hljs/hljs-theme.css +95 -0
  268. data/lib/octo/web/vendor/katex/auto-render.min.js +1 -0
  269. data/lib/octo/web/vendor/katex/fonts/KaTeX_AMS-Regular.woff2 +0 -0
  270. data/lib/octo/web/vendor/katex/fonts/KaTeX_Caligraphic-Bold.woff2 +0 -0
  271. data/lib/octo/web/vendor/katex/fonts/KaTeX_Caligraphic-Regular.woff2 +0 -0
  272. data/lib/octo/web/vendor/katex/fonts/KaTeX_Fraktur-Bold.woff2 +0 -0
  273. data/lib/octo/web/vendor/katex/fonts/KaTeX_Fraktur-Regular.woff2 +0 -0
  274. data/lib/octo/web/vendor/katex/fonts/KaTeX_Main-Bold.woff2 +0 -0
  275. data/lib/octo/web/vendor/katex/fonts/KaTeX_Main-BoldItalic.woff2 +0 -0
  276. data/lib/octo/web/vendor/katex/fonts/KaTeX_Main-Italic.woff2 +0 -0
  277. data/lib/octo/web/vendor/katex/fonts/KaTeX_Main-Regular.woff2 +0 -0
  278. data/lib/octo/web/vendor/katex/fonts/KaTeX_Math-BoldItalic.woff2 +0 -0
  279. data/lib/octo/web/vendor/katex/fonts/KaTeX_Math-Italic.woff2 +0 -0
  280. data/lib/octo/web/vendor/katex/fonts/KaTeX_SansSerif-Bold.woff2 +0 -0
  281. data/lib/octo/web/vendor/katex/fonts/KaTeX_SansSerif-Italic.woff2 +0 -0
  282. data/lib/octo/web/vendor/katex/fonts/KaTeX_SansSerif-Regular.woff2 +0 -0
  283. data/lib/octo/web/vendor/katex/fonts/KaTeX_Script-Regular.woff2 +0 -0
  284. data/lib/octo/web/vendor/katex/fonts/KaTeX_Size1-Regular.woff2 +0 -0
  285. data/lib/octo/web/vendor/katex/fonts/KaTeX_Size2-Regular.woff2 +0 -0
  286. data/lib/octo/web/vendor/katex/fonts/KaTeX_Size3-Regular.woff2 +0 -0
  287. data/lib/octo/web/vendor/katex/fonts/KaTeX_Size4-Regular.woff2 +0 -0
  288. data/lib/octo/web/vendor/katex/fonts/KaTeX_Typewriter-Regular.woff2 +0 -0
  289. data/lib/octo/web/vendor/katex/katex.min.css +1 -0
  290. data/lib/octo/web/vendor/katex/katex.min.js +1 -0
  291. data/lib/octo/web/version.js +449 -0
  292. data/lib/octo/web/weixin-qr.html +209 -0
  293. data/lib/octo/web/ws-dispatcher.js +357 -0
  294. data/lib/octo/web/ws.js +128 -0
  295. data/lib/octo.rb +145 -0
  296. data/scripts/build/build.sh +329 -0
  297. data/scripts/build/lib/apt.sh +56 -0
  298. data/scripts/build/lib/brew.sh +89 -0
  299. data/scripts/build/lib/colors.sh +17 -0
  300. data/scripts/build/lib/gem.sh +95 -0
  301. data/scripts/build/lib/mise.sh +125 -0
  302. data/scripts/build/lib/network.sh +157 -0
  303. data/scripts/build/lib/os.sh +57 -0
  304. data/scripts/build/lib/shell.sh +37 -0
  305. data/scripts/build/src/install.sh.cc +174 -0
  306. data/scripts/build/src/install_browser.sh.cc +101 -0
  307. data/scripts/build/src/install_full.sh.cc +290 -0
  308. data/scripts/build/src/install_rails_deps.sh.cc +145 -0
  309. data/scripts/build/src/install_system_deps.sh.cc +123 -0
  310. data/scripts/build/src/uninstall.sh.cc +101 -0
  311. data/scripts/install.ps1 +532 -0
  312. data/scripts/install.sh +567 -0
  313. data/scripts/install_browser.sh +479 -0
  314. data/scripts/install_full.sh +838 -0
  315. data/scripts/install_rails_deps.sh +746 -0
  316. data/scripts/install_system_deps.sh +518 -0
  317. data/scripts/uninstall.sh +287 -0
  318. data/sig/octo.rbs +4 -0
  319. metadata +614 -0
data/lib/octo/default_skills/skill-creator/SKILL.md +602 -0
@@ -0,0 +1,602 @@
1
+ ---
2
+ name: skill-creator
3
+ description: Create new skills, modify and improve existing skills, and measure skill performance. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy.
4
+ ---
5
+
6
+ # Skill Creator
7
+
8
+ A skill for creating new skills and iteratively improving them.
9
+
10
+ ## Usage Modes
11
+
12
+ This skill supports two modes:
13
+
14
+ ### 1. Interactive Mode (default)
15
+
16
+ The full workflow with user interviews, test cases, and iteration cycles.
17
+ Use when creating or refining skills manually.
18
+
19
+ At a high level, the process of creating a skill goes like this:
20
+
21
+ - Decide what you want the skill to do and roughly how it should do it
22
+ - Write a draft of the skill
23
+ - Create a few test prompts and simulate running them (with vs. without the skill instructions)
24
+ - Help the user evaluate the results both qualitatively and quantitatively
25
+ - While reviewing, draft quantitative assertions if there aren't any
26
+ - Use `eval-viewer/generate_review.py` to generate a static HTML viewer for the user to review results and leave feedback
27
+ - Rewrite the skill based on the user's feedback
28
+ - Repeat until satisfied
29
+
30
+ Your job is to figure out where the user is in this process and jump in to help them progress through these stages. Maybe they say "I want to make a skill for X" — help narrow down the intent, write a draft, write test cases, evaluate, and repeat. Or maybe they already have a draft — go straight to the eval/iterate part.
31
+
32
+ Always be flexible. If the user says "skip the evals, just vibe with me", do that instead.
33
+
34
+ ### 2. Quick Mode (for agent self-evolution)
35
+
36
+ **Trigger**: When invoked with `mode: "quick"` in the task arguments.
37
+
38
+ Fast, opinionated skill creation without user interaction. This mode is used by the agent's self-evolution system to automatically create or improve skills.
39
+
40
+ **Behavior**:
41
+ - Skip user interviews and detailed requirements gathering
42
+ - Extract workflow pattern from provided context
43
+ - Write a minimal but functional SKILL.md
44
+ - Save to `~/.octo/skills/auto-<name>-<timestamp>/` (or improve existing skill in place)
45
+ - Skip test cases and evals (user can refine later if needed)
46
+ - Always validate frontmatter with the validator script after creation
47
+ - Focus on the happy path; edge cases can be added later
48
+
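As an illustration only (not part of the diffed file), the quick-mode save location described above could be derived roughly as follows. The helper name `auto_skill_dir`, the slug normalization, and the `YYYYmmddHHMMSS` timestamp format are all assumptions; the SKILL.md only states the `auto-<name>-<timestamp>` pattern.

```ruby
# Hypothetical helper: builds ~/.octo/skills/auto-<name>-<timestamp>/.
# Assumption: the timestamp format is YYYYmmddHHMMSS (the source does not specify one).
def auto_skill_dir(suggested_name, now: Time.now, root: File.join(Dir.home, '.octo', 'skills'))
  # Normalize to "lowercase, hyphens OK" as required for skill identifiers.
  slug = suggested_name.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-+|-+\z/, '')
  File.join(root, "auto-#{slug}-#{now.strftime('%Y%m%d%H%M%S')}")
end

example_dir = auto_skill_dir('URL Summarizer', now: Time.new(2026, 1, 2, 3, 4, 5), root: '/tmp/skills')
puts example_dir
```

Passing `now:` and `root:` explicitly keeps the helper deterministic, which makes it easy to check in isolation.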
49
+ **Expected arguments when using quick mode**:
50
+ - `task`: Clear description of what to automate and how (be specific about workflow steps)
51
+ - `mode`: Must be set to `"quick"`
52
+ - `suggested_name`: (optional) Proposed skill identifier (lowercase, hyphens OK)
53
+
54
+ **Quick mode principles**:
55
+ - **Be opinionated**: Make reasonable assumptions without asking
56
+ - **Be concise**: Keep instructions simple and focused
57
+ - **Be practical**: Focus on the core workflow that will save the most time
58
+ - **Be correct**: Always set `disable-model-invocation: false` and `user-invocable: true`
59
+ - **Be validating**: Run the frontmatter validator immediately after creation
60
+
61
+ **Example invocation from the agent's self-evolution system**:
62
+ ```
63
+ invoke_skill(
64
+ skill_name: "skill-creator",
65
+ task: "Create a skill to extract and summarize content from URLs. The skill should: 1) fetch the URL using terminal with curl, 2) parse the HTML to extract main text content, 3) generate a concise markdown summary. Expected input: URL string. Expected output: markdown summary with title and key points.",
66
+ mode: "quick",
67
+ suggested_name: "url-summarizer"
68
+ )
69
+ ```
70
+
71
+ ---
72
+
73
+ ## Platform Context: Octo
74
+
75
+ This skill runs inside **Octo** (octo). Key platform specifics:
76
+
77
+ - **Skills** live at `~/.octo/skills/<skill-name>/` — **always create new skills here** (global user skills, visible to Web UI and all sessions). To locate an existing skill, check these paths in order using `glob` or `ls`: (1) `.octo/skills/` — project-level skills, (2) `~/.octo/skills/` — user-level skills. Built-in skills (shipped with the gem) are always available via `invoke_skill` by name — no file lookup needed. Never use `find /` or broad filesystem searches to locate skills.
78
+ - **No parallel subagents** — Octo runs as a single agent; all test cases execute serially in the current session
79
+ - **No external agent CLI** — for evals, just execute the task directly in-session (read the skill, follow instructions, save outputs)
80
+ - **Scripts** — prefer **Ruby** (`.rb` files); Octo is Ruby-native. Run with `ruby path/to/script.rb`. Python is available but Ruby is the default choice
81
+ - **`python3`** — if Python scripts are needed (e.g., `generate_review.py`), use `python3` explicitly
82
+ - The description optimization scripts (`run_loop.py`, `run_eval.py`) work in Octo — they use `octo agent --json` to detect `invoke_skill` events. See the Description Optimization section for usage
83
+
84
+ ---
85
+
86
+ ## Communicating with the user
87
+
88
+ Pay attention to context cues to understand how technical the user is. In general:
89
+
90
+ - "evaluation" and "benchmark" are fine
91
+ - For "JSON" and "assertion" — explain briefly if you're unsure the user knows these terms
92
+
93
+ It's always OK to briefly explain a term if you're in doubt.
94
+
95
+ ---
96
+
97
+ ## Creating a skill
98
+
99
+ ### Capture Intent
100
+
101
+ Start by understanding what the user wants. If the current conversation already shows a workflow they want to capture (tools used, sequence of steps, corrections made, input/output formats), extract answers from history first — the user may just need to fill gaps and confirm.
102
+
103
+ 1. What should this skill enable Octo to do?
104
+ 2. When should this skill trigger? (what phrases/contexts)
105
+ 3. What's the expected output format?
106
+ 4. Should we set up test cases? Skills with objectively verifiable outputs (file transforms, data extraction, code generation) benefit from test cases. Skills with subjective outputs (writing style, creative work) often don't need them.
107
+
108
+ ### Interview and Research
109
+
110
+ Ask about edge cases, input/output formats, example files, success criteria, and dependencies before writing test prompts. Come prepared with context to reduce burden on the user.
111
+
112
+ ### Write the SKILL.md
113
+
114
+ Components to fill in:
115
+
116
+ - **name**: Skill identifier (lowercase, hyphens OK)
117
+ - **description**: Primary triggering mechanism — include BOTH what the skill does AND specific contexts for when to use it. All "when to use" info goes here, not in the body. Make the description a little "pushy" — err toward over-triggering rather than under-triggering. Example: instead of "Helps with dashboard creation", write "Helps with dashboard creation. Use this skill whenever the user mentions dashboards, data visualization, or wants to display any kind of data, even if they don't explicitly say 'dashboard'."
118
+ - **disable-model-invocation**: Set to `false` (always include this)
119
+ - **user-invocable**: Set to `true` to make the skill appear in the WebUI chatbox `/` command list. **Always include this** — without it, users cannot manually invoke the skill from the Octo Web UI session chat.
120
+ - **compatibility** (optional): Required tools or dependencies
121
+ - **Body**: The actual instructions
122
+
123
+ > **Octo-specific**: Every skill MUST include `disable-model-invocation: false` and `user-invocable: true` in the YAML frontmatter, or it will be invisible in the WebUI `/` command list. The minimal valid frontmatter is:
124
+ > ```yaml
125
+ > ---
126
+ > name: my-skill
127
+ > description: 'Your description here. Avoid colons followed by a space (like "wants to: do X") inside the description — they break YAML parsing and the skill will silently fail to load. Wrap the entire description in single quotes to be safe, or rephrase to avoid the colon pattern.'
128
+ > disable-model-invocation: false
129
+ > user-invocable: true
130
+ > ---
131
+ > ```
132
+ >
133
+ > **YAML description gotcha**: If the description contains `word: value` patterns (colons followed by space), YAML treats them as key-value pairs and the frontmatter parse fails silently. Always wrap description values in single quotes. Avoid embedded double-quotes inside single-quoted strings (use rephrasing instead).
134
+
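The YAML gotcha above can be reproduced with a minimal sketch, assuming Ruby's standard `yaml` library (Psych). This is an illustration and not part of the diffed file; the two frontmatter strings and the helper `parse_description` are hypothetical.

```ruby
require 'yaml'

# Hypothetical frontmatter bodies, for illustration only.
unquoted = 'description: Use when the user wants to: do X'
quoted   = "description: 'Use when the user wants to: do X'"

# Returns the parsed description, or nil when the YAML cannot be parsed.
def parse_description(frontmatter)
  YAML.safe_load(frontmatter)['description']
rescue Psych::SyntaxError
  nil
end

unquoted_result = parse_description(unquoted)
quoted_result   = parse_description(quoted)

puts "unquoted: #{unquoted_result.inspect}"
puts "quoted:   #{quoted_result.inspect}"
```

The unquoted form contains a second `: ` inside a plain scalar, which the parser rejects; wrapping the value in single quotes keeps the colon as literal text.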
135
+ > **After writing SKILL.md — always validate and auto-fix**: Run this immediately after creating or updating any skill file:
136
+ > ```bash
137
+ > ruby SKILL_DIR/scripts/validate_skill_frontmatter.rb /path/to/new-skill/SKILL.md
138
+ > ```
139
+ > The script validates the YAML frontmatter and auto-fixes common issues (unquoted descriptions, multi-line block scalars with colons). If it prints `OK:` — you're done. If it prints `Auto-fixed and saved` — it repaired the file automatically. If it prints `ERROR` — manual fix required.
140
+
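For illustration (not part of the diffed file), the three validator outcomes named above could be told apart as sketched below. Only the markers `OK:`, `Auto-fixed and saved`, and `ERROR` come from the text; the helper name `classify_validator_output` and the sample messages are assumptions, and the validator's real output may differ in detail.

```ruby
# Hypothetical helper: maps validator stdout to a symbol using the markers quoted above.
def classify_validator_output(output)
  return :error      if output.include?('ERROR')
  return :auto_fixed if output.include?('Auto-fixed and saved')
  return :ok         if output.match?(/^OK:/)

  :unknown
end

# Sample messages are made up for the demo.
puts classify_validator_output('OK: frontmatter is valid')
puts classify_validator_output('Auto-fixed and saved: quoted description')
puts classify_validator_output('ERROR: could not parse frontmatter')
```

Checking for `ERROR` first means a mixed message is treated as needing a manual fix.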
141
+ ### Skill Writing Guide
142
+
143
+ #### Anatomy of a Skill
144
+
145
+ Skills are created at `~/.octo/skills/<skill-name>/`:
146
+
147
+ ```
148
+ ~/.octo/skills/skill-name/
149
+ ├── SKILL.md (required)
150
+ │ ├── YAML frontmatter (name, description required)
151
+ │ └── Markdown instructions
152
+ └── Bundled Resources (optional)
153
+ ├── scripts/ - Executable code (prefer .rb Ruby scripts)
154
+ ├── references/ - Docs loaded into context as needed
155
+ └── assets/ - Files used in output (templates, icons, fonts)
156
+ ```
157
+
158
+ #### Progressive Disclosure
159
+
160
+ Skills use a three-level loading system:
161
+ 1. **Metadata** (name + description) — Always in context (~100 words)
162
+ 2. **SKILL.md body** — In context whenever skill triggers (<500 lines ideal)
163
+ 3. **Bundled resources** — Loaded as needed (unlimited)
164
+
165
+ **Key patterns:**
166
+ - Keep SKILL.md under 500 lines; if approaching the limit, extract content into `references/` files and add clear pointers
167
+ - Reference files from SKILL.md with guidance on when to read them
168
+ - For large reference files (>300 lines), include a table of contents
169
+
170
+ **Domain organization** — When a skill supports multiple frameworks/domains, organize by variant:
171
+ ```
172
+ my-skill/
173
+ ├── SKILL.md (workflow + which reference to load)
174
+ └── references/
175
+ ├── rails.md
176
+ ├── django.md
177
+ └── express.md
178
+ ```
179
+
180
+ #### Bundled Scripts (Ruby preferred)
181
+
182
+ When a skill needs to execute code — API calls, file processing, data transforms — bundle a Ruby script instead of writing inline shell commands. This is cleaner, reusable, and more maintainable.
183
+
184
+ **Ruby script template:**
185
+ ```ruby
186
+ #!/usr/bin/env ruby
187
+ # skill-name/scripts/do_something.rb
188
+ # Usage: ruby path/to/do_something.rb [args]
189
+
190
+ require 'net/http'
191
+ require 'json'
192
+ require 'fileutils'
193
+
194
+ # Read args
195
+ input = ARGV[0]
196
+ if input.nil? || input.strip.empty?
197
+ warn "Usage: ruby do_something.rb <input>"
198
+ exit 1
199
+ end
200
+
201
+ # ... logic ...
202
+
203
+ puts result # stdout is the output
204
+ ```
205
+
206
+ Invoke from SKILL.md by referencing the script via the Supporting Files block — at runtime, the AI receives the full absolute path of every supporting file. Refer to it as `SKILL_DIR` in instructions so the AI substitutes the correct path from the Supporting Files list:
207
+
208
+ ```bash
209
+ ruby "SKILL_DIR/scripts/do_something.rb" "argument"
210
+ ```
211
+
212
+ Never hardcode paths like `~/.octo/skills/my-skill/scripts/...` — they break when the skill is installed at a different location. Never use `find` to locate scripts — the Supporting Files block always provides the correct absolute paths.
213
+
214
+ Ruby standard library covers most needs (`net/http`, `json`, `fileutils`, `uri`, `time`). No gems needed for basic API calls.
215
+
216
+ #### Principle of Least Surprise
217
+
218
+ Skills must not contain malware, exploit code, or anything that could compromise security. A skill's contents should not surprise the user if described. Don't create misleading skills or skills designed for unauthorized access or data exfiltration.
219
+
220
+ #### Writing Patterns
221
+
222
+ Use the imperative form in instructions.
223
+
224
+ **Defining output formats:**
225
+ ```markdown
226
+ ## Report structure
227
+ Use this exact template:
228
+ # [Title]
229
+ ## Executive summary
230
+ ## Key findings
231
+ ## Recommendations
232
+ ```
233
+
234
+ **Examples pattern:**
235
+ ```markdown
236
+ ## Commit message format
237
+ **Example 1:**
238
+ Input: Added user authentication with JWT tokens
239
+ Output: feat(auth): implement JWT-based authentication
240
+ ```
241
+
242
+ ### Writing Style
243
+
244
+ Explain *why* things are important rather than just issuing commands. Use theory of mind — make the skill general, not over-fitted to specific examples. Write a draft, then look at it with fresh eyes and improve it. If you find yourself writing ALWAYS or NEVER in all caps, that's a yellow flag — try to reframe as an explanation of why, so the agent understands the reasoning rather than just following a rule.
245
+
246
+ ### Test Cases
247
+
248
+ After writing the skill draft, come up with 2–3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user for review, then run them.
249
+
250
+ Save test cases to `evals/evals.json`:
251
+
252
+ ```json
+ {
+   "skill_name": "example-skill",
+   "evals": [
+     {
+       "id": 1,
+       "prompt": "User's task prompt",
+       "expected_output": "Description of expected result",
+       "files": []
+     }
+   ]
+ }
+ ```
265
+
266
+ Don't write assertions yet — just the prompts. Add assertions in the next step.
267
+
268
+ See `references/schemas.md` for the full schema.
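
Before running anything, it can help to sanity-check the file. The sketch below checks only the fields shown in the example above; `references/schemas.md` may define more, and the `eval_problems` helper is a hypothetical name, not part of the skill-creator scripts.

```ruby
require "json"

# Fields taken from the example above; the full schema may require more.
REQUIRED_EVAL_KEYS = %w[id prompt expected_output files].freeze

# Return a list of human-readable problems; an empty list means every
# field from the example is present.
def eval_problems(doc)
  problems = []
  problems << "missing skill_name" unless doc["skill_name"].is_a?(String)
  evals = doc["evals"]
  return problems << "evals must be an array" unless evals.is_a?(Array)

  evals.each_with_index do |entry, index|
    missing = REQUIRED_EVAL_KEYS - entry.keys
    problems << "evals[#{index}] missing: #{missing.join(', ')}" unless missing.empty?
  end
  problems
end

sample = JSON.parse(<<~JSON)
  {"skill_name": "example-skill",
   "evals": [{"id": 1, "prompt": "User's task prompt",
              "expected_output": "Description of expected result", "files": []}]}
JSON
puts eval_problems(sample).empty? ? "evals.json looks well-formed" : eval_problems(sample)
```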
269
+
270
+ ---
271
+
272
+ ## Running and Evaluating Test Cases
273
+
274
+ This is one continuous sequence — don't stop partway through.
275
+
276
+ Since Octo has no subagents, run test cases **serially** in the current session. For each test case, simulate two runs:
277
+
278
+ - **with_skill**: Read the SKILL.md, then follow its instructions to complete the task
279
+ - **without_skill**: Complete the same task using only general knowledge (no skill instructions)
280
+
281
+ Put results in `<skill-name>-workspace/` as a sibling to the skill directory. Organize by iteration (`iteration-1/`, `iteration-2/`, etc.), and within that by test case (use descriptive names like `eval-create-report`, not `eval-0`).
282
+
283
+ ### Step 1: For each test case, create the eval directory and run both variants
284
+
285
+ ```
+ <skill-name>-workspace/
+ └── iteration-1/
+     ├── eval-<descriptive-name>/
+     │   ├── eval_metadata.json
+     │   ├── with_skill/
+     │   │   ├── outputs/         ← files produced
+     │   │   └── grading.json     ← filled in later
+     │   └── without_skill/
+     │       ├── outputs/
+     │       └── grading.json
+     └── benchmark.json           ← filled in after all evals
+ ```
298
+
299
+ Write `eval_metadata.json` for each test case:
300
+ ```json
+ {
+   "eval_id": 1,
+   "eval_name": "descriptive-name",
+   "prompt": "The task prompt",
+   "assertions": []
+ }
+ ```
308
+
309
+ **Running a with_skill eval**: Read the skill's SKILL.md fully, then execute the task as instructed by the skill — create files, run scripts, write outputs to `with_skill/outputs/`.
310
+
311
+ **Running a without_skill eval**: Execute the same task using only general knowledge. Write outputs to `without_skill/outputs/`. This is the baseline.
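
The directory layout and metadata file described in this step could be scaffolded with a few lines of Ruby. This is a sketch under assumptions: `scaffold_eval` is a hypothetical helper, and the demo writes into a temporary directory rather than a real workspace.

```ruby
require "fileutils"
require "json"
require "tmpdir"

# Create the per-eval directory skeleton plus an eval_metadata.json with
# empty assertions. Returns the path of the eval directory.
def scaffold_eval(workspace, iteration, eval_id, eval_name, prompt)
  eval_dir = File.join(workspace, "iteration-#{iteration}", "eval-#{eval_name}")
  %w[with_skill without_skill].each do |variant|
    FileUtils.mkdir_p(File.join(eval_dir, variant, "outputs"))
  end
  metadata = { "eval_id" => eval_id, "eval_name" => eval_name,
               "prompt" => prompt, "assertions" => [] }
  File.write(File.join(eval_dir, "eval_metadata.json"), JSON.pretty_generate(metadata))
  eval_dir
end

# Demo in a throwaway directory so nothing touches a real workspace.
Dir.mktmpdir do |tmp|
  dir = scaffold_eval(File.join(tmp, "my-skill-workspace"), 1, 1, "create-report", "The task prompt")
  puts Dir.glob("**/*", base: dir).sort
end
```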
312
+
313
+ ### Step 2: Draft assertions while running
314
+
315
+ Don't wait until all runs finish — draft quantitative assertions as you go and explain them to the user.
316
+
317
+ Good assertions are **objectively verifiable** and **descriptively named** — someone glancing at the benchmark should immediately understand what each one checks. Subjective skills are better evaluated qualitatively; don't force assertions onto things that need human judgment.
318
+
319
+ Update `eval_metadata.json` with assertions once drafted. Also update `evals/evals.json`.
320
+
321
+ ### Step 3: Grade each run
322
+
323
+ For each run, evaluate assertions against the outputs. Save results to `grading.json` in each run directory.
324
+
325
+ The `grading.json` format (exact field names matter for the viewer):
326
+ ```json
+ {
+   "eval_id": 1,
+   "configuration": "with_skill",
+   "expectations": [
+     {
+       "text": "The script uses absolute paths",
+       "passed": true,
+       "evidence": "Script uses $HOME/... throughout"
+     }
+   ],
+   "pass_count": 1,
+   "total_count": 1,
+   "pass_rate": 1.0
+ }
+ ```
342
+
343
+ For assertions that can be checked programmatically, write and run a Ruby script — it's faster and more reliable than eyeballing:
344
+
345
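+ ```ruby
+ #!/usr/bin/env ruby
+ # Check assertion: output file contains expected content.
+ # The path is relative to the eval directory, so run the script from there.
+ output = File.read("with_skill/outputs/result.md")
+ passed = output.include?("expected phrase")
+ puts passed ? "PASS" : "FAIL"
+ exit(passed ? 0 : 1) # non-zero exit status lets callers detect a failure
+ ```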
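
The summary fields in `grading.json` follow directly from the expectations list, so they can be derived rather than typed by hand. A minimal sketch, assuming a hypothetical `build_grading` helper; rounding to two decimals and reporting `0.0` for an empty list are choices made here, not rules from the schema.

```ruby
require "json"

# Build a grading hash from a list of expectation hashes
# ("text", "passed", "evidence"), deriving the summary counts.
def build_grading(eval_id, configuration, expectations)
  pass_count = expectations.count { |e| e["passed"] }
  total = expectations.size
  {
    "eval_id" => eval_id,
    "configuration" => configuration,
    "expectations" => expectations,
    "pass_count" => pass_count,
    "total_count" => total,
    # Assumption: an eval with no expectations reports 0.0.
    "pass_rate" => total.zero? ? 0.0 : (pass_count.to_f / total).round(2)
  }
end

grading = build_grading(1, "with_skill", [
  { "text" => "The script uses absolute paths", "passed" => true,
    "evidence" => "Script uses $HOME/... throughout" }
])
puts JSON.pretty_generate(grading)
```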
351
+
352
+ ### Step 4: Aggregate into benchmark
353
+
354
+ Create `benchmark.json` in the iteration directory. List `with_skill` before `without_skill` for each eval:
355
+
356
+ ```json
+ {
+   "skill_name": "my-skill",
+   "iteration": 1,
+   "configurations": [
+     {
+       "name": "with_skill",
+       "label": "With skill",
+       "evals": [
+         {"eval_id": 1, "eval_name": "eval-name", "pass_rate": 1.0, "pass_count": 3, "total_count": 3}
+       ],
+       "overall_pass_rate": 1.0,
+       "total_pass": 3,
+       "total_assertions": 3
+     },
+     {
+       "name": "without_skill",
+       "label": "Without skill (baseline)",
+       "evals": [
+         {"eval_id": 1, "eval_name": "eval-name", "pass_rate": 0.33, "pass_count": 1, "total_count": 3}
+       ],
+       "overall_pass_rate": 0.33,
+       "total_pass": 1,
+       "total_assertions": 3
+     }
+   ],
+   "delta": {
+     "pass_rate_improvement": 0.67,
+     "summary": "With skill: 100% | Without skill: 33% | Delta: +67pp"
+   },
+   "analyst_observations": [
+     "..."
+   ]
+ }
+ ```
391
+
392
+ Or run the aggregation script (from the skill-creator directory):
393
+ ```bash
+ python3 -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
+ ```
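
If you build `benchmark.json` by hand, the arithmetic behind the example above can be sketched as follows. This illustrates the calculation only and is not the `aggregate_benchmark` script; `summarize` and `build_delta` are hypothetical names, and pooling assertions across evals (rather than averaging per-eval rates) is an assumption that happens to agree with the single-eval example.

```ruby
# Summarize one configuration from its per-eval results.
def summarize(name, label, evals)
  total_pass = evals.sum { |e| e["pass_count"] }
  total = evals.sum { |e| e["total_count"] }
  {
    "name" => name, "label" => label, "evals" => evals,
    # Assumption: overall rate pools all assertions across evals.
    "overall_pass_rate" => total.zero? ? 0.0 : (total_pass.to_f / total).round(2),
    "total_pass" => total_pass, "total_assertions" => total
  }
end

# Compute the delta block from the two configuration summaries.
def build_delta(with_skill, without_skill)
  improvement = (with_skill["overall_pass_rate"] - without_skill["overall_pass_rate"]).round(2)
  summary = format("With skill: %d%% | Without skill: %d%% | Delta: %+dpp",
                   (with_skill["overall_pass_rate"] * 100).round,
                   (without_skill["overall_pass_rate"] * 100).round,
                   (improvement * 100).round)
  { "pass_rate_improvement" => improvement, "summary" => summary }
end

with = summarize("with_skill", "With skill",
                 [{ "eval_id" => 1, "pass_count" => 3, "total_count" => 3 }])
without = summarize("without_skill", "Without skill (baseline)",
                    [{ "eval_id" => 1, "pass_count" => 1, "total_count" => 3 }])
puts build_delta(with, without)["summary"]
```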
396
+
397
+ ### Step 5: Do an analyst pass
398
+
399
+ Read the benchmark data and surface patterns the aggregate stats might hide. See `agents/analyzer.md` for what to look for — things like assertions that always pass regardless of skill (non-discriminating), high-variance evals, and time/effort tradeoffs.
400
+
401
+ ### Step 6: Generate the eval viewer — ALWAYS DO THIS BEFORE REVISING THE SKILL
402
+
403
+ **Generate the viewer first. Get the outputs in front of the user before making any changes.**
404
+
405
+ ```bash
+ python3 <skill-creator-path>/eval-viewer/generate_review.py \
+   <workspace>/iteration-N \
+   --skill-name "my-skill" \
+   --benchmark <workspace>/iteration-N/benchmark.json \
+   --static /tmp/<skill-name>-review.html
+
+ open /tmp/<skill-name>-review.html
+ ```
414
+
415
+ For iteration 2+, also pass `--previous-workspace <workspace>/iteration-<N-1>`.
416
+
417
+ Tell the user: "I've opened the results in your browser. The 'Outputs' tab lets you click through each test case and leave feedback; the 'Benchmark' tab shows the quantitative comparison. When you're done, come back and let me know."
418
+
419
+ ### What the user sees in the viewer
420
+
421
+ **Outputs tab**: One test case at a time.
422
+ - Prompt, output files (rendered inline where possible)
423
+ - Previous output (iteration 2+, collapsed)
424
+ - Formal grades (collapsed)
425
+ - Feedback textbox (auto-saves)
426
+ - Previous feedback (iteration 2+)
427
+
428
+ **Benchmark tab**: Pass rates, per-eval breakdowns, analyst observations.
429
+
430
+ Navigation: prev/next buttons or arrow keys. "Submit All Reviews" saves to `feedback.json`.
431
+
432
+ ### Step 7: Read the feedback
433
+
434
+ When the user says they're done, read `feedback.json`:
435
+
436
+ ```json
+ {
+   "reviews": [
+     {"run_id": "eval-0-with_skill", "feedback": "missing axis labels on chart", "timestamp": "..."},
+     {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
+   ],
+   "status": "complete"
+ }
+ ```
445
+
446
+ Empty feedback means the user was happy with that test case. Focus on cases with specific complaints.
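
Picking out the reviews that need attention is a simple filter over `feedback.json`. A minimal sketch, assuming the structure shown above; `actionable_reviews` is a hypothetical helper, and treating whitespace-only feedback as empty is a choice made here.

```ruby
require "json"

# Return only the reviews whose feedback is non-blank.
def actionable_reviews(feedback_doc)
  feedback_doc.fetch("reviews", []).reject { |r| r["feedback"].to_s.strip.empty? }
end

doc = JSON.parse(<<~JSON)
  {"reviews": [
     {"run_id": "eval-0-with_skill", "feedback": "missing axis labels on chart", "timestamp": "..."},
     {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}],
   "status": "complete"}
JSON
actionable_reviews(doc).each { |r| puts "#{r['run_id']}: #{r['feedback']}" }
```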
447
+
448
+ ---
449
+
450
+ ## Improving the Skill
451
+
452
+ This is the heart of the loop. You've run tests, the user reviewed results — now make the skill better.
453
+
454
+ ### How to think about improvements
455
+
456
+ **Generalize from feedback.** You're iterating on a few examples, but the skill will be used across thousands of different prompts. Avoid overfitting to specific examples. If there's a stubborn issue, try different metaphors or different approaches rather than adding more rigid rules.
457
+
458
+ **Keep it lean.** Remove things that aren't pulling their weight. Read the execution trace, not just the final output — if the skill is making the agent waste time on unproductive steps, cut those parts.
459
+
460
+ **Explain the why.** Try hard to explain *why* each instruction matters. Agents are smart — they perform better when they understand the reasoning rather than following rules blindly. If you find yourself writing ALWAYS or NEVER in all caps, reframe it as an explanation.
461
+
462
+ **Look for repeated work.** If every test case resulted in writing similar helper logic (e.g., an API call setup, a file parser), that's a signal to bundle a reusable Ruby script into `scripts/` and tell the skill to use it.
463
+
464
+ ### The iteration loop
465
+
466
+ 1. Apply improvements to the skill
467
+ 2. Re-run all test cases into a new `iteration-<N+1>/` directory (with_skill and without_skill)
468
+ 3. Generate the viewer with `--previous-workspace` pointing at the previous iteration
469
+ 4. Wait for the user to review and tell you they're done
470
+ 5. Read the new feedback, improve again, repeat
471
+
472
+ Keep going until:
473
+ - The user says they're happy
474
+ - Feedback is all empty
475
+ - You're not making meaningful progress
476
+
477
+ ---
478
+
479
+ ## Advanced: Blind Comparison
480
+
481
+ For more rigorous comparison, read `agents/comparator.md` and `agents/analyzer.md`. Optional — the human review loop is usually sufficient.
482
+
483
+ ---
484
+
485
+ ## Description Optimization
486
+
487
+ The `description` field in SKILL.md frontmatter is the primary triggering mechanism. After creating or improving a skill, offer to optimize it.
488
+
489
+ > **Octo note**: `run_eval.py` and `run_loop.py` have been adapted for Octo. They use `octo agent --json` (NDJSON streaming) to detect `invoke_skill` tool calls targeting temp skills in `~/.octo/skills/`. Queries run **serially** (single agent). `improve_description.py` calls the LLM directly via OpenRouter using `~/.octo/config.yml` credentials.
490
+
491
+ ### Manual description optimization
492
+
493
+ **Step 1: Generate trigger eval queries**
494
+
495
+ Create 20 eval queries — a mix of should-trigger and should-not-trigger. Save as JSON:
496
+
497
+ ```json
+ [
+   {"query": "the user prompt", "should_trigger": true},
+   {"query": "another prompt", "should_trigger": false}
+ ]
+ ```
503
+
504
+ Queries must be realistic — concrete, specific, with enough context that a real user would actually say them. Include file paths, personal context, column names, backstory. Use a mix of lengths and styles (casual, formal, typos, abbreviations). Focus on edge cases.
505
+
506
+ Bad: `"Format this data"`, `"Extract text from PDF"`, `"Create a chart"`
507
+
508
+ Good: `"ok so my boss just sent me this xlsx file (its in downloads, called Q4 sales final FINAL v2.xlsx) and she wants me to add a column showing profit margin. Revenue is column C, costs in column D i think"`
509
+
510
+ **Should-trigger queries (8–10):** Different phrasings of the same intent, some formal and some casual. Include cases where the user doesn't explicitly name the skill but clearly needs it. Also cover uncommon use cases and cases where this skill competes with another but should win.
511
+
512
+ **Should-not-trigger queries (8–10):** Near-misses — queries that share keywords but actually need something different. The negative cases should be genuinely tricky, not obviously irrelevant ("write a fibonacci function" as a negative for a PDF skill is too easy).
513
+
514
+ **Step 2: Review with user**
515
+
516
+ Use the HTML template in `assets/eval_review.html`:
517
+ 1. Read the template
518
+ 2. Replace `__EVAL_DATA_PLACEHOLDER__` with the JSON array, `__SKILL_NAME_PLACEHOLDER__` with the skill name, `__SKILL_DESCRIPTION_PLACEHOLDER__` with the current description
519
+ 3. Write to `/tmp/eval_review_<skill-name>.html` and `open` it
520
+ 4. User edits queries, toggles should-trigger, clicks "Export Eval Set"
521
+ 5. File downloads to `~/Downloads/eval_set.json`
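
The placeholder substitution in step 2 could be done with plain string replacement. This sketch assumes the three tokens can be replaced verbatim and that the eval data belongs in the page as a JSON literal; how `assets/eval_review.html` actually embeds them is not shown here, so check the template before relying on it. `render_review` is a hypothetical helper.

```ruby
require "json"

# Fill the three placeholders in the review template.
# Block-form gsub inserts the replacement literally (no backslash escapes).
def render_review(template, queries, skill_name, description)
  template
    .gsub("__EVAL_DATA_PLACEHOLDER__") { JSON.generate(queries) }
    .gsub("__SKILL_NAME_PLACEHOLDER__") { skill_name }
    .gsub("__SKILL_DESCRIPTION_PLACEHOLDER__") { description }
end

# Stand-in template for demonstration; the real one lives in assets/eval_review.html.
stub = "data=__EVAL_DATA_PLACEHOLDER__ name=__SKILL_NAME_PLACEHOLDER__"
puts render_review(stub, [{ "query" => "the user prompt", "should_trigger" => true }],
                   "my-skill", "current description")
```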
522
+
523
+ **Step 3: Run automated optimization (recommended)**
524
+
525
+ Use the scripts from the skill-creator `scripts/` directory. Run from the skill-creator root:
526
+
527
+ ```bash
+ # Single eval run: check current description pass rate
+ python3 -m scripts.run_eval \
+   --eval-set ~/Downloads/eval_set.json \
+   --skill-path ~/.octo/skills/my-skill \
+   --verbose
+
+ # Full optimize loop: auto-improves description over N iterations
+ python3 -m scripts.run_loop \
+   --eval-set ~/Downloads/eval_set.json \
+   --skill-path ~/.octo/skills/my-skill \
+   --max-iterations 5 \
+   --runs-per-query 1 \
+   --verbose
+ # Outputs: best description + HTML report (auto-opens in browser)
+ ```
543
+
544
+ Notes:
545
+ - **No `--num-workers`** needed (or it's ignored) — Octo runs queries serially
546
+ - **No `--model`** needed — uses the model from `~/.octo/config.yml` automatically
547
+ - Temp skills are written to `~/.octo/skills/` and cleaned up after each query
548
+ - Each query spawns a fresh `octo agent --json` process to avoid session contamination
549
+
550
+ **Step 3 (manual fallback)**
551
+
552
+ If the scripts fail, iterate manually: for each query in the eval set, judge whether the description would trigger the skill. Tally passes and fails, then write an improved description that targets the failures. Repeat 2–3 times.
553
+
554
+ Focus on:
555
+ - Failing should-trigger queries → description is too narrow; broaden the trigger language
556
+ - Failing should-not-trigger queries → description is too broad; tighten specificity
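
The manual tally can be kept honest with a small script. In this sketch each result carries a hypothetical `triggered` field holding your judgment for that query; it is not part of the eval set format above. Failures are bucketed the same way as the two bullets: missed triggers point to a description that is too narrow, false triggers to one that is too broad.

```ruby
# Compare expected vs. judged triggering and bucket the failures.
def tally(results)
  missed = results.select { |r| r["should_trigger"] && !r["triggered"] }
  false_hits = results.select { |r| !r["should_trigger"] && r["triggered"] }
  {
    "passed" => results.size - missed.size - false_hits.size,
    "too_narrow" => missed.map { |r| r["query"] },   # should have triggered
    "too_broad" => false_hits.map { |r| r["query"] } # should not have triggered
  }
end

results = [
  { "query" => "the user prompt", "should_trigger" => true, "triggered" => false },
  { "query" => "another prompt", "should_trigger" => false, "triggered" => false }
]
puts tally(results)
```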
557
+
558
+ **Step 4: Apply the result**
559
+
560
+ Update the skill's SKILL.md frontmatter with the improved description. Show the user before/after.
561
+
562
+ ### How skill triggering works
563
+
564
+ Skills appear in Octo's `available_skills` list. The agent consults a skill based on the description match — but only for tasks it can't handle alone. Simple, one-step queries often won't trigger even with a good description. Make eval queries substantive enough that the skill genuinely helps.
565
+
566
+ ---
567
+
568
+ ## Packaging
569
+
570
+ New skills are created directly in `~/.octo/skills/<skill-name>/` — no packaging step needed. The skill is immediately available in all sessions and the Web UI.
571
+
572
+ If distributing externally, you can package it:
573
+
574
+ ```bash
+ python3 -m scripts.package_skill <path/to/skill-folder>
+ ```
577
+
578
+ This creates a `.skill` file. Direct the user to the resulting file path.
579
+
580
+ ---
581
+
582
+ ## Reference files
583
+
584
+ - `agents/grader.md` — How to evaluate assertions against outputs
585
+ - `agents/comparator.md` — How to do blind A/B comparison between two outputs
586
+ - `agents/analyzer.md` — How to analyze why one version beat another
587
+ - `references/schemas.md` — JSON structures for evals.json, grading.json, benchmark.json
588
+
589
+ ---
590
+
591
+ ## The core loop (summary)
592
+
593
+ 1. Understand what the skill should do
594
+ 2. Draft or edit the SKILL.md
595
+ 3. Run test prompts — with and without the skill — and save outputs
596
+ 4. Grade assertions, aggregate benchmark
+ 5. **Generate the eval viewer with `generate_review.py`** so the user can review
598
+ 6. Get user feedback, improve the skill
599
+ 7. Repeat until satisfied
600
+ 8. Package and deliver
601
+
602
+ Add these steps to your todo list. Specifically: **always generate the eval viewer before revising the skill** — the user's feedback is the primary signal, not your own judgment of the outputs.