@glrs-dev/cli 0.0.1 → 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (173)
  1. package/CHANGELOG.md +50 -0
  2. package/README.md +14 -15
  3. package/dist/chunk-3RG5ZIWI.js +10 -0
  4. package/dist/chunk-6RHN2EDH.js +93 -0
  5. package/dist/chunk-DEODG2LC.js +55 -0
  6. package/dist/chunk-FSAGM22T.js +17 -0
  7. package/dist/chunk-GQBZREK5.js +136 -0
  8. package/dist/chunk-HWMRY35D.js +139 -0
  9. package/dist/chunk-LMRDQ4GW.js +129 -0
  10. package/dist/chunk-NLPX2KOF.js +149 -0
  11. package/dist/chunk-P7PRH4I3.js +177 -0
  12. package/dist/chunk-VCN7RNLU.js +60 -0
  13. package/dist/chunk-VJFNIKQJ.js +120 -0
  14. package/dist/chunk-W37UX3U2.js +35 -0
  15. package/dist/chunk-YBCA3IP6.js +25 -0
  16. package/dist/chunk-YGNDPKIW.js +99 -0
  17. package/dist/cli.d.ts +1 -1
  18. package/dist/cli.js +89 -36
  19. package/dist/commands/cleanup.d.ts +19 -0
  20. package/dist/commands/cleanup.js +11 -0
  21. package/dist/commands/create.d.ts +17 -0
  22. package/dist/commands/create.js +12 -0
  23. package/dist/commands/delete.d.ts +17 -0
  24. package/dist/commands/delete.js +12 -0
  25. package/dist/commands/go.d.ts +4 -0
  26. package/dist/commands/go.js +11 -0
  27. package/dist/commands/list.d.ts +15 -0
  28. package/dist/commands/list.js +12 -0
  29. package/dist/commands/switch.d.ts +11 -0
  30. package/dist/commands/switch.js +12 -0
  31. package/dist/commands/types.d.ts +10 -0
  32. package/dist/commands/types.js +0 -0
  33. package/dist/index.d.ts +16 -19
  34. package/dist/index.js +4 -1
  35. package/dist/lib/config.d.ts +14 -0
  36. package/dist/lib/config.js +14 -0
  37. package/dist/lib/fmt.d.ts +12 -0
  38. package/dist/lib/fmt.js +25 -0
  39. package/dist/lib/git.d.ts +26 -0
  40. package/dist/lib/git.js +25 -0
  41. package/dist/lib/registry.d.ts +14 -0
  42. package/dist/lib/registry.js +13 -0
  43. package/dist/lib/select.d.ts +21 -0
  44. package/dist/lib/select.js +10 -0
  45. package/dist/lib/worktree.d.ts +35 -0
  46. package/dist/lib/worktree.js +17 -0
  47. package/dist/vendor/harness-opencode/dist/agents/prompts/agents-md-writer.md +89 -0
  48. package/dist/vendor/harness-opencode/dist/agents/prompts/architecture-advisor.md +46 -0
  49. package/dist/vendor/harness-opencode/dist/agents/prompts/build.md +93 -0
  50. package/dist/vendor/harness-opencode/dist/agents/prompts/code-searcher.md +54 -0
  51. package/dist/vendor/harness-opencode/dist/agents/prompts/docs-maintainer.md +128 -0
  52. package/dist/vendor/harness-opencode/dist/agents/prompts/gap-analyzer.md +44 -0
  53. package/dist/vendor/harness-opencode/dist/agents/prompts/lib-reader.md +39 -0
  54. package/dist/vendor/harness-opencode/dist/agents/prompts/pilot-builder.md +107 -0
  55. package/dist/vendor/harness-opencode/dist/agents/prompts/pilot-planner.md +153 -0
  56. package/dist/vendor/harness-opencode/dist/agents/prompts/plan-reviewer.md +49 -0
  57. package/dist/vendor/harness-opencode/dist/agents/prompts/plan.md +144 -0
  58. package/dist/vendor/harness-opencode/dist/agents/prompts/prime.md +374 -0
  59. package/dist/vendor/harness-opencode/dist/agents/prompts/qa-reviewer.md +68 -0
  60. package/dist/vendor/harness-opencode/dist/agents/prompts/qa-thorough.md +63 -0
  61. package/dist/vendor/harness-opencode/dist/agents/prompts/research.md +138 -0
  62. package/dist/vendor/harness-opencode/dist/agents/shared/index.ts +26 -0
  63. package/dist/vendor/harness-opencode/dist/agents/shared/workflow-mechanics.md +32 -0
  64. package/dist/vendor/harness-opencode/dist/bin/memory-mcp-launcher.sh +145 -0
  65. package/dist/vendor/harness-opencode/dist/bin/plan-check.sh +255 -0
  66. package/dist/vendor/harness-opencode/dist/chunk-VJUETC6A.js +205 -0
  67. package/dist/vendor/harness-opencode/dist/chunk-VVMP6QWS.js +731 -0
  68. package/dist/vendor/harness-opencode/dist/chunk-XCZ3NOXR.js +703 -0
  69. package/dist/vendor/harness-opencode/dist/cli.d.ts +1 -0
  70. package/dist/vendor/harness-opencode/dist/cli.js +5096 -0
  71. package/dist/vendor/harness-opencode/dist/commands/prompts/autopilot.md +96 -0
  72. package/dist/vendor/harness-opencode/dist/commands/prompts/costs.md +94 -0
  73. package/dist/vendor/harness-opencode/dist/commands/prompts/fresh.md +382 -0
  74. package/dist/vendor/harness-opencode/dist/commands/prompts/init-deep.md +196 -0
  75. package/dist/vendor/harness-opencode/dist/commands/prompts/research.md +27 -0
  76. package/dist/vendor/harness-opencode/dist/commands/prompts/review.md +96 -0
  77. package/dist/vendor/harness-opencode/dist/commands/prompts/ship.md +104 -0
  78. package/dist/vendor/harness-opencode/dist/index.d.ts +21 -0
  79. package/dist/vendor/harness-opencode/dist/index.js +2092 -0
  80. package/dist/vendor/harness-opencode/dist/install-4EYR56OR.js +9 -0
  81. package/dist/vendor/harness-opencode/dist/skills/agent-estimation/SKILL.md +159 -0
  82. package/dist/vendor/harness-opencode/dist/skills/paths.ts +18 -0
  83. package/dist/vendor/harness-opencode/dist/skills/pilot-planning/SKILL.md +49 -0
  84. package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/dag-shape.md +47 -0
  85. package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/decomposition.md +36 -0
  86. package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/first-principles.md +29 -0
  87. package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/milestones.md +57 -0
  88. package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/self-review.md +46 -0
  89. package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/task-context.md +47 -0
  90. package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/touches-scope.md +47 -0
  91. package/dist/vendor/harness-opencode/dist/skills/pilot-planning/rules/verify-design.md +53 -0
  92. package/dist/vendor/harness-opencode/dist/skills/research/SKILL.md +350 -0
  93. package/dist/vendor/harness-opencode/dist/skills/research-auto/SKILL.md +283 -0
  94. package/dist/vendor/harness-opencode/dist/skills/research-local/SKILL.md +268 -0
  95. package/dist/vendor/harness-opencode/dist/skills/research-web/SKILL.md +119 -0
  96. package/dist/vendor/harness-opencode/dist/skills/review-plan/SKILL.md +32 -0
  97. package/dist/vendor/harness-opencode/dist/skills/vercel-composition-patterns/AGENTS.md +946 -0
  98. package/dist/vendor/harness-opencode/dist/skills/vercel-composition-patterns/README.md +60 -0
  99. package/dist/vendor/harness-opencode/dist/skills/vercel-composition-patterns/SKILL.md +89 -0
  100. package/dist/vendor/harness-opencode/dist/skills/vercel-composition-patterns/rules/architecture-avoid-boolean-props.md +100 -0
  101. package/dist/vendor/harness-opencode/dist/skills/vercel-composition-patterns/rules/architecture-compound-components.md +112 -0
  102. package/dist/vendor/harness-opencode/dist/skills/vercel-composition-patterns/rules/patterns-children-over-render-props.md +87 -0
  103. package/dist/vendor/harness-opencode/dist/skills/vercel-composition-patterns/rules/patterns-explicit-variants.md +100 -0
  104. package/dist/vendor/harness-opencode/dist/skills/vercel-composition-patterns/rules/react19-no-forwardref.md +42 -0
  105. package/dist/vendor/harness-opencode/dist/skills/vercel-composition-patterns/rules/state-context-interface.md +191 -0
  106. package/dist/vendor/harness-opencode/dist/skills/vercel-composition-patterns/rules/state-decouple-implementation.md +113 -0
  107. package/dist/vendor/harness-opencode/dist/skills/vercel-composition-patterns/rules/state-lift-state.md +125 -0
  108. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/AGENTS.md +2975 -0
  109. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/README.md +123 -0
  110. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/SKILL.md +137 -0
  111. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/advanced-event-handler-refs.md +55 -0
  112. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/advanced-init-once.md +42 -0
  113. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/advanced-use-latest.md +39 -0
  114. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/async-api-routes.md +38 -0
  115. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/async-defer-await.md +80 -0
  116. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/async-dependencies.md +51 -0
  117. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/async-parallel.md +28 -0
  118. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/async-suspense-boundaries.md +99 -0
  119. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/bundle-barrel-imports.md +59 -0
  120. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/bundle-conditional.md +31 -0
  121. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/bundle-defer-third-party.md +49 -0
  122. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/bundle-dynamic-imports.md +35 -0
  123. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/bundle-preload.md +50 -0
  124. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/client-event-listeners.md +74 -0
  125. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/client-localstorage-schema.md +71 -0
  126. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/client-passive-event-listeners.md +48 -0
  127. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/client-swr-dedup.md +56 -0
  128. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/js-batch-dom-css.md +107 -0
  129. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/js-cache-function-results.md +80 -0
  130. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/js-cache-property-access.md +28 -0
  131. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/js-cache-storage.md +70 -0
  132. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/js-combine-iterations.md +32 -0
  133. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/js-early-exit.md +50 -0
  134. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/js-hoist-regexp.md +45 -0
  135. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/js-index-maps.md +37 -0
  136. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/js-length-check-first.md +49 -0
  137. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/js-min-max-loop.md +82 -0
  138. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/js-set-map-lookups.md +24 -0
  139. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/js-tosorted-immutable.md +57 -0
  140. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rendering-activity.md +26 -0
  141. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rendering-animate-svg-wrapper.md +47 -0
  142. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rendering-conditional-render.md +40 -0
  143. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rendering-content-visibility.md +38 -0
  144. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rendering-hoist-jsx.md +46 -0
  145. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rendering-hydration-no-flicker.md +82 -0
  146. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rendering-hydration-suppress-warning.md +30 -0
  147. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rendering-svg-precision.md +28 -0
  148. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rendering-usetransition-loading.md +75 -0
  149. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rerender-defer-reads.md +39 -0
  150. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rerender-dependencies.md +45 -0
  151. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rerender-derived-state-no-effect.md +40 -0
  152. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rerender-derived-state.md +29 -0
  153. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rerender-functional-setstate.md +74 -0
  154. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rerender-lazy-state-init.md +58 -0
  155. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rerender-memo-with-default-value.md +38 -0
  156. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rerender-memo.md +44 -0
  157. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rerender-move-effect-to-event.md +45 -0
  158. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rerender-simple-expression-in-memo.md +35 -0
  159. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rerender-transitions.md +40 -0
  160. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/rerender-use-ref-transient-values.md +73 -0
  161. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/server-after-nonblocking.md +73 -0
  162. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/server-auth-actions.md +96 -0
  163. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/server-cache-lru.md +41 -0
  164. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/server-cache-react.md +76 -0
  165. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/server-dedup-props.md +65 -0
  166. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/server-hoist-static-io.md +142 -0
  167. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/server-parallel-fetching.md +83 -0
  168. package/dist/vendor/harness-opencode/dist/skills/vercel-react-best-practices/rules/server-serialization.md +38 -0
  169. package/dist/vendor/harness-opencode/dist/skills/web-design-guidelines/SKILL.md +39 -0
  170. package/dist/vendor/harness-opencode/package.json +11 -0
  171. package/package.json +20 -15
  172. package/LICENSE +0 -21
  173. package/dist/chunk-TU23AE2F.js +0 -69
@@ -0,0 +1,350 @@
1
+ ---
2
+ name: research
3
+ description: Research orchestrator — plans workstreams, dispatches parallel research agents (local, web, or auto), reviews results, identifies gaps, and iterates until comprehensive. Use when user says 'research', '/research', 'investigate', 'deep dive', 'explore', 'understand how', 'what do we know about'. Provide the research topic and context.
4
+ ---
5
+
6
+ # /research — Research Orchestrator
7
+
8
+ Multi-round research orchestrator that plans, dispatches, reviews, and iterates across research modes until the result is comprehensive.
9
+
10
+ **Research Query:** $ARGUMENTS
11
+
12
+ ```
13
+ THE IRON LAW: EVERY STEP IS A SUBAGENT. NO EXCEPTIONS.
14
+
15
+ You are an orchestrator. You do NOT research, synthesize, review, or analyze.
16
+ You launch subagents and pass their outputs to other subagents.
17
+ You use the Agent tool for ALL dispatches — never Skill tool.
18
+
19
+ Your jobs:
20
+ 1. Launch a planning subagent
21
+ 2. Dispatch research agents (which read skill files and follow them)
22
+ 3. Launch a review subagent
23
+ 4. If gaps exist, dispatch more research agents
24
+ 5. Launch a synthesis subagent
25
+ 6. Present the final result
26
+
27
+ You do NOTHING else. Every cognitive task is a subagent.
28
+ ```
29
+
30
+ ## Execution Model
31
+
32
+ ```
33
+ CRITICAL — WHY AGENT TOOL, NOT SKILL TOOL:
34
+
35
+ Skill() runs the sub-skill in the main conversation. It dumps the full skill
36
+ instructions + all output into your context window. By the time you run a second
37
+ skill, you're out of context and the pipeline stalls.
38
+
39
+ Agent tool runs in a subprocess with its own context window. Each research agent
40
+ gets a fresh context, reads its skill file from the bundled skills, and reports back.
41
+ The main conversation stays clean for orchestration.
42
+
43
+ DISPATCH TEMPLATE:
44
+ Agent tool with prompt:
45
+ "You are a research agent.
46
+
47
+ ## Research Query
48
+ {the full query or sub-question}
49
+
50
+ ## Task
51
+ 1. Read the bundled {skill-name} skill via the Skill tool
52
+ 2. {any additional context or constraints}
53
+ 3. Report back with your complete findings"
54
+
55
+ PARALLEL: Independent workstreams → ALL Agent calls in ONE message
56
+ SEQUENTIAL: Dependent workstreams → wait for prior agents
57
+ ```
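As a rough illustration of the dispatch template above, the prompt for one research agent could be assembled like this. This is only a sketch: `DISPATCH_TEMPLATE` and `build_dispatch_prompt` are hypothetical names, not part of the package, and the Agent tool call itself is not modeled.

```python
# Hypothetical helper: fills the dispatch template from the skill file.
# Placeholder names (query, skill_name, extra_context) are assumptions.
DISPATCH_TEMPLATE = """You are a research agent.

## Research Query
{query}

## Task
1. Read the bundled {skill_name} skill via the Skill tool
2. {extra_context}
3. Report back with your complete findings"""


def build_dispatch_prompt(query: str, skill_name: str, extra_context: str) -> str:
    """Return the prompt text for a single research agent dispatch."""
    return DISPATCH_TEMPLATE.format(
        query=query,
        skill_name=skill_name,
        extra_context=extra_context,
    )


if __name__ == "__main__":
    print(build_dispatch_prompt(
        "How does auth work?", "research-local", "Focus on the login flow"
    ))
```

One prompt would be built per workstream, and all prompts for independent workstreams would go out in a single message.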
58
+
59
+ If no query is provided as $ARGUMENTS, ask the user what they want to research.
60
+
61
+ ## Phase 1: Plan — Subagent
62
+
63
+ Launch a **general-purpose subagent** to plan the research:
64
+
65
+ ```
66
+ PROMPT:
67
+ "You are a research planner. Given a research query, decompose it into workstreams
68
+ and classify each by research type.
69
+
70
+ Research Query: [QUERY]
71
+
72
+ For each workstream, provide:
73
+ 1. A specific sub-question to answer
74
+ 2. Classification: LOCAL, WEB, or AUTO
75
+ 3. Why this classification (one sentence)
76
+ 4. Dependencies: which other workstreams must complete first (if any)
77
+
78
+ Classification rules:
79
+ - LOCAL: Questions answerable from THIS codebase — architecture, data flow,
80
+ patterns, implementations, file structure, dependencies within the repo
81
+ - WEB: Questions requiring external knowledge — best practices, competitor
82
+ analysis, market research, industry standards, technology comparisons,
83
+ regulatory info, documentation for external tools/libraries
84
+ - AUTO: ONLY when the query explicitly asks for experimentation with measurable
85
+ outcomes and iterative trials. This is RARE. Most queries are LOCAL or WEB.
86
+
87
+ A single query often needs BOTH local and web workstreams. For example:
88
+ 'How should we redesign our auth system?' needs LOCAL (how does it work now?)
89
+ AND WEB (what are current best practices?).
90
+
91
+ Output 3-6 workstreams. Prefer a mix of LOCAL and WEB when the query benefits.
92
+ Mark dependencies explicitly — independent workstreams will run in parallel."
93
+ ```
94
+
95
+ ## Phase 2: Execute Round 1 — Parallel Agent Dispatches
96
+
97
+ From the plan, dispatch **one Agent per workstream**. Launch ALL independent workstreams in a SINGLE message.
98
+
99
+ ```
100
+ FOR EACH LOCAL WORKSTREAM:
101
+ Agent tool:
102
+ "You are a codebase research agent.
103
+
104
+ ## Research Query
105
+ {original query}
106
+
107
+ ## Your Workstream
108
+ {sub-question from plan}
109
+
110
+ ## Task
111
+ 1. Read the bundled research-local skill via the Skill tool and follow every instruction
112
+ 2. Focus specifically on: {sub-question}
113
+ 3. Report back with your complete findings including all file:line references"
114
+
115
+ FOR EACH WEB WORKSTREAM:
116
+ Agent tool:
117
+ "You are a web research agent.
118
+
119
+ ## Research Query
120
+ {original query}
121
+
122
+ ## Your Workstream
123
+ {sub-question from plan}
124
+
125
+ ## Task
126
+ 1. Read the bundled research-web skill via the Skill tool and follow every instruction
127
+ 2. Focus specifically on: {sub-question}
128
+ 3. Write your output to research/{slug}/{workstream-name}.md
129
+ 4. Report back with your complete findings including source URLs"
130
+
131
+ FOR EACH AUTO WORKSTREAM (rare):
132
+ Agent tool:
133
+ "You are an experimentation agent.
134
+
135
+ ## Research Query
136
+ {original query}
137
+
138
+ ## Your Workstream
139
+ {sub-question from plan}
140
+
141
+ ## Task
142
+ 1. Read the bundled research-auto skill via the Skill tool and follow every instruction
143
+ 2. Focus specifically on: {sub-question}
144
+ 3. Report back with your findings and experiment results"
145
+
146
+ CRITICAL: ALL independent workstreams in ONE message.
147
+ Wait for ALL agents to complete before proceeding.
148
+ If workstream B depends on A, launch A first, wait, then launch B.
149
+ ```
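The parallel/sequential rule above amounts to grouping workstreams into dependency levels: everything whose dependencies are satisfied goes out in one message, then the next level follows. A minimal sketch, assuming the planner's output has been parsed into a hypothetical `Workstream` record (the skill file does not define such a structure):

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Workstream:
    # Hypothetical record for one planned workstream.
    name: str
    kind: str  # "LOCAL", "WEB", or "AUTO"
    depends_on: list[str] = field(default_factory=list)


def plan_batches(workstreams: list[Workstream]) -> list[list[str]]:
    """Group workstream names into batches; each batch is one parallel dispatch."""
    done: set[str] = set()
    pending = {w.name: w for w in workstreams}
    batches: list[list[str]] = []
    while pending:
        # A workstream is ready once every dependency has completed.
        ready = sorted(
            name for name, w in pending.items()
            if all(dep in done for dep in w.depends_on)
        )
        if not ready:
            raise ValueError(
                "cyclic or unknown dependency among: " + ", ".join(sorted(pending))
            )
        batches.append(ready)
        done.update(ready)
        for name in ready:
            del pending[name]
    return batches
```

With two independent workstreams and a third that depends on both, this yields two batches: the independent pair first, then the dependent one.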
150
+
151
+ ## Phase 3: Review Round 1 — Subagent
152
+
153
+ Launch a **general-purpose subagent** to review all findings:
154
+
155
+ ```
156
+ PROMPT:
157
+ "You are a research reviewer. Given a research query and findings from multiple
158
+ parallel research agents, assess completeness and identify gaps.
159
+
160
+ Original Query: [QUERY]
161
+
162
+ Workstream Plan:
163
+ [PLAN FROM PHASE 1]
164
+
165
+ Findings from all agents:
166
+ [ALL RESULTS FROM PHASE 2]
167
+
168
+ Evaluate:
169
+ 1. **Coverage** — Is every workstream adequately answered?
170
+ 2. **Depth** — Are answers backed by specific evidence (file:line for local, URLs for web)?
171
+ 3. **Connections** — Are cross-cutting themes identified? Do local and web findings
172
+ inform each other?
173
+ 4. **Contradictions** — Do any findings conflict? If so, which is more credible?
174
+ 5. **Blind spots** — What aspects of the query weren't addressed by ANY workstream?
175
+
176
+ For each gap found, provide:
177
+ - What is missing
178
+ - Classification: LOCAL or WEB (where to look for the answer)
179
+ - Specific search guidance (file patterns, search terms, URLs to check)
180
+ - Why it matters for answering the original query
181
+
182
+ Output:
183
+ - VERDICT: COMPREHENSIVE or GAPS FOUND
184
+ - If GAPS FOUND: list each gap as a numbered, dispatchable research question with classification
185
+ - CROSS-CUTTING INSIGHTS: themes that span multiple workstreams (these feed synthesis)"
186
+ ```
187
+
188
+ ## Phase 4: Execute Round 2 — Fill Gaps (If Needed)
189
+
190
+ If the review found gaps, dispatch **Agent tool calls for each gap** — ALL in ONE message.
191
+
192
+ Use the same dispatch templates from Phase 2, but with the gap question as the workstream and any relevant prior findings included as context.
193
+
194
+ ```
195
+ PROMPT ADDITION FOR GAP-FILLING AGENTS:
196
+ "## Context from Prior Research
197
+ {relevant findings from Round 1 that border this gap}
198
+
199
+ Focus on what the prior research missed. Don't repeat known findings."
200
+ ```
201
+
202
+ If the Phase 3 verdict is COMPREHENSIVE, skip to Phase 6.
203
+
204
+ ## Phase 5: Review Round 2 — Subagent (If Phase 4 Ran)
205
+
206
+ Launch another review subagent with the SAME prompt template as Phase 3, but now including Round 1 + Round 2 findings.
207
+
208
+ ```
209
+ ITERATION RULES:
210
+ - Maximum 3 total rounds of execute → review (Round 1 is mandatory)
211
+ - If still GAPS FOUND after Round 3, proceed to synthesis anyway — note remaining
212
+ gaps as open questions
213
+ - Each round should have FEWER gaps than the previous. If gap count increases,
214
+ stop iterating and proceed to synthesis.
215
+ ```
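The iteration rules above can be read as a small stopping condition. The sketch below is one interpretation, not the skill's implementation: `should_run_another_round` is a hypothetical name, a verdict of COMPREHENSIVE is represented as zero gaps, and an unchanged gap count is treated as "keep going" because the rules only say to stop when the count increases.

```python
MAX_ROUNDS = 3  # "Maximum 3 total rounds of execute -> review"


def should_run_another_round(gap_counts: list) -> bool:
    """Decide whether to dispatch another execute -> review round.

    gap_counts[i] is the number of gaps the reviewer reported after round i + 1.
    """
    if not gap_counts:
        return True  # Round 1 is mandatory.
    if gap_counts[-1] == 0:
        return False  # Reviewer verdict was COMPREHENSIVE.
    if len(gap_counts) >= MAX_ROUNDS:
        return False  # Round limit reached; remaining gaps become open questions.
    if len(gap_counts) >= 2 and gap_counts[-1] > gap_counts[-2]:
        return False  # Gap count increased; stop and synthesize.
    return True
```

For example, gap counts of `[3, 1]` allow a third round, while `[3, 4]` or `[3, 2, 1]` send the pipeline to synthesis.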
216
+
217
+ ## Phase 6: Synthesize — Subagent
218
+
219
+ Launch a **general-purpose subagent** to produce the final synthesis:
220
+
221
+ ```
222
+ PROMPT:
223
+ "You are a research synthesizer. Combine findings from multiple research agents
224
+ into a single, authoritative document.
225
+
226
+ Original Query: [QUERY]
227
+
228
+ All Research Findings (all rounds):
229
+ [EVERY FINDING FROM ALL ROUNDS]
230
+
231
+ Cross-Cutting Insights from Reviews:
232
+ [INSIGHTS FROM PHASE 3/5 REVIEW AGENTS]
233
+
234
+ Create a comprehensive research report:
235
+
236
+ ## Executive Summary
237
+ 3-5 sentences answering the original query directly.
238
+
239
+ ## Key Findings
240
+ Organized by theme (NOT by workstream). Each finding should:
241
+ - State the conclusion
242
+ - Cite evidence (file:line for local, URL for web)
243
+ - Note confidence level (high/medium/low based on evidence strength)
244
+
245
+ ## Architecture & Code (if local research was involved)
246
+ How the relevant code is structured, with specific file:line references.
247
+
248
+ ## External Context (if web research was involved)
249
+ Best practices, market context, competitor approaches — with source URLs.
250
+
251
+ ## Connections
252
+ How local findings and external findings inform each other. Where the codebase
253
+ aligns with or diverges from industry standards/best practices.
254
+
255
+ ## Recommendations
256
+ Actionable next steps based on the research. Each recommendation should cite
257
+ the finding that supports it.
258
+
259
+ ## Open Questions
260
+ Anything that couldn't be fully resolved. Note whether it needs LOCAL exploration,
261
+ WEB research, or human input.
262
+
263
+ IMPORTANT: Organize by THEME, not by source agent. Merge overlapping findings.
264
+ Resolve contradictions explicitly. Every claim needs a citation."
265
+ ```
266
+
267
+ ## Phase 7: Final Quality Gate — Subagent
268
+
269
+ Launch a **general-purpose subagent** to score the final report:
270
+
271
+ ```
272
+ PROMPT:
273
+ "You are a quality reviewer for research output. Score this report.
274
+
275
+ Original Query: [QUERY]
276
+ Final Report: [OUTPUT FROM PHASE 6]
277
+
278
+ Score 1-5 on each dimension:
279
+ 1. **Answers the question** — Does the report directly address what was asked?
280
+ 2. **Evidence quality** — Are claims backed by specific file:line or URL citations?
281
+ 3. **Depth** — Does it go beyond surface-level to explain mechanisms and tradeoffs?
282
+ 4. **Synthesis** — Are findings from different sources meaningfully connected?
283
+ 5. **Actionability** — Could someone make decisions based on this report?
284
+
285
+ Overall: average of all dimensions.
286
+
287
+ If >= 4.0: 'QUALITY: PASS' with minor suggestions
288
+ If < 4.0: 'QUALITY: NEEDS WORK' with specific deficiencies framed as questions
289
+
290
+ Note: At this stage, the report is presented regardless of score. The score
291
+ helps the user gauge confidence."
292
+ ```
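The scoring rule in the prompt above is a plain average over five dimensions with a 4.0 threshold. A minimal sketch of that arithmetic, where the dimension keys and the function name `quality_verdict` are assumptions chosen for illustration:

```python
# Hypothetical keys for the five dimensions named in the prompt.
DIMENSIONS = (
    "answers_the_question",
    "evidence_quality",
    "depth",
    "synthesis",
    "actionability",
)


def quality_verdict(scores: dict) -> tuple:
    """Return (overall average, verdict string) for a set of 1-5 scores."""
    for dim in DIMENSIONS:
        if dim not in scores:
            raise ValueError(f"missing score for {dim}")
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"score for {dim} must be between 1 and 5")
    overall = sum(scores[dim] for dim in DIMENSIONS) / len(DIMENSIONS)
    verdict = "QUALITY: PASS" if overall >= 4.0 else "QUALITY: NEEDS WORK"
    return overall, verdict
```

Scores of 5, 4, 4, 3, 3 average to 3.8 and yield NEEDS WORK; the report is still presented, with the score shown alongside it.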
293
+
294
+ ## Phase 8: Present
295
+
296
+ Present to the user:
297
+ 1. The full synthesis report from Phase 6
298
+ 2. Quality score from Phase 7
299
+ 3. Research metadata: how many rounds, how many agents dispatched, modes used
300
+ 4. If quality < 4.0, explicitly note which areas are weak and suggest follow-up commands
301
+
302
+ ```
303
+ DO NOT dump raw agent outputs. Only the synthesized report.
304
+ DO NOT ask "which section would you like to explore further?" — present everything.
305
+ DO suggest specific follow-up /research commands if open questions remain.
306
+ ```
307
+
308
+ ## Subagent Summary
309
+
310
+ ```
311
+ MINIMUM SUBAGENTS PER RESEARCH (no gaps found):
312
+ 1 planner + 3-6 research agents + 1 reviewer + 1 synthesizer + 1 quality = 7-10
313
+
314
+ TYPICAL SUBAGENTS (one round of gap filling):
315
+ 1 planner + 4 research + 1 review + 2 gap-fill + 1 review + 1 synth + 1 quality = 11
316
+
317
+ MAXIMUM SUBAGENTS (three rounds):
318
+ 1 planner + 4 research + 1 review + 3 gap-fill + 1 review + 2 gap-fill + 1 review
319
+ + 1 synth + 1 quality = 15
320
+
321
+ Each research agent internally spawns its OWN subagents (research-local spawns
322
+ 8-14 Explore subagents, research-web spawns parallel web research agents).
323
+ Total subagent tree can be 30-50+ agents for a complex query. This is by design.
324
+ ```
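The totals above follow from one formula: a planner, the Round 1 research agents and their review, one extra review per gap-fill round, then a synthesizer and a quality reviewer. A short sketch that reproduces the three figures (the function name is hypothetical, and nested subagents spawned by the research agents are not counted):

```python
def top_level_subagents(research: int, gap_fill_rounds: tuple = ()) -> int:
    """Count subagents launched directly by the orchestrator.

    research: number of Round 1 research agents.
    gap_fill_rounds: gap-fill agents dispatched in each later round.
    """
    planner = 1
    round_one = research + 1  # research agents + their review
    later_rounds = sum(agents + 1 for agents in gap_fill_rounds)  # each round adds a review
    synthesizer = 1
    quality = 1
    return planner + round_one + later_rounds + synthesizer + quality


if __name__ == "__main__":
    print(top_level_subagents(3), top_level_subagents(6))  # minimum range
    print(top_level_subagents(4, (2,)))                    # typical
    print(top_level_subagents(4, (3, 2)))                  # three rounds
```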
325
+
326
+ ## Red Flags — STOP
327
+
328
+ - About to use Skill() to invoke a sub-skill — USE AGENT TOOL
329
+ - About to research something yourself — LAUNCH A SUBAGENT
330
+ - About to synthesize findings yourself — LAUNCH A SYNTHESIS SUBAGENT
331
+ - About to skip the review phase — REVIEW IS MANDATORY AFTER EVERY ROUND
332
+ - About to skip the planning phase — THE PLANNER SUBAGENT DECIDES WORKSTREAMS
333
+ - About to launch research agents one at a time — ONE MESSAGE, ALL INDEPENDENT AGENTS
334
+ - About to present raw agent outputs — SYNTHESIZE FIRST
335
+ - About to decide "no gaps" without a review subagent — THE REVIEWER DECIDES, NOT YOU
336
+ - About to run a 4th round of gap filling — MAX 3 ROUNDS, THEN PRESENT
337
+
338
+ ## Common Rationalizations
339
+
340
+ | Excuse | Reality |
341
+ |--------|---------|
342
+ | "I'll use Skill() — it's simpler" | Skill() eats main context. Agent tool keeps orchestration clean. |
343
+ | "I can plan the workstreams myself" | The planning subagent produces better decompositions with proper classification. |
344
+ | "One round of research is enough" | Review almost always finds gaps. The iterate-review loop is what makes research comprehensive. |
345
+ | "I'll skip the review — the findings look complete" | Your intuition is not a quality gate. The review subagent finds what you miss. |
346
+ | "I'll synthesize from the agent summaries" | A synthesis subagent with all findings produces better-connected, themed output. |
347
+ | "This only needs local research" | The planner subagent decides. Many queries benefit from both local and web context. |
348
+ | "I'll route to AUTO for thoroughness" | AUTO is for experimentation with measurable outcomes. Thoroughness comes from iteration, not AUTO. |
349
+ | "I'll launch agents sequentially to be safe" | Parallel is always faster. All independent workstreams in one message. |
350
+ | "The quality review is overhead" | The quality score tells the user how much to trust the report. 30 seconds well spent. |
@@ -0,0 +1,283 @@
1
+ ---
2
+ name: research-auto
3
+ description: Autonomous experimentation skill — agent interviews the user, sets up a lab, then explores freely (think, test, reflect) until stopped or a target is hit. Works for any domain where you can measure or evaluate a result. Use when user says 'optimize this', 'experiment with', 'find the best approach', 'iterate on', 'research mode'. Do NOT use for binary validation tests (use /spec-lab instead). Based on ResearcherSkill v1.4.4 by krzysztofdudek.
4
+ ---
5
+
6
+ # /research-auto — Autonomous Experimentation Skill
7
+
8
+ <critical>
9
+ Non-negotiable rules — every real experiment, no exceptions:
10
+ 1. Commit before running. Log before resetting. Reset on discard. Each branch holds only keeps.
11
+ 2. Protect `.lab/` — it is the single source of truth.
12
+ 3. Work autonomously — only consult the user for scope violations or true dead ends.
13
+ 4. Follow ALL guardrails (discard streaks, plateau, re-validation). They are mandatory, not suggestions.
14
+ </critical>
15
+
16
+ You are entering **researcher mode**. This skill is for YOU — the main agent. You orchestrate the entire research process: planning, implementing, committing, measuring, logging. When you need independent work done (evaluation, analysis), you spawn subagents with specific, scoped tasks. You control what each subagent knows through the prompt you give it.
17
+
18
+ You have **complete freedom** in how you navigate the problem space. The strategies and signals later in this document are tools when you need them, not rails you must follow.
19
+
20
+ ---
21
+
22
+ ## `.lab/` is Sacred
23
+
24
+ `.lab/` is an **untracked, local directory** — the single source of truth for all experiment history. It survives all git operations because it is in `.gitignore`. Git manages code state. `.lab/` manages experiment knowledge. They are independent.
25
+
26
+ **Structure:**
27
+ - `.lab/config.md`, `results.tsv`, `log.md`, `branches.md`, `parking-lot.md` — experiment metadata
28
+ - `.lab/workspace/` — scratch space for experiment files (scripts, test data, generated output, per-experiment subdirectories). Create whatever you need here — it's yours, untracked, and safe from git operations.
29
+
30
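+ Always protect `.lab/`. When cleaning the repo, use targeted commands that preserve untracked and ignored directories; in particular, avoid `git clean` with `-x` or `-X`, which deletes ignored files such as `.lab/`. When resetting, use `git reset` and `git checkout`, which leave `.lab/` intact.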
31
+
32
+ ---
33
+
34
+ ## Phase 0: Resume Check
35
+
36
+ Check if `.lab/` already exists in the project root.
37
+
38
+ **If it exists:**
39
+ 1. Read `.lab/config.md`, `.lab/results.tsv`, `.lab/branches.md`, and tail of `.lab/log.md`
40
+ 2. Present a summary: objective, metrics, active branches, experiment counts, current best vs baseline, last experiment status
41
+ 3. Ask: **resume or start fresh?**
42
+ - **Resume** → checkout the active branch, pick up from next experiment number, jump to Phase 3
43
+ - **Start fresh** → archive to `.lab.bak.<timestamp>/`, proceed to Phase 1
44
+
45
+ **If it does not exist:** proceed to Phase 1.
46
+
47
+ ---
48
+
49
+ ## Phase 1: Discovery
50
+
51
+ Before any experiment, understand the problem. Ask these questions conversationally — skip what's obvious from context, use the **defaults** shown when the user has no preference:
52
+
53
+ 1. **Objective** — What are we trying to achieve?
54
+ 2. **Metrics** — How do we measure success?
55
+ - **Primary metric** (required): drives keep/discard decisions
56
+ - *Quantitative*: a command that outputs a number
57
+ - *Qualitative*: agent judgment against a rubric (see Qualitative Rubric below). Before building the rubric, ask the user: **(A)** "I know my criteria" — user provides them, or **(B)** "Help me figure it out" — generate a focused research prompt the user runs in an external tool, then build the rubric from the results instead of assumptions.
58
+ - **Secondary metrics** (optional): tracked for context, don't drive decisions unless primary is tied
59
+ - For each: **name**, **measure command** (or "agent judgment"), **direction** (lower/higher is better)
60
+ 3. **Scope** — What files/areas can we modify?
61
+ 4. **Constraints** — What is off-limits?
62
+ 5. **Run command** — How do we execute one experiment? Single command or chain (entire chain must succeed). May be omitted for qualitative-only research.
63
+ 6. **Wall-clock budget per experiment** — Maximum time a single experiment run may take before being killed. Default: **5 minutes**.
64
+ 7. **Termination** — When do we stop? Default: **infinite** (run until user interrupts or target is reached). Do not self-impose experiment limits. If the session ends (context limit, interruption), `.lab/` persists — the next session resumes via Phase 0.
65
+ - *Target value*: stop when primary metric reaches X
66
+ - *Experiment count*: stop after N experiments (only if the user explicitly requests it)
67
+
68
+ Once you have answers, **repeat the configuration back** and get explicit confirmation before proceeding.
69
+
70
+ ---
71
+
72
+ ## Phase 2: Lab Setup
73
+
74
+ After confirmation:
75
+
76
+ 1. **Branch** — Create `research/<slug>` from current HEAD.
77
+ 2. **Lab directory** — Create `.lab/` in the project root.
78
+ 3. **Config file** — Write `.lab/config.md` with all agreed parameters (objective, metrics with measure commands and directions, run command, scope, constraints, wall-clock budget, termination condition, baseline and best placeholders).
79
+ 4. **Results log** — Create `.lab/results.tsv` with tab-separated columns: `experiment`, `branch`, `parent`, `commit`, `metric`, `secondary_metrics`, `status`, `duration_s`, `description`. Status values: `keep`, `discard`, `crash`, `thought`, `keep*`, `interesting`.
80
+ 5. **Iteration log** — Create `.lab/log.md`
81
+ 6. **Parking lot** — Create `.lab/parking-lot.md` for deferred ideas
82
+ 7. **Branch registry** — Create `.lab/branches.md` with columns: Branch, Forked from, Status, Experiments, Best metric, Notes
83
+ 8. **Workspace** — Create `.lab/workspace/` for scratch files (scripts, test data, generated output). Use per-experiment subdirectories (e.g., `.lab/workspace/exp-3/`) when needed.
84
+ 9. **Git ignore** — Add `.lab/` and `run.log` to `.gitignore`.
85
+ 10. **Baseline** — Record experiment #0 with NO changes. For quantitative: run the measure command. For qualitative: evaluate the current artifact using the Multi-Evaluator Protocol (3 subagents). Fill in baseline in config.
86
+ 11. **Start** — Begin autonomous work immediately. No announcements needed.
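A minimal sketch of setup steps 2 to 9 above, not taken from the package. The function name `init_lab` and the starter contents of the markdown files are assumptions made for illustration; the file names, the `results.tsv` columns, and the `.gitignore` entries come from the list above.

```python
from pathlib import Path

# Column names copied from the "Results log" step above.
RESULTS_COLUMNS = [
    "experiment", "branch", "parent", "commit", "metric",
    "secondary_metrics", "status", "duration_s", "description",
]


def init_lab(project_root: Path) -> Path:
    """Create the .lab/ skeleton; files that already exist are left untouched."""
    lab = project_root / ".lab"
    (lab / "workspace").mkdir(parents=True, exist_ok=True)

    # Starter contents are placeholders (assumption), except the TSV header.
    starters = {
        "config.md": "# Config\n",
        "results.tsv": "\t".join(RESULTS_COLUMNS) + "\n",
        "log.md": "# Log\n",
        "parking-lot.md": "# Parking lot\n",
        "branches.md": (
            "| Branch | Forked from | Status | Experiments | Best metric | Notes |\n"
            "|---|---|---|---|---|---|\n"
        ),
    }
    for name, content in starters.items():
        path = lab / name
        if not path.exists():
            path.write_text(content, encoding="utf-8")

    # Add .lab/ and run.log to .gitignore without duplicating entries.
    gitignore = project_root / ".gitignore"
    text = gitignore.read_text(encoding="utf-8") if gitignore.exists() else ""
    missing = [e for e in (".lab/", "run.log") if e not in text.splitlines()]
    if missing:
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        gitignore.write_text(
            text + prefix + "\n".join(missing) + "\n", encoding="utf-8"
        )
    return lab
```

Calling `init_lab(Path("."))` twice is harmless: the second call finds every file in place and leaves `.gitignore` unchanged.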
87
+
88
+ ---
89
+
90
+ ## Phase 3: Autonomous Research
91
+
92
+ ### Flow: THINK → TEST → REFLECT → repeat
93
+
94
+ **THINK** — Before anything, read: `.lab/results.tsv`, `.lab/log.md` (only the last 5 entries once the log holds 20 or more), `.lab/branches.md`, `.lab/parking-lot.md`, and in-scope source files. Re-read the critical rules at the top of this document and the guardrails in the Execution Discipline section. Then write a `## THINK — before Experiment N` entry in `.lab/log.md` covering:
95
+ 1. **Convergence signals:** check against current state
96
+ 2. **Untested assumptions:** what am I assuming that I haven't tested? Have I tried the opposite of what's currently working? (e.g., if adding detail improved the score, what happens if I simplify instead?)
97
+ 3. **Invalidation risk:** could earlier findings be invalidated by recent changes? (e.g., after changing B, re-test assumptions made when only A was changed)
98
+ 4. **Next hypothesis:** what will I test and why
99
+
100
+ The log entry is mandatory — it is the evidence that you stopped to think. Without it, the THINK phase didn't happen. Stay as long as productive.
101
+
102
+ **TEST** — Implement, run, measure. Verify hypotheses. Follow execution discipline (below). Stay as long as you're generating new data.
103
+
104
+ **REFLECT** — What confirmed? What surprised? What breaks your model? Log everything. Update parking lot.
105
+
106
+ ### Execution Discipline
107
+
108
+ <critical>
109
+ These rules apply to every real experiment without exception. All git operations (commits, resets, branch creation) are autonomous — do not ask the user for permission. They are systemic to the research process, not discretionary actions.
110
+ </critical>
111
+
112
+ **Repo-file experiments** modify any file in scope (as defined in config). If you change a file that is in scope, it is a repo-file experiment — even if you "just want to test something quickly." No exceptions.
113
+ **Lab-only experiments** only touch `.lab/` or files outside scope. The commit rules below apply to repo-file experiments. Lab-only experiments just need logging.
114
+
115
+ **For every real experiment (code change + run):**
116
+
117
+ 1. **Commit BEFORE running** (repo-file experiments only):
118
+ ```
119
+ experiment #{N}: {short description}
120
+
121
+ Branch: {research branch name}
122
+ Parent: #{parent experiment number}
123
+ Hypothesis: {one-line hypothesis}
124
+ ```
125
+ Next experiment number = highest `experiment` in `.lab/results.tsv` + 1. Keeps stay on the branch as permanent checkpoints. Discards are reset; their SHA is recorded in `results.tsv` and stays reachable only until git garbage-collects the now-unreferenced commit. Fork from discarded SHAs sooner rather than later.
126
+
127
+ 2. Execute ALL measure commands (primary + secondary), record raw values
128
+
129
+ 3. **Log first** — write a structured entry to `.lab/log.md` and a row to `.lab/results.tsv` (including the commit SHA). This must happen before any reset.
130
+
131
+ 4. Then decide:
132
+ - **KEEP** — metric improved by more than the 0.1% noise threshold, or equal with simpler code
133
+ - **KEEP*** — primary improved but secondary significantly regressed (log the trade-off, note in commit or lab log)
134
+ - **DISCARD** — metric equal or worse → `git reset --hard HEAD~1`. The commit disappears from the branch but its SHA is in `.lab/results.tsv`. Want to revisit a discarded idea? Fork a new branch from that SHA.
135
+ - **INTERESTING** — metric didn't improve, but result reveals something valuable → keep or reset, your call
136
+ - **CRASH** — `git reset --hard HEAD~1`. Read only the last 50 lines of `run.log`, or grep it for error patterns.
137
+ - Trivial (typo, missing import): fix and re-run ONCE
138
+ - Fundamental (OOM, missing dependency): log, reset, move on
139
+ - 3+ crashes in a row: rethink the approach entirely
140
+ - **TIMEOUT** — kill, log as crash (metric = 0.000000), reset. 2+ in a row: reassess viability.
141
+
142
+ 5. **Guardrails** (after every decide/reset):
143
+ - <critical>**3+ discards in a row:** STOP. Write a `## 3-Discard Guardrail — after Experiment N` entry in `.lab/log.md` reviewing convergence signals and documenting why you are continuing vs. forking. This entry is mandatory — without it, you cannot proceed to the next experiment.</critical>
144
+ - <critical>**5+ discards in a row:** Fork is the **default action**. Write a `## 5-Discard Fork — after Experiment N` entry in `.lab/log.md`. Before forking, check `.lab/parking-lot.md` — if there are untested ideas there, try one first. Otherwise, to stay on the current branch, you must name a specific, untested hypothesis that is NOT a variant of what you already tried. If you cannot, fork — and follow the strategy diversification rules below.</critical>
145
+ - **Global best unchanged for 8+ real experiments:** You are on a plateau. Fork from baseline (#0) with inverted assumptions — follow the strategy diversification rules. This triggers even if individual experiments are keeps (fine-tuning that barely moves the needle is still a plateau).
146
+ - <critical>**Every 10th real experiment** (experiment #10, #20, #30...): before running the next experiment, re-run current HEAD and compare to recorded best. Log the re-validation result in `.lab/log.md` as `## Re-Validation after Experiment N`. If regressed >2%, log drift and consider forking from the best experiment. This is mandatory — do not skip.</critical>
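The keep/discard rule and the discard-streak guardrails above can be sketched as follows. This is an illustration under stated assumptions, not the skill's implementation: the 0.1% threshold is taken relative to the current best, "equal" is read as "within the noise threshold", `thought` rows are skipped when counting a streak, and any other status ends it. The function names are hypothetical.

```python
from typing import List, Optional


def decide(new: float, best: float, higher_is_better: bool,
           noise: float = 0.001, simpler: bool = False) -> str:
    """Return 'keep' or 'discard' for a completed run.

    Assumption: the noise threshold is 0.1% of the current best value.
    """
    gain = (new - best) if higher_is_better else (best - new)
    threshold = abs(best) * noise
    if gain > threshold:
        return "keep"  # clear improvement
    if abs(gain) <= threshold and simpler:
        return "keep"  # equal within noise, but the code got simpler
    return "discard"


def trailing_discards(statuses: List[str]) -> int:
    """Count consecutive 'discard' rows at the end of the results log.

    Assumption: 'thought' rows are ignored, any other status ends the streak.
    """
    streak = 0
    for status in reversed(statuses):
        if status == "thought":
            continue
        if status != "discard":
            break
        streak += 1
    return streak


def guardrail(streak: int) -> Optional[str]:
    """Map a discard streak to the action required by the guardrails."""
    if streak >= 5:
        return "fork"    # fork is the default action
    if streak >= 3:
        return "review"  # stop and write the 3-discard log entry
    return None
```

For example, `guardrail(trailing_discards(["keep", "discard", "thought", "discard", "discard"]))` returns `"review"`, because the streak is three real discards.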
147
+
148
+ **For every thought experiment:** Log with status `thought` in both files.
149
+
150
+ **Log entry format** — each entry as a heading, followed by labeled fields (one per line or inline, your choice — just be consistent):
151
+ ```
152
+ ## Experiment N — <title>
153
+ Branch: ... / Type: thought|real / Parent: #M
154
+ Hypothesis: ...
155
+ Changes: ...
156
+ Result: ...
157
+ Duration: ...
158
+ Status: keep|discard|crash|thought|keep*|interesting
159
+ Insight: ...
160
+ ```
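As a hedged illustration of the "log first" rule and the numbering rule above, the sketch below appends a row to `.lab/results.tsv` and derives the next experiment number from it. The helper names are hypothetical. Reading only the leading digits of the `experiment` column is an assumption so that re-scored rows such as `2v2` still count as experiment 2.

```python
import csv
import re
from pathlib import Path


def next_experiment_number(results_tsv: Path) -> int:
    """Return highest experiment number in the file + 1 (0 for an empty log)."""
    highest = -1
    with results_tsv.open(newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            # "2v2" (a re-scored row) counts as experiment 2.
            match = re.match(r"\d+", row["experiment"] or "")
            if match:
                highest = max(highest, int(match.group()))
    return highest + 1


def append_result(results_tsv: Path, row: dict) -> None:
    """Append one row, using the header already present in the file."""
    with results_tsv.open(newline="", encoding="utf-8") as fh:
        columns = next(csv.reader(fh, delimiter="\t"))
    with results_tsv.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=columns, delimiter="\t", lineterminator="\n"
        )
        writer.writerow(row)  # missing columns are written as empty strings
```

Calling `append_result` before any `git reset` keeps the commit SHA on record even after the commit leaves the branch.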
161
+
162
+ ### Autonomy
163
+
164
+ **Default: complete autonomy.** You do not return to the user with progress updates. You work, you log, the user observes.
165
+
166
+ **Consult the user ONLY when:**
167
+ 1. The only viable path requires modifying files outside agreed scope
168
+ 2. You have exhausted all strategies, branches, and parking lot ideas
169
+
170
+ When the user intervenes: accept the direction, log the intervention, continue.
171
+
172
+ ### Branching
173
+
174
+ The experiment history is non-linear. Fork branches to explore divergent approaches.
175
+
176
+ **When to fork:** fundamentally different approach from an earlier state, current branch stagnating, combining keeps from different branches into a new line of experimentation, or promising divergence.
177
+
178
+ **How to fork:**
179
+ 1. Pick a parent experiment from any branch. For keeps: `git log --oneline --grep="experiment #N:"`. For discards: find the SHA in `.lab/results.tsv`.
180
+ 2. `git checkout <SHA>` → `git checkout -b research/<descriptive-slug>`
181
+ 3. Register in `.lab/branches.md` (the "Forked from" column tracks genealogy — branch names don't need to encode it)
182
+ 4. Continue — next experiment's parent is the forked-from experiment. Experiment numbers are global (not per-branch).
183
+
184
+ Always consider results from ALL branches when thinking. Mark exhausted branches as `closed` in `.lab/branches.md`.
185
+
186
+ ### Strategy Diversification
187
+
188
+ When forking due to stagnation, you are probably stuck in a local optimum. Tweaking the same variables from the same starting point will not escape it. Before creating the fork:
189
+
190
+ 1. **Write an assumptions list** in `.lab/log.md`: what does the current best strategy assume? (e.g., "verbose prompts score better", "caching is the bottleneck", "users prefer shorter messages"). These are your current priors.
191
+ 2. **Choose a fork point deliberately:**
192
+ - Fork from **baseline (#0)** when you want to explore a completely different region — this prevents anchoring to your current best.
193
+ - Fork from the **best keep** only when you want to refine or combine with a specific finding.
194
+ - Fork from a **discarded experiment** when it showed an interesting signal worth pursuing differently.
195
+ 3. <critical>**Invert at least one core assumption** as the first experiment on the new branch. This is mandatory, not optional. If the current strategy assumes "more detail is better" — try minimal. If it assumes "aggressive caching" — try no caching. Not a minor tweak of the same approach. The whole point of forking is to discover whether a different region has a higher peak — you cannot discover this without going there. Invert means explore the opposite region, not the opposite extreme.</critical>
196
+ 4. **Name the branch after the strategy**, not the parameter (e.g., `research/low-alpha-approach` not `research/tweak-delta`).
197
+
198
+ ### Metric Revision
199
+
200
+ When the current metric is flawed — dimensions are unmeasurable from output, scale doesn't differentiate quality, or rubric misses what actually matters — revise it mid-series:
201
+
202
+ 1. **Log the problem** — in `.lab/log.md`, describe what is wrong with the current metric and why (e.g., which dimensions always score neutral, what the metric fails to capture)
203
+ 2. **Define new metric** — in `.lab/config.md`, add a `## Metric v2` section (keep v1 intact). Include: date, what changed, rationale for each dropped/added/modified dimension
204
+ 3. **Re-score all keeps** — evaluate every existing keep with the new metric. This is mandatory — without re-scoring, the trend in `results.tsv` is meaningless because you cannot tell whether improvement came from the experiment or the metric change
205
+ 4. **Mark re-scored rows** — append new rows to `results.tsv` with a version suffix on the experiment number (e.g., `2v2` for experiment #2 re-scored under metric v2). Original rows stay untouched for audit
206
+ 5. **Continue** — the new metric applies to all experiments from this point forward. Update baseline in config if re-scoring changed its value
207
+
208
+ Metric revision is expensive (re-scoring every keep), so do it once and get it right. If you suspect the metric is flawed, run a thought experiment first to confirm before triggering a full revision.
209
+
210
+ ---
211
+
212
+ ## Phase 4: Wrap-Up
213
+
214
+ When the termination condition is met or the user interrupts:
215
+
216
+ 1. **Re-validate** — re-run from global best, confirm final metric. For qualitative metrics, use the Multi-Evaluator Protocol.
217
+ 2. **Summary** — write `.lab/summary.md`: total experiments, keeps, discards per branch and global; best vs baseline; top 3 impactful changes; branch history; experiment genealogy; key insights; failed approaches; remaining parking lot ideas
218
+ 3. **Code state** — checkout the branch containing the global best experiment. If it's on a closed branch, create a new branch from that experiment's SHA. Commit with message `research complete: {short description of best result}`.
219
+ 4. **Report** — present summary concisely
220
+
221
+ ---
222
+
223
+ ## Qualitative Rubric
224
+
225
+ When the primary metric is qualitative, define a rubric in `.lab/config.md` during Phase 2:
226
+ 1. List 3–5 criteria with clear definitions
227
+ 2. Assign weights (sum to 1.0)
228
+ 3. Use a consistent scale (e.g., 1–10)
229
+ 4. Composite score = `sum(criterion_score × weight)`
230
+
231
+ This composite becomes the quantitative proxy. Log it in results.tsv with per-criterion scores in log entries.
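The composite formula above is a weighted sum. A minimal sketch, with hypothetical criterion names, might look like this:

```python
import math
from typing import Dict


def composite_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-criterion scores; weights must sum to 1.0."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same criteria")
    if not math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1.0")
    return sum(scores[name] * weights[name] for name in weights)


# Hypothetical rubric on a 1-10 scale.
weights = {"clarity": 0.5, "accuracy": 0.3, "brevity": 0.2}
scores = {"clarity": 8, "accuracy": 6, "brevity": 5}
result = composite_score(scores, weights)  # 8*0.5 + 6*0.3 + 5*0.2, about 6.8
```

Validating the weights up front catches a rubric whose weights drifted away from 1.0 after a metric revision.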
232
+
233
+ ### Multi-Evaluator Protocol
234
+
235
+ When the metric is qualitative (agent judgment), a single evaluator introduces bias — the same agent that made the change also judges it. To counteract this:
236
+
237
+ 1. **Spawn at least 3 evaluator subagents** per experiment. You (the main agent) spawn each one as a separate subagent call. Each evaluator is a fresh subagent with no shared context. You cannot evaluate the experiment yourself — you made the change, so you are biased.
238
+ 2. **Each evaluator subagent receives only:**
239
+ - The artifact/output to evaluate (e.g., the file content)
240
+ - The rubric (criteria, weights, scale)
241
+ - An instruction to return scores in a structured format
242
+ - Nothing else — no hypothesis, no experiment number, no prior scores, no context about what changed or why
243
+ 3. **Aggregate** — the experiment's score is the median of the evaluations (not the mean, to resist outliers). Log all individual scores in `.lab/log.md`, median in `results.tsv`.
244
+ 4. **Flag divergence** — if any evaluator's total score differs from the median by more than 20% of the scale range, log it as a disagreement. Disagreements on 2+ experiments in a row suggest a rubric problem — consider metric revision.
245
+
246
+ This protocol is mandatory for qualitative metrics. Quantitative metrics (command output) do not need it.
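The aggregation and divergence steps of the protocol can be sketched as below. This assumes each evaluator returns a single composite score and that "scale range" means `scale_max - scale_min` (9 for a 1 to 10 scale); the function name is hypothetical.

```python
from statistics import median
from typing import List, Tuple


def aggregate(evaluator_scores: List[float],
              scale_min: float = 1.0, scale_max: float = 10.0,
              tolerance: float = 0.20) -> Tuple[float, List[float]]:
    """Return (median score, scores that diverge from the median).

    A score diverges when it differs from the median by more than
    `tolerance` times the scale range.
    """
    if len(evaluator_scores) < 3:
        raise ValueError("the protocol requires at least 3 evaluators")
    mid = median(evaluator_scores)
    limit = tolerance * (scale_max - scale_min)
    divergent = [s for s in evaluator_scores if abs(s - mid) > limit]
    return mid, divergent
```

With scores `[6.8, 7.1, 9.5]` the median is 7.1 and only 9.5 is flagged, since it sits 2.4 points from the median against a limit of 1.8. A non-empty second element is what would be logged as a disagreement.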
247
+
248
+ ## Hypothesis Strategies
249
+
250
+ Tools when you're stuck, not a menu to follow. You have complete freedom to invent your own.
251
+
252
+ | Strategy | When it helps |
253
+ |----------|---------------|
254
+ | **Ablation** — remove something | Unsure what's actually helping |
255
+ | **Amplification** — push what works further | After a keep |
256
+ | **Combination** — merge wins from separate experiments | Multiple keeps in different areas |
257
+ | **Inversion** — try the opposite | String of discards |
258
+ | **Isolation** — change one variable | Unclear what helped |
259
+ | **Analogy** — borrow from adjacent domains | Truly stuck |
260
+ | **Simplification** — remove complexity, preserve metric | Accumulated cruft |
261
+ | **Scaling** — change by order of magnitude | Small tweaks plateaued |
262
+ | **Decomposition** — split big change into parts | Promising change discarded |
263
+ | **Sweep** — test parameter across a range | Right value unknown |
264
+
265
+ ## Convergence Signals
266
+
267
+ | Signal | Meaning |
268
+ |--------|---------|
269
+ | 5+ discards in a row | Current approach exhausted |
270
+ | Thought experiments repeating | Go empirical |
271
+ | Results consistently confirm theory | Go deeper |
272
+ | Results contradict theory | Model is wrong — rethink |
273
+ | Metric plateau (<0.5% over 5 keeps) | Try something radically different |
274
+ | Same code area modified 3+ times | Explore elsewhere |
275
+ | Alternating keep/discard on similar changes | Isolate variables |
276
+ | 2+ timeouts in a row | Approach too expensive |
277
+ | One branch stagnating, another thriving | Switch or combine |
278
+ | Best results split across branches | Fork to combine |
279
+ | Change only tested in one direction | Test the opposite to confirm the assumption holds |
280
+ | 5+ discards with increasingly desperate variants | Locally optimal — fork from baseline, invert assumptions |
281
+ | All branches share the same core assumptions | Anchored — fork from baseline and invert |
282
+ | Global best unchanged for 8+ experiments | Plateau — fork from baseline with inverted assumptions |
283
+ | Dimension always scores neutral (e.g., 5/10) | Dimension unmeasurable — consider metric revision |
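One of the signals in the table, the metric plateau (<0.5% over 5 keeps), is easy to check mechanically. The sketch below is an illustration only: it assumes the change is measured between the first and last of the most recent five keeps, relative to the first, and the function name is hypothetical.

```python
from typing import List


def is_plateau(keep_metrics: List[float], window: int = 5,
               min_gain: float = 0.005) -> bool:
    """True when the last `window` keeps moved the metric by less than 0.5%."""
    if len(keep_metrics) < window:
        return False  # not enough keeps to judge
    first, last = keep_metrics[-window], keep_metrics[-1]
    change = abs(last - first)
    # An unchanged metric always counts, which also covers first == 0.
    return change == 0 or change < min_gain * abs(first)
```

For instance, `is_plateau([100, 100.1, 100.2, 100.3, 100.4])` is true (0.4% total movement), while `is_plateau([100, 101, 102, 103, 104])` is false.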