npm - @glrs-dev/cli - Versions diffs - 0.0.1 → 0.1.1 - Mend

@glrs-dev/cli 0.0.1 → 0.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (173) hide show

package/dist/vendor/harness-opencode/dist/skills/research/SKILL.md ADDED Viewed

@@ -0,0 +1,350 @@
+---
+name: research
+description: Research orchestrator — plans workstreams, dispatches parallel research agents (local, web, or auto), reviews results, identifies gaps, and iterates until comprehensive. Use when user says 'research', '/research', 'investigate', 'deep dive', 'explore', 'understand how', 'what do we know about'. Provide the research topic and context.
+---
+# /research — Research Orchestrator
+Multi-round research orchestrator that plans, dispatches, reviews, and iterates across research modes until the result is comprehensive.
+**Research Query:** $ARGUMENTS
+```
+THE IRON LAW: EVERY STEP IS A SUBAGENT. NO EXCEPTIONS.
+You are an orchestrator. You do NOT research, synthesize, review, or analyze.
+You launch subagents and pass their outputs to other subagents.
+You use the Agent tool for ALL dispatches — never Skill tool.
+Your jobs:
+1. Launch a planning subagent
+2. Dispatch research agents (which read skill files and follow them)
+3. Launch a review subagent
+4. If gaps exist, dispatch more research agents
+5. Launch a synthesis subagent
+6. Present the final result
+You do NOTHING else. Every cognitive task is a subagent.
+```
+## Execution Model
+```
+CRITICAL — WHY AGENT TOOL, NOT SKILL TOOL:
+Skill() runs the sub-skill in the main conversation. It dumps the full skill
+instructions + all output into your context window. By the time you run a second
+skill, you're out of context and the pipeline stalls.
+Agent tool runs in a subprocess with its own context window. Each research agent
+gets a fresh context, reads its skill file from the bundled skills, and reports back.
+The main conversation stays clean for orchestration.
+DISPATCH TEMPLATE:
+  Agent tool with prompt:
+  "You are a research agent.
+   ## Research Query
+   {the full query or sub-question}
+   ## Task
+   1. Read the bundled {skill-name} skill via the Skill tool
+   2. {any additional context or constraints}
+   3. Report back with your complete findings"
+PARALLEL: Independent workstreams → ALL Agent calls in ONE message
+SEQUENTIAL: Dependent workstreams → wait for prior agents
+```
+If no query is provided as $ARGUMENTS, ask the user what they want to research.
+## Phase 1: Plan — Subagent
+Launch a **general-purpose subagent** to plan the research:
+```
+PROMPT:
+"You are a research planner. Given a research query, decompose it into workstreams
+and classify each by research type.
+Research Query: [QUERY]
+For each workstream, provide:
+1. A specific sub-question to answer
+2. Classification: LOCAL, WEB, or AUTO
+3. Why this classification (one sentence)
+4. Dependencies: which other workstreams must complete first (if any)
+Classification rules:
+- LOCAL: Questions answerable from THIS codebase — architecture, data flow,
+  patterns, implementations, file structure, dependencies within the repo
+- WEB: Questions requiring external knowledge — best practices, competitor
+  analysis, market research, industry standards, technology comparisons,
+  regulatory info, documentation for external tools/libraries
+- AUTO: ONLY when the query explicitly asks for experimentation with measurable
+  outcomes and iterative trials. This is RARE. Most queries are LOCAL or WEB.
+A single query often needs BOTH local and web workstreams. For example:
+'How should we redesign our auth system?' needs LOCAL (how does it work now?)
+AND WEB (what are current best practices?).
+Output 3-6 workstreams. Prefer a mix of LOCAL and WEB when the query benefits.
+Mark dependencies explicitly — independent workstreams will run in parallel."
+```
+## Phase 2: Execute Round 1 — Parallel Agent Dispatches
+From the plan, dispatch **one Agent per workstream**. Launch ALL independent workstreams in a SINGLE message.
+```
+FOR EACH LOCAL WORKSTREAM:
+  Agent tool:
+  "You are a codebase research agent.
+   ## Research Query
+   {original query}
+   ## Your Workstream
+   {sub-question from plan}
+   ## Task
+   1. Read the bundled research-local skill via the Skill tool and follow every instruction
+   2. Focus specifically on: {sub-question}
+   3. Report back with your complete findings including all file:line references"
+FOR EACH WEB WORKSTREAM:
+  Agent tool:
+  "You are a web research agent.
+   ## Research Query
+   {original query}
+   ## Your Workstream
+   {sub-question from plan}
+   ## Task
+   1. Read the bundled research-web skill via the Skill tool and follow every instruction
+   2. Focus specifically on: {sub-question}
+   3. Write your output to research/{slug}/{workstream-name}.md
+   4. Report back with your complete findings including source URLs"
+FOR EACH AUTO WORKSTREAM (rare):
+  Agent tool:
+  "You are an experimentation agent.
+   ## Research Query
+   {original query}
+   ## Your Workstream
+   {sub-question from plan}
+   ## Task
+   1. Read the bundled research-auto skill via the Skill tool and follow every instruction
+   2. Focus specifically on: {sub-question}
+   3. Report back with your findings and experiment results"
+CRITICAL: ALL independent workstreams in ONE message.
+Wait for ALL agents to complete before proceeding.
+If workstream B depends on A, launch A first, wait, then launch B.
+```
+## Phase 3: Review Round 1 — Subagent
+Launch a **general-purpose subagent** to review all findings:
+```
+PROMPT:
+"You are a research reviewer. Given a research query and findings from multiple
+parallel research agents, assess completeness and identify gaps.
+Original Query: [QUERY]
+Workstream Plan:
+[PLAN FROM PHASE 1]
+Findings from all agents:
+[ALL RESULTS FROM PHASE 2]
+Evaluate:
+1. **Coverage** — Is every workstream adequately answered?
+2. **Depth** — Are answers backed by specific evidence (file:line for local, URLs for web)?
+3. **Connections** — Are cross-cutting themes identified? Do local and web findings
+   inform each other?
+4. **Contradictions** — Do any findings conflict? If so, which is more credible?
+5. **Blind spots** — What aspects of the query weren't addressed by ANY workstream?
+For each gap found, provide:
+- What is missing
+- Classification: LOCAL or WEB (where to look for the answer)
+- Specific search guidance (file patterns, search terms, URLs to check)
+- Why it matters for answering the original query
+Output:
+- VERDICT: COMPREHENSIVE or GAPS FOUND
+- If GAPS FOUND: list each gap as a numbered, dispatchable research question with classification
+- CROSS-CUTTING INSIGHTS: themes that span multiple workstreams (these feed synthesis)"
+```
+## Phase 4: Execute Round 2 — Fill Gaps (If Needed)
+If the review found gaps, dispatch **Agent tool calls for each gap** — ALL in ONE message.
+Use the same dispatch templates from Phase 2, but with the gap question as the workstream and any relevant prior findings included as context.
+```
+PROMPT ADDITION FOR GAP-FILLING AGENTS:
+"## Context from Prior Research
+{relevant findings from Round 1 that border this gap}
+Focus on what the prior research missed. Don't repeat known findings."
+```
+If COMPREHENSIVE, skip to Phase 5.
+## Phase 5: Review Round 2 — Subagent (If Phase 4 Ran)
+Launch another review subagent with the SAME prompt template as Phase 3, but now including Round 1 + Round 2 findings.
+```
+ITERATION RULES:
+- Maximum 3 total rounds of execute → review (Round 1 is mandatory)
+- If still GAPS FOUND after Round 3, proceed to synthesis anyway — note remaining
+  gaps as open questions
+- Each round should have FEWER gaps than the previous. If gap count increases,
+  stop iterating and proceed to synthesis.
+```
+## Phase 6: Synthesize — Subagent
+Launch a **general-purpose subagent** to produce the final synthesis:
+```
+PROMPT:
+"You are a research synthesizer. Combine findings from multiple research agents
+into a single, authoritative document.
+Original Query: [QUERY]
+All Research Findings (all rounds):
+[EVERY FINDING FROM ALL ROUNDS]
+Cross-Cutting Insights from Reviews:
+[INSIGHTS FROM PHASE 3/5 REVIEW AGENTS]
+Create a comprehensive research report:
+## Executive Summary
+3-5 sentences answering the original query directly.
+## Key Findings
+Organized by theme (NOT by workstream). Each finding should:
+- State the conclusion
+- Cite evidence (file:line for local, URL for web)
+- Note confidence level (high/medium/low based on evidence strength)
+## Architecture & Code (if local research was involved)
+How the relevant code is structured, with specific file:line references.
+## External Context (if web research was involved)
+Best practices, market context, competitor approaches — with source URLs.
+## Connections
+How local findings and external findings inform each other. Where the codebase
+aligns with or diverges from industry standards/best practices.
+## Recommendations
+Actionable next steps based on the research. Each recommendation should cite
+the finding that supports it.
+## Open Questions
+Anything that couldn't be fully resolved. Note whether it needs LOCAL exploration,
+WEB research, or human input.
+IMPORTANT: Organize by THEME, not by source agent. Merge overlapping findings.
+Resolve contradictions explicitly. Every claim needs a citation."
+```
+## Phase 7: Final Quality Gate — Subagent
+Launch a **general-purpose subagent** to score the final report:
+```
+PROMPT:
+"You are a quality reviewer for research output. Score this report.
+Original Query: [QUERY]
+Final Report: [OUTPUT FROM PHASE 6]
+Score 1-5 on each dimension:
+1. **Answers the question** — Does the report directly address what was asked?
+2. **Evidence quality** — Are claims backed by specific file:line or URL citations?
+3. **Depth** — Does it go beyond surface-level to explain mechanisms and tradeoffs?
+4. **Synthesis** — Are findings from different sources meaningfully connected?
+5. **Actionability** — Could someone make decisions based on this report?
+Overall: average of all dimensions.
+If >= 4.0: 'QUALITY: PASS' with minor suggestions
+If < 4.0: 'QUALITY: NEEDS WORK' with specific deficiencies framed as questions
+Note: At this stage, the report is presented regardless of score. The score
+helps the user gauge confidence."
+```
+## Phase 8: Present
+Present to the user:
+1. The full synthesis report from Phase 6
+2. Quality score from Phase 7
+3. Research metadata: how many rounds, how many agents dispatched, modes used
+4. If quality < 4.0, explicitly note which areas are weak and suggest follow-up commands
+```
+DO NOT dump raw agent outputs. Only the synthesized report.
+DO NOT ask "which section would you like to explore further?" — present everything.
+DO suggest specific follow-up /research commands if open questions remain.
+```
+## Subagent Summary
+```
+MINIMUM SUBAGENTS PER RESEARCH (no gaps found):
+  1 planner + 3-6 research agents + 1 reviewer + 1 synthesizer + 1 quality = 7-10
+TYPICAL SUBAGENTS (one round of gap filling):
+  1 planner + 4 research + 1 review + 2 gap-fill + 1 review + 1 synth + 1 quality = 11
+MAXIMUM SUBAGENTS (three rounds):
+  1 planner + 4 research + 1 review + 3 gap-fill + 1 review + 2 gap-fill + 1 review
+  + 1 synth + 1 quality = 15
+Each research agent internally spawns its OWN subagents (research-local spawns
+8-14 Explore subagents, research-web spawns parallel web research agents).
+Total subagent tree can be 30-50+ agents for a complex query. This is by design.
+```
+## Red Flags — STOP
+- About to use Skill() to invoke a sub-skill — USE AGENT TOOL
+- About to research something yourself — LAUNCH A SUBAGENT
+- About to synthesize findings yourself — LAUNCH A SYNTHESIS SUBAGENT
+- About to skip the review phase — REVIEW IS MANDATORY AFTER EVERY ROUND
+- About to skip the planning phase — THE PLANNER SUBAGENT DECIDES WORKSTREAMS
+- About to launch research agents one at a time — ONE MESSAGE, ALL INDEPENDENT AGENTS
+- About to present raw agent outputs — SYNTHESIZE FIRST
+- About to decide "no gaps" without a review subagent — THE REVIEWER DECIDES, NOT YOU
+- About to run a 4th round of gap filling — MAX 3 ROUNDS, THEN PRESENT
+## Common Rationalizations
+| Excuse | Reality |
+|--------|---------|
+| "I'll use Skill() — it's simpler" | Skill() eats main context. Agent tool keeps orchestration clean. |
+| "I can plan the workstreams myself" | The planning subagent produces better decompositions with proper classification. |
+| "One round of research is enough" | Review almost always finds gaps. The iterate-review loop is what makes research comprehensive. |
+| "I'll skip the review — the findings look complete" | Your intuition is not a quality gate. The review subagent finds what you miss. |
+| "I'll synthesize from the agent summaries" | A synthesis subagent with all findings produces better-connected, themed output. |
+| "This only needs local research" | The planner subagent decides. Many queries benefit from both local and web context. |
+| "I'll route to AUTO for thoroughness" | AUTO is for experimentation with measurable outcomes. Thoroughness comes from iteration, not AUTO. |
+| "I'll launch agents sequentially to be safe" | Parallel is always faster. All independent workstreams in one message. |
+| "The quality review is overhead" | The quality score tells the user how much to trust the report. 30 seconds well spent. |

package/dist/vendor/harness-opencode/dist/skills/research-auto/SKILL.md ADDED Viewed

@@ -0,0 +1,283 @@
+---
+name: research-auto
+description: Autonomous experimentation skill — agent interviews the user, sets up a lab, then explores freely (think, test, reflect) until stopped or a target is hit. Works for any domain where you can measure or evaluate a result. Use when user says 'optimize this', 'experiment with', 'find the best approach', 'iterate on', 'research mode'. Do NOT use for binary validation tests (use /spec-lab instead). Based on ResearcherSkill v1.4.4 by krzysztofdudek.
+---
+# /research-auto — Autonomous Experimentation Skill
+<critical>
+Non-negotiable rules — every real experiment, no exceptions:
+1. Commit before running. Log before resetting. Reset on discard. Each branch holds only keeps.
+2. Protect `.lab/` — it is the single source of truth.
+3. Work autonomously — only consult the user for scope violations or true dead ends.
+4. Follow ALL guardrails (discard streaks, plateau, re-validation). They are mandatory, not suggestions.
+</critical>
+You are entering **researcher mode**. This skill is for YOU — the main agent. You orchestrate the entire research process: planning, implementing, committing, measuring, logging. When you need independent work done (evaluation, analysis), you spawn subagents with specific, scoped tasks. You control what each subagent knows through the prompt you give it.
+You have **complete freedom** in how you navigate the problem space. The strategies and signals later in this document are tools when you need them, not rails you must follow.
+---
+## `.lab/` is Sacred
+`.lab/` is an **untracked, local directory** — the single source of truth for all experiment history. It survives all git operations because it is in `.gitignore`. Git manages code state. `.lab/` manages experiment knowledge. They are independent.
+**Structure:**
+- `.lab/config.md`, `results.tsv`, `log.md`, `branches.md`, `parking-lot.md` — experiment metadata
+- `.lab/workspace/` — scratch space for experiment files (scripts, test data, generated output, per-experiment subdirectories). Create whatever you need here — it's yours, untracked, and safe from git operations.
+Always protect `.lab/`. When cleaning the repo, use targeted commands that preserve untracked directories. When resetting, use `git reset` and `git checkout` which leave `.lab/` intact.
+---
+## Phase 0: Resume Check
+Check if `.lab/` already exists in the project root.
+**If it exists:**
+1. Read `.lab/config.md`, `.lab/results.tsv`, `.lab/branches.md`, and tail of `.lab/log.md`
+2. Present a summary: objective, metrics, active branches, experiment counts, current best vs baseline, last experiment status
+3. Ask: **resume or start fresh?**
+   - **Resume** → checkout the active branch, pick up from next experiment number, jump to Phase 3
+   - **Start fresh** → archive to `.lab.bak.<timestamp>/`, proceed to Phase 1
+**If it does not exist:** proceed to Phase 1.
+---
+## Phase 1: Discovery
+Before any experiment, understand the problem. Ask these questions conversationally — skip what's obvious from context, use the **defaults** shown when the user has no preference:
+1. **Objective** — What are we trying to achieve?
+2. **Metrics** — How do we measure success?
+   - **Primary metric** (required): drives keep/discard decisions
+     - *Quantitative*: a command that outputs a number
+     - *Qualitative*: agent judgment against a rubric (see Qualitative Rubric below). Before building the rubric, ask the user: **(A)** "I know my criteria" — user provides them, or **(B)** "Help me figure it out" — generate a focused research prompt the user runs in an external tool, then build the rubric from the results instead of assumptions.
+   - **Secondary metrics** (optional): tracked for context, don't drive decisions unless primary is tied
+   - For each: **name**, **measure command** (or "agent judgment"), **direction** (lower/higher is better)
+3. **Scope** — What files/areas can we modify?
+4. **Constraints** — What is off-limits?
+5. **Run command** — How do we execute one experiment? Single command or chain (entire chain must succeed). May be omitted for qualitative-only research.
+6. **Wall-clock budget per experiment** — Maximum time a single experiment run may take before being killed. Default: **5 minutes**.
+7. **Termination** — When do we stop? Default: **infinite** (run until user interrupts or target is reached). Do not self-impose experiment limits. If the session ends (context limit, interruption), `.lab/` persists — the next session resumes via Phase 0.
+   - *Target value*: stop when primary metric reaches X
+   - *Experiment count*: stop after N experiments (only if the user explicitly requests it)
+Once you have answers, **repeat the configuration back** and get explicit confirmation before proceeding.
+---
+## Phase 2: Lab Setup
+After confirmation:
+1. **Branch** — Create `research/<slug>` from current HEAD.
+2. **Lab directory** — Create `.lab/` in the project root.
+3. **Config file** — Write `.lab/config.md` with all agreed parameters (objective, metrics with measure commands and directions, run command, scope, constraints, wall-clock budget, termination condition, baseline and best placeholders).
+4. **Results log** — Create `.lab/results.tsv` with tab-separated columns: `experiment`, `branch`, `parent`, `commit`, `metric`, `secondary_metrics`, `status`, `duration_s`, `description`. Status values: `keep`, `discard`, `crash`, `thought`, `keep*`, `interesting`.
+5. **Iteration log** — Create `.lab/log.md`
+6. **Parking lot** — Create `.lab/parking-lot.md` for deferred ideas
+7. **Branch registry** — Create `.lab/branches.md` with columns: Branch, Forked from, Status, Experiments, Best metric, Notes
+8. **Workspace** — Create `.lab/workspace/` for scratch files (scripts, test data, generated output). Use per-experiment subdirectories (e.g., `.lab/workspace/exp-3/`) when needed.
+9. **Git ignore** — Add `.lab/` and `run.log` to `.gitignore`.
+10. **Baseline** — Record experiment #0 with NO changes. For quantitative: run the measure command. For qualitative: evaluate the current artifact using the Multi-Evaluator Protocol (3 subagents). Fill in baseline in config.
+11. **Start** — Begin autonomous work immediately. No announcements needed.
+---
+## Phase 3: Autonomous Research
+### Flow: THINK → TEST → REFLECT → repeat
+**THINK** — Before anything, read: `.lab/results.tsv`, `.lab/log.md` (last 5 entries if 20+), `.lab/branches.md`, `.lab/parking-lot.md`, and in-scope source files. Re-read the critical rules at the top of this document and the guardrails in the Execution Discipline section. Then write a `## THINK — before Experiment N` entry in `.lab/log.md` covering:
+1. **Convergence signals:** check against current state
+2. **Untested assumptions:** what am I assuming that I haven't tested? Have I tried the opposite of what's currently working? (e.g., if adding detail improved the score, what happens if I simplify instead?)
+3. **Invalidation risk:** could earlier findings be invalidated by recent changes? (e.g., after changing B, re-test assumptions made when only A was changed)
+4. **Next hypothesis:** what will I test and why
+The log entry is mandatory — it is the evidence that you stopped to think. Without it, the THINK phase didn't happen. Stay as long as productive.
+**TEST** — Implement, run, measure. Verify hypotheses. Follow execution discipline (below). Stay as long as you're generating new data.
+**REFLECT** — What confirmed? What surprised? What breaks your model? Log everything. Update parking lot.
+### Execution Discipline
+<critical>
+These rules apply to every real experiment without exception. All git operations (commits, resets, branch creation) are autonomous — do not ask the user for permission. They are systemic to the research process, not discretionary actions.
+</critical>
+**Repo-file experiments** modify any file in scope (as defined in config). If you change a file that is in scope, it is a repo-file experiment — even if you "just want to test something quickly." No exceptions.
+**Lab-only experiments** only touch `.lab/` or files outside scope. The commit rules below apply to repo-file experiments. Lab-only experiments just need logging.
+**For every real experiment (code change + run):**
+1. **Commit BEFORE running** (repo-file experiments only):
+   ```
+   experiment #{N}: {short description}
+   Branch: {research branch name}
+   Parent: #{parent experiment number}
+   Hypothesis: {one-line hypothesis}
+   ```
+   Next experiment number = highest `experiment` in `.lab/results.tsv` + 1. Keeps stay on the branch as permanent checkpoints. Discards are reset — their SHA is recorded in `results.tsv` and remains accessible until `git gc` runs. Fork from discarded SHAs sooner rather than later.
+2. Execute ALL measure commands (primary + secondary), record raw values
+3. **Log first** — write a structured entry to `.lab/log.md` and a row to `.lab/results.tsv` (including the commit SHA). This must happen before any reset.
+4. Then decide:
+   - **KEEP** — metric improved above 0.1% noise threshold, or equal with simpler code
+   - **KEEP*** — primary improved but secondary significantly regressed (log the trade-off, note in commit or lab log)
+   - **DISCARD** — metric equal or worse → `git reset --hard HEAD~1`. The commit disappears from the branch but its SHA is in `.lab/results.tsv`. Want to revisit a discarded idea? Fork a new branch from that SHA.
+   - **INTERESTING** — metric didn't improve, but result reveals something valuable → keep or reset, your call
+   - **CRASH** — `git reset --hard HEAD~1`. Only read last 50 lines of `run.log` or grep for patterns.
+     - Trivial (typo, missing import): fix and re-run ONCE
+     - Fundamental (OOM, missing dependency): log, reset, move on
+     - 3+ crashes in a row: rethink the approach entirely
+   - **TIMEOUT** — kill, log as crash (metric = 0.000000), reset. 2+ in a row: reassess viability.
+5. **Guardrails** (after every decide/reset):
+   - <critical>**3+ discards in a row:** STOP. Write a `## 3-Discard Guardrail — after Experiment N` entry in `.lab/log.md` reviewing convergence signals and documenting why you are continuing vs. forking. This entry is mandatory — without it, you cannot proceed to the next experiment.</critical>
+   - <critical>**5+ discards in a row:** Fork is the **default action**. Write a `## 5-Discard Fork — after Experiment N` entry in `.lab/log.md`. Before forking, check `.lab/parking-lot.md` — if there are untested ideas there, try one first. Otherwise, to stay on the current branch, you must name a specific, untested hypothesis that is NOT a variant of what you already tried. If you cannot, fork — and follow the strategy diversification rules below.</critical>
+   - **Global best unchanged for 8+ real experiments:** You are on a plateau. Fork from baseline (#0) with inverted assumptions — follow the strategy diversification rules. This triggers even if individual experiments are keeps (fine-tuning that barely moves the needle is still a plateau).
+   - <critical>**Every 10th real experiment** (experiment #10, #20, #30...): before running the next experiment, re-run current HEAD and compare to recorded best. Log the re-validation result in `.lab/log.md` as `## Re-Validation after Experiment N`. If regressed >2%, log drift and consider forking from the best experiment. This is mandatory — do not skip.</critical>
+**For every thought experiment:** Log with status `thought` in both files.
+**Log entry format** — each entry as a heading, followed by labeled fields (one per line or inline, your choice — just be consistent):
+```
+## Experiment N — <title>
+Branch: ... / Type: thought|real / Parent: #M
+Hypothesis: ...
+Changes: ...
+Result: ...
+Duration: ...
+Status: keep|discard|crash|thought|keep*|interesting
+Insight: ...
+```
+### Autonomy
+**Default: complete autonomy.** You do not return to the user with progress updates. You work, you log, the user observes.
+**Consult the user ONLY when:**
+1. The only viable path requires modifying files outside agreed scope
+2. You have exhausted all strategies, branches, and parking lot ideas
+When the user intervenes: accept the direction, log the intervention, continue.
+### Branching
+The experiment history is non-linear. Fork branches to explore divergent approaches.
+**When to fork:** fundamentally different approach from an earlier state, current branch stagnating, combining keeps from different branches into a new line of experimentation, or promising divergence.
+**How to fork:**
+1. Pick a parent experiment from any branch. For keeps: `git log --oneline --grep="experiment #N:"`. For discards: find the SHA in `.lab/results.tsv`.
+2. `git checkout <SHA>` → `git checkout -b research/<descriptive-slug>`
+3. Register in `.lab/branches.md` (the "Forked from" column tracks genealogy — branch names don't need to encode it)
+4. Continue — next experiment's parent is the forked-from experiment. Experiment numbers are global (not per-branch).
+Always consider results from ALL branches when thinking. Mark exhausted branches as `closed` in `.lab/branches.md`.
+### Strategy Diversification
+When forking due to stagnation, you are probably stuck in a local optimum. Tweaking the same variables from the same starting point will not escape it. Before creating the fork:
+1. **Write an assumptions list** in `.lab/log.md`: what does the current best strategy assume? (e.g., "verbose prompts score better", "caching is the bottleneck", "users prefer shorter messages"). These are your current priors.
+2. **Choose a fork point deliberately:**
+   - Fork from **baseline (#0)** when you want to explore a completely different region — this prevents anchoring to your current best.
+   - Fork from the **best keep** only when you want to refine or combine with a specific finding.
+   - Fork from a **discarded experiment** when it showed an interesting signal worth pursuing differently.
+3. <critical>**Invert at least one core assumption** as the first experiment on the new branch. This is mandatory, not optional. If the current strategy assumes "more detail is better" — try minimal. If it assumes "aggressive caching" — try no caching. Not a minor tweak of the same approach. The whole point of forking is to discover whether a different region has a higher peak — you cannot discover this without going there. Invert means explore the opposite region, not the opposite extreme.</critical>
+4. **Name the branch after the strategy**, not the parameter (e.g., `research/low-alpha-approach` not `research/tweak-delta`).
+### Metric Revision
+When the current metric is flawed — dimensions are unmeasurable from output, scale doesn't differentiate quality, or rubric misses what actually matters — revise it mid-series:
+1. **Log the problem** — in `.lab/log.md`, describe what is wrong with the current metric and why (e.g., which dimensions always score neutral, what the metric fails to capture)
+2. **Define new metric** — in `.lab/config.md`, add a `## Metric v2` section (keep v1 intact). Include: date, what changed, rationale for each dropped/added/modified dimension
+3. **Re-score all keeps** — evaluate every existing keep with the new metric. This is mandatory — without re-scoring, the trend in `results.tsv` is meaningless because you cannot tell whether improvement came from the experiment or the metric change
+4. **Mark re-scored rows** — append new rows to `results.tsv` with a version suffix on the experiment number (e.g., `2v2` for experiment #2 re-scored under metric v2). Original rows stay untouched for audit
+5. **Continue** — the new metric applies to all experiments from this point forward. Update baseline in config if re-scoring changed its value
+Metric revision is expensive (re-scoring every keep), so do it once and get it right. If you suspect the metric is flawed, run a thought experiment first to confirm before triggering a full revision.
+---
+## Phase 4: Wrap-Up
+When termination is met or user interrupts:
+1. **Re-validate** — re-run from global best, confirm final metric. For qualitative metrics, use the Multi-Evaluator Protocol.
+2. **Summary** — write `.lab/summary.md`: total experiments, keeps, discards per branch and global; best vs baseline; top 3 impactful changes; branch history; experiment genealogy; key insights; failed approaches; remaining parking lot ideas
+3. **Code state** — checkout the branch containing the global best experiment. If it's on a closed branch, create a new branch from that experiment's SHA. Commit with message `research complete: {short description of best result}`.
+4. **Report** — present summary concisely
+---
+## Qualitative Rubric
+When the primary metric is qualitative, define a rubric in `.lab/config.md` during Phase 2:
+1. List 3–5 criteria with clear definitions
+2. Assign weights (sum to 1.0)
+3. Use a consistent scale (e.g., 1–10)
+4. Composite score = `sum(criterion_score × weight)`
+This composite becomes the quantitative proxy. Log it in results.tsv with per-criterion scores in log entries.
+### Multi-Evaluator Protocol
+When the metric is qualitative (agent judgment), a single evaluator introduces bias — the same agent that made the change also judges it. To counteract this:
+1. **Spawn at least 3 evaluator subagents** per experiment. You (the main agent) spawn each one as a separate subagent call. Each evaluator is a fresh subagent with no shared context. You cannot evaluate the experiment yourself — you made the change, so you are biased.
+2. **Each evaluator subagent receives only:**
+   - The artifact/output to evaluate (e.g., the file content)
+   - The rubric (criteria, weights, scale)
+   - An instruction to return scores in a structured format
+   - Nothing else — no hypothesis, no experiment number, no prior scores, no context about what changed or why
+3. **Aggregate** — the experiment's score is the median of the evaluations (not the mean, to resist outliers). Log all individual scores in `.lab/log.md`, median in `results.tsv`.
+4. **Flag divergence** — if any evaluator's total score differs from the median by more than 20% of the scale range, log it as a disagreement. Disagreements on 2+ experiments in a row suggest a rubric problem — consider metric revision.
+This protocol is mandatory for qualitative metrics. Quantitative metrics (command output) do not need it.
+## Hypothesis Strategies
+Tools when you're stuck, not a menu to follow. You have complete freedom to invent your own.
+| Strategy | When it helps |
+|----------|---------------|
+| **Ablation** — remove something | Unsure what's actually helping |
+| **Amplification** — push what works further | After a keep |
+| **Combination** — merge wins from separate experiments | Multiple keeps in different areas |
+| **Inversion** — try the opposite | String of discards |
+| **Isolation** — change one variable | Unclear what helped |
+| **Analogy** — borrow from adjacent domains | Truly stuck |
+| **Simplification** — remove complexity, preserve metric | Accumulated cruft |
+| **Scaling** — change by order of magnitude | Small tweaks plateaued |
+| **Decomposition** — split big change into parts | Promising change discarded |
+| **Sweep** — test parameter across a range | Right value unknown |
+## Convergence Signals
+| Signal | Meaning |
+|--------|---------|
+| 5+ discards in a row | Current approach exhausted |
+| Thought experiments repeating | Go empirical |
+| Results consistently confirm theory | Go deeper |
+| Results contradict theory | Model is wrong — rethink |
+| Metric plateau (<0.5% over 5 keeps) | Try something radically different |
+| Same code area modified 3+ times | Explore elsewhere |
+| Alternating keep/discard on similar changes | Isolate variables |
+| 2+ timeouts in a row | Approach too expensive |
+| Branch stagnating, other thriving | Switch or combine |
+| Best results split across branches | Fork to combine |
+| Change only tested in one direction | Test the opposite to confirm the assumption holds |
+| 5+ discards with increasingly desperate variants | Locally optimal — fork from baseline, invert assumptions |
+| All branches share the same core assumptions | Anchored — fork from baseline and invert |
+| Global best unchanged for 8+ experiments | Plateau — fork from baseline with inverted assumptions |
+| Dimension always scores neutral (e.g., 5/10) | Dimension unmeasurable — consider metric revision |