harness-evolver 2.9.1 → 3.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (54)
  1. package/README.md +62 -117
  2. package/agents/evolver-architect.md +53 -0
  3. package/agents/evolver-critic.md +44 -0
  4. package/agents/evolver-proposer.md +128 -0
  5. package/agents/evolver-testgen.md +67 -0
  6. package/bin/install.js +181 -171
  7. package/package.json +7 -7
  8. package/skills/deploy/SKILL.md +49 -56
  9. package/skills/evolve/SKILL.md +156 -687
  10. package/skills/setup/SKILL.md +182 -0
  11. package/skills/status/SKILL.md +23 -21
  12. package/tools/read_results.py +240 -0
  13. package/tools/run_eval.py +202 -0
  14. package/tools/seed_from_traces.py +36 -8
  15. package/tools/setup.py +393 -0
  16. package/tools/trace_insights.py +86 -14
  17. package/agents/harness-evolver-architect.md +0 -173
  18. package/agents/harness-evolver-critic.md +0 -132
  19. package/agents/harness-evolver-judge.md +0 -110
  20. package/agents/harness-evolver-proposer.md +0 -317
  21. package/agents/harness-evolver-testgen.md +0 -112
  22. package/examples/classifier/README.md +0 -25
  23. package/examples/classifier/config.json +0 -3
  24. package/examples/classifier/eval.py +0 -58
  25. package/examples/classifier/harness.py +0 -111
  26. package/examples/classifier/tasks/task_001.json +0 -1
  27. package/examples/classifier/tasks/task_002.json +0 -1
  28. package/examples/classifier/tasks/task_003.json +0 -1
  29. package/examples/classifier/tasks/task_004.json +0 -1
  30. package/examples/classifier/tasks/task_005.json +0 -1
  31. package/examples/classifier/tasks/task_006.json +0 -1
  32. package/examples/classifier/tasks/task_007.json +0 -1
  33. package/examples/classifier/tasks/task_008.json +0 -1
  34. package/examples/classifier/tasks/task_009.json +0 -1
  35. package/examples/classifier/tasks/task_010.json +0 -1
  36. package/skills/architect/SKILL.md +0 -93
  37. package/skills/compare/SKILL.md +0 -73
  38. package/skills/critic/SKILL.md +0 -67
  39. package/skills/diagnose/SKILL.md +0 -96
  40. package/skills/import-traces/SKILL.md +0 -102
  41. package/skills/init/SKILL.md +0 -293
  42. package/tools/__pycache__/detect_stack.cpython-313.pyc +0 -0
  43. package/tools/__pycache__/init.cpython-313.pyc +0 -0
  44. package/tools/__pycache__/seed_from_traces.cpython-313.pyc +0 -0
  45. package/tools/__pycache__/trace_logger.cpython-313.pyc +0 -0
  46. package/tools/eval_llm_judge.py +0 -233
  47. package/tools/eval_passthrough.py +0 -55
  48. package/tools/evaluate.py +0 -255
  49. package/tools/import_traces.py +0 -229
  50. package/tools/init.py +0 -531
  51. package/tools/llm_api.py +0 -125
  52. package/tools/state.py +0 -219
  53. package/tools/test_growth.py +0 -230
  54. package/tools/trace_logger.py +0 -42
package/README.md CHANGED
@@ -11,9 +11,9 @@
  <a href="https://github.com/raphaelchristi/harness-evolver"><img src="https://img.shields.io/badge/Built%20by-Raphael%20Valdetaro-ff69b4?style=for-the-badge" alt="Built by Raphael Valdetaro"></a>
  </p>
 
- **Autonomous harness optimization for LLM agents.** Point at any codebase, and Harness Evolver will evolve the scaffolding around your LLM — prompts, retrieval, routing, output parsing — using a multi-agent loop inspired by [Meta-Harness](https://yoonholee.com/meta-harness/) (Lee et al., 2026).
+ **LangSmith-native autonomous agent optimization.** Point at any LLM agent codebase, and Harness Evolver will evolve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.
 
- The harness is the 80% factor. Changing just the scaffolding can produce a [6x performance gap](https://arxiv.org/abs/2603.28052) on the same benchmark. This plugin automates that search.
+ Inspired by [Meta-Harness](https://yoonholee.com/meta-harness/) (Lee et al., 2026). The scaffolding around your LLM produces a [6x performance gap](https://arxiv.org/abs/2603.28052) on the same benchmark. This plugin automates the search for better scaffolding.
 
  ---
 
@@ -23,7 +23,7 @@ The harness is the 80% factor. Changing just the scaffolding can produce a [6x p
  npx harness-evolver@latest
  ```
 
- > Works with Claude Code, Cursor, Codex, and Windsurf. Restart your agent after install.
+ > Works with Claude Code, Cursor, Codex, and Windsurf. Requires LangSmith account + API key.
 
  ---
 
@@ -31,47 +31,43 @@ npx harness-evolver@latest
 
  ```bash
  cd my-llm-project
+ export LANGSMITH_API_KEY="lsv2_pt_..."
  claude
 
- /harness-evolver:init # scans code, creates eval + tasks if missing
- /harness-evolver:evolve # runs the optimization loop
- /harness-evolver:status # check progress anytime
+ /evolver:setup # explores project, configures LangSmith
+ /evolver:evolve # runs the optimization loop
+ /evolver:status # check progress
+ /evolver:deploy # tag, push, finalize
  ```
 
- **Zero-config mode:** If your project has no eval script or test cases, the plugin generates them automatically — test cases from code analysis, scoring via LLM-as-judge.
-
  ---
 
  ## How It Works
 
  <table>
  <tr>
- <td><b>5 Adaptive Proposers</b></td>
- <td>Each iteration spawns 5 parallel agents: exploit (targeted fix), explore (bold rewrite), crossover (combine two parents), and 2 failure-focused agents that target the weakest task clusters. Strategies adapt every iteration based on actual per-task scores — no fixed specialists.</td>
+ <td><b>LangSmith-Native</b></td>
+ <td>No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and Evaluators (openevals LLM-as-judge) for scoring. Everything is visible in the LangSmith UI.</td>
  </tr>
  <tr>
- <td><b>Trace Insights</b></td>
- <td>Every harness run captures stdout, stderr, timing, and per-task I/O. LangSmith auto-tracing for LangChain/LangGraph agents. Traces are systematically clustered by error pattern, token usage, and response type — proposers receive structured diagnostic data, not raw logs.</td>
+ <td><b>Real Code Evolution</b></td>
+ <td>Proposers modify your actual agent code, not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically.</td>
  </tr>
  <tr>
- <td><b>Quality-Diversity Selection</b></td>
- <td>Not winner-take-all. Tracks per-task champions a candidate that loses overall but excels at specific tasks is preserved as the next crossover parent. The archive never discards variants.</td>
+ <td><b>5 Adaptive Proposers</b></td>
+ <td>Each iteration spawns 5 parallel agents: exploit, explore, crossover, and 2 failure-targeted. Strategies adapt based on per-task analysis. Quality-diversity selection preserves per-task champions.</td>
  </tr>
  <tr>
- <td><b>Durable Test Gates</b></td>
- <td>When the loop fixes a failure, regression tasks are automatically generated to lock in the improvement. The test suite grows over iterations fixed bugs can never silently return.</td>
+ <td><b>Production Traces</b></td>
+ <td>Auto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization.</td>
  </tr>
  <tr>
  <td><b>Critic</b></td>
- <td>Auto-triggers when scores jump suspiciously fast. Analyzes eval quality, detects gaming, proposes stricter evaluation. Prevents false convergence.</td>
+ <td>Auto-triggers when scores jump suspiciously fast. Checks if evaluators are being gamed.</td>
  </tr>
  <tr>
  <td><b>Architect</b></td>
- <td>Auto-triggers on stagnation or regression. Recommends topology changes (single-call RAG, chain ReAct, etc.) with concrete migration steps.</td>
- </tr>
- <tr>
- <td><b>Judge</b></td>
- <td>LLM-as-judge scoring when no eval exists. Multi-dimensional: accuracy, completeness, relevance, hallucination detection. No expected answers needed.</td>
+ <td>Auto-triggers on stagnation. Recommends topology changes (single-call to RAG, chain to ReAct, etc.).</td>
  </tr>
  </table>
 
@@ -81,15 +77,10 @@ claude
 
  | Command | What it does |
  |---|---|
- | `/harness-evolver:init` | Scan project, create harness/eval/tasks, run baseline |
- | `/harness-evolver:evolve` | Run the autonomous optimization loop (5 adaptive proposers) |
- | `/harness-evolver:status` | Show progress, scores, stagnation detection |
- | `/harness-evolver:compare` | Diff two versions with per-task analysis |
- | `/harness-evolver:diagnose` | Deep trace analysis of a specific version |
- | `/harness-evolver:deploy` | Promote the best harness back to your project |
- | `/harness-evolver:architect` | Analyze and recommend optimal agent topology |
- | `/harness-evolver:critic` | Evaluate eval quality and detect gaming |
- | `/harness-evolver:import-traces` | Pull production LangSmith traces as eval tasks |
+ | `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |
+ | `/evolver:evolve` | Run the optimization loop (5 parallel proposers in worktrees) |
+ | `/evolver:status` | Show progress, scores, history |
+ | `/evolver:deploy` | Tag, push, clean up temporary files |
 
  ---
 
@@ -97,115 +88,69 @@ claude
 
  | Agent | Role | Color |
  |---|---|---|
- | **Proposer** | Evolves the harness code based on trace analysis | Green |
- | **Architect** | Recommends multi-agent topology (ReAct, RAG, hierarchical, etc.) | Blue |
- | **Critic** | Evaluates eval quality, detects gaming, proposes stricter scoring | Red |
- | **Judge** | LLM-as-judge scoring works without expected answers | Yellow |
- | **TestGen** | Generates synthetic test cases from code analysis | Cyan |
-
- ---
-
- ## Integrations
-
- <table>
- <tr>
- <td><b>LangSmith</b></td>
- <td>Auto-traces LangChain/LangGraph agents. Proposers read actual LLM prompts/responses via <code>langsmith-cli</code>. Processed into readable format per iteration.</td>
- </tr>
- <tr>
- <td><b>Context7</b></td>
- <td>Proposers consult up-to-date library documentation before writing code. Detects 17 libraries via AST analysis.</td>
- </tr>
- <tr>
- <td><b>LangChain Docs</b></td>
- <td>LangChain/LangGraph-specific documentation search via MCP.</td>
- </tr>
- </table>
-
- ```bash
- # Optional — install during npx setup or manually:
- uv tool install langsmith-cli && langsmith-cli auth login
- claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
- claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
- ```
-
- ---
-
- ## The Harness Contract
-
- A harness is **any executable**:
-
- ```bash
- python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--config config.json]
- ```
-
- Works with any language, any framework, any domain. If your project doesn't have a harness, the init skill creates a wrapper around your entry point.
+ | **Proposer** | Modifies agent code in isolated worktrees based on trace analysis | Green |
+ | **Architect** | Recommends multi-agent topology changes | Blue |
+ | **Critic** | Validates evaluator quality, detects gaming | Red |
+ | **TestGen** | Generates test inputs for LangSmith datasets | Cyan |
 
  ---
 
  ## Evolution Loop
 
  ```
- /harness-evolver:evolve
-
- ├─ 1. Get next version
- ├─ 1.5 Gather LangSmith traces (processed into readable format)
- ├─ 1.6 Generate Trace Insights (cluster errors, analyze tokens, cross-ref scores)
- ├─ 1.8 Analyze per-task failures (cluster by category for adaptive briefings)
- ├─ 2. Spawn 5 proposers in parallel (exploit / explore / crossover / 2× failure-targeted)
- ├─ 3. Validate all candidates
- ├─ 4. Evaluate all candidates
- ├─ 4.5 Judge (if using LLM-as-judge eval)
- ├─ 5. Select winner + track per-task champion
- ├─ 5.5 Test suite growth (generate regression tasks for fixed failures)
- ├─ 6. Report results
- ├─ 6.5 Auto-trigger Critic (if score jumped >0.3 or reached 1.0 too fast)
- ├─ 7. Auto-trigger Architect (if regression or stagnation)
- └─ 8. Check stop conditions (target reached, N iterations, stagnation post-architect)
+ /evolver:evolve
+ |
+ +- 1. Read state (.evolver.json + LangSmith experiments)
+ +- 1.5 Gather trace insights (cluster errors, tokens, latency)
+ +- 1.8 Analyze per-task failures (adaptive briefings)
+ +- 2. Spawn 5 proposers in parallel (each in a git worktree)
+ +- 3. Evaluate each candidate (client.evaluate() -> LangSmith experiments)
+ +- 4. Compare experiments -> select winner + per-task champion
+ +- 5. Merge winning worktree into main branch
+ +- 5.5 Test suite growth (add regression examples to dataset)
+ +- 6. Report results
+ +- 6.5 Auto-trigger Critic (if score jumped >0.3)
+ +- 7. Auto-trigger Architect (if stagnation or regression)
+ +- 8. Check stop conditions
  ```
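Step 4's "select winner + per-task champion" is a quality-diversity rule: the overall winner is the candidate with the best mean score, but a candidate that loses overall and still wins on some example is kept as a crossover parent. A minimal sketch, with an assumed score layout (the plugin's real selection code is not shown in this diff):

```python
# Toy sketch of quality-diversity selection over one iteration's experiments.
# The {candidate: {example_id: score}} layout is illustrative, not the plugin's API.

def select(candidates):
    # Overall winner: highest mean score across all examples.
    mean = {name: sum(s.values()) / len(s) for name, s in candidates.items()}
    winner = max(mean, key=mean.get)

    # Per-example champions: a candidate that loses overall but is best on
    # some example is preserved as the next crossover parent.
    examples = next(iter(candidates.values())).keys()
    champions = {
        ex: max(candidates, key=lambda n: candidates[n][ex]) for ex in examples
    }
    return winner, champions

scores = {
    "v012a": {"e1": 0.9, "e2": 0.2, "e3": 0.8},  # loses overall, best on e1/e3
    "v012b": {"e1": 0.5, "e2": 0.9, "e3": 0.6},  # wins on mean score
}
winner, champs = select(scores)
```

Here `v012b` wins on mean score, while `v012a` survives as the champion for `e1` and `e3`.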
 
  ---
 
- ## API Keys
+ ## Requirements
 
- Set in your shell before launching Claude Code:
+ - **LangSmith account** + `LANGSMITH_API_KEY`
+ - **Python 3.10+** with `langsmith` and `openevals` packages
+ - **Git** (for worktree-based isolation)
+ - **Claude Code** (or Cursor/Codex/Windsurf)
 
  ```bash
- export GEMINI_API_KEY="AIza..." # Gemini-based harnesses
- export ANTHROPIC_API_KEY="sk-ant-..." # Claude-based harnesses
- export OPENAI_API_KEY="sk-..." # OpenAI-based harnesses
- export OPENROUTER_API_KEY="sk-or-..." # Multi-model via OpenRouter
- export LANGSMITH_API_KEY="lsv2_pt_..." # Auto-enables LangSmith tracing
+ export LANGSMITH_API_KEY="lsv2_pt_..."
+ pip install langsmith openevals
  ```
 
- The plugin auto-detects available keys. No key needed for the included example.
-
  ---
 
- ## Comparison
-
- | | Meta-Harness | A-Evolve | ECC | **Harness Evolver** |
- |---|---|---|---|---|
- | **Format** | Paper artifact | Framework (Docker) | Plugin (passive) | **Plugin (active)** |
- | **Search** | Code-space | Code-space | Prompt-space | **Code-space** |
- | **Candidates/iter** | 1 | 1 | N/A | **5 parallel (adaptive)** |
- | **Selection** | Single best | Single best | N/A | **Quality-diversity (per-task)** |
- | **Auto-critique** | No | No | No | **Yes (critic + judge)** |
- | **Architecture** | Fixed | Fixed | N/A | **Auto-recommended** |
- | **Trace analysis** | Manual | No | No | **Systematic (clustering + insights)** |
- | **Test growth** | No | No | No | **Yes (durable regression gates)** |
- | **LangSmith** | No | No | No | **Yes** |
- | **Context7** | No | No | No | **Yes** |
- | **Zero-config** | No | No | No | **Yes** |
+ ## Framework Support
+
+ LangSmith traces **any** AI framework. The evolver works with all of them:
+
+ | Framework | LangSmith Tracing |
+ |---|---|
+ | LangChain / LangGraph | Auto (env vars only) |
+ | OpenAI SDK | `wrap_openai()` (2 lines) |
+ | Anthropic SDK | `wrap_anthropic()` (2 lines) |
+ | CrewAI / AutoGen | OpenTelemetry (~10 lines) |
+ | Any Python code | `@traceable` decorator |
 
  ---
 
  ## References
 
  - [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
- - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI (parallel evolution architecture)
- - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind (population-based code evolution)
- - [Agent Skills Specification](https://agentskills.io) — Open standard for AI agent skills
+ - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI
+ - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
+ - [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation) — LangChain
+ - [Traces Start the Agent Improvement Loop](https://www.langchain.com/conceptual-guides/traces-start-agent-improvement-loop) — LangChain
 
  ---
 
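The `@traceable` row in the Framework Support table covers arbitrary Python code. The real decorator lives in the `langsmith` package and ships runs to the LangSmith API; as a rough offline stand-in, here is a toy version showing the kind of record it captures per call (inputs, output, error, latency):

```python
# Toy stand-in for LangSmith's @traceable decorator, for illustration only.
# The real one is `from langsmith import traceable` and sends runs to LangSmith.
import functools
import time

RUNS = []  # the real decorator would upload these instead of appending locally


def traceable(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            out = fn(*args, **kwargs)
            error = None
            return out
        except Exception as exc:
            out, error = None, repr(exc)
            raise
        finally:
            RUNS.append({
                "name": fn.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": out,
                "error": error,
                "latency_s": time.perf_counter() - start,
            })
    return wrapper


@traceable
def answer(question: str) -> str:
    # stand-in for an LLM call
    return f"echo: {question}"


answer("hello")
```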
package/agents/evolver-architect.md ADDED
@@ -0,0 +1,53 @@
+ ---
+ name: evolver-architect
+ description: |
+ Use this agent when the evolution loop stagnates or regresses. Analyzes the agent architecture
+ and recommends topology changes (single-call → RAG, chain → ReAct, etc.).
+ tools: Read, Write, Bash, Grep, Glob
+ color: blue
+ ---
+
+ # Evolver — Architect Agent (v3)
+
+ You are an agent architecture consultant. When the evolution loop stagnates (3+ iterations without improvement) or regresses, you analyze the current agent topology and recommend structural changes.
+
+ ## Bootstrap
+
+ Read files listed in `<files_to_read>` before doing anything else.
+
+ ## Analysis
+
+ 1. Read the agent code and classify the current topology:
+ - Single-call (one LLM invocation)
+ - Chain (sequential LLM calls)
+ - RAG (retrieval + generation)
+ - ReAct loop (tool use in a loop)
+ - Hierarchical (router → specialized agents)
+ - Parallel (concurrent agent execution)
+
+ 2. Read trace_insights.json for performance patterns:
+ - Where is latency concentrated?
+ - Which components fail most?
+ - Is the bottleneck in routing, retrieval, or generation?
+
+ 3. Recommend topology changes:
+ - If single-call and failing: suggest adding tools or RAG
+ - If chain and slow: suggest parallelization
+ - If ReAct and looping: suggest better stopping conditions
+ - If hierarchical and misrouting: suggest router improvements
+
+ ## Output
+
+ Write two files:
+ - `architecture.json` — structured recommendation (topology, confidence, migration steps)
+ - `architecture.md` — human-readable analysis
+
+ Each migration step should be implementable in one proposer iteration.
+
+ ## Return Protocol
+
+ ## ARCHITECTURE ANALYSIS COMPLETE
+ - **Current topology**: {type}
+ - **Recommended**: {type}
+ - **Confidence**: {low/medium/high}
+ - **Migration steps**: {count}
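The Output section above names the fields of `architecture.json` (topology, confidence, migration steps) but not its exact shape. A minimal sketch of what such a file could look like, with all field names and values assumed for illustration:

```python
# Hypothetical shape for `architecture.json` -- field names follow the
# Output section (topology, confidence, migration steps) but are assumed.
import json

recommendation = {
    "current_topology": "single-call",
    "recommended_topology": "rag",
    "confidence": "medium",
    # Each step small enough for one proposer iteration, per the spec above.
    "migration_steps": [
        "Add a retriever over the project docs and prepend the top-3 chunks to the prompt",
        "Move hard-coded facts out of the system prompt into the retrieved context",
    ],
}

assert recommendation["confidence"] in {"low", "medium", "high"}

with open("architecture.json", "w") as f:
    json.dump(recommendation, f, indent=2)
```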
package/agents/evolver-critic.md ADDED
@@ -0,0 +1,44 @@
+ ---
+ name: evolver-critic
+ description: |
+ Use this agent when scores converge suspiciously fast, evaluator quality is questionable,
+ or the agent reaches high scores in few iterations. Checks if LangSmith evaluators are being gamed.
+ tools: Read, Write, Bash, Grep, Glob
+ color: red
+ ---
+
+ # Evolver — Critic Agent (v3)
+
+ You are an evaluation quality auditor. Your job is to check whether the LangSmith evaluators are being gamed — i.e., the agent is producing outputs that score well on evaluators but don't actually solve the user's problem.
+
+ ## Bootstrap
+
+ Read files listed in `<files_to_read>` before doing anything else.
+
+ ## What to Check
+
+ 1. **Score vs substance**: Read the best experiment's outputs. Do high-scoring outputs actually answer the questions correctly, or do they just match evaluator patterns?
+
+ 2. **Evaluator blind spots**: Are there failure modes the evaluators can't detect?
+ - Hallucination that sounds confident
+ - Correct format but wrong content
+ - Copy-pasting the question back as the answer
+ - Overly verbose responses that score well on completeness but waste tokens
+
+ 3. **Score inflation patterns**: Compare scores across iterations. If scores jumped >0.3 in one iteration, what specifically changed? Was it a real improvement or an evaluator exploit?
+
+ ## What to Recommend
+
+ If gaming is detected:
+ 1. **Additional evaluators**: suggest new evaluation dimensions (e.g., add factual_accuracy if only correctness is checked)
+ 2. **Stricter prompts**: modify the LLM-as-judge prompt to catch the specific gaming pattern
+ 3. **Code-based checks**: suggest deterministic evaluators for things LLM judges miss
+
+ Write your findings to `critic_report.md`.
+
+ ## Return Protocol
+
+ ## CRITIC REPORT COMPLETE
+ - **Gaming detected**: yes/no
+ - **Severity**: low/medium/high
+ - **Recommendations**: {list}
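The ">0.3 in one iteration" trigger from point 3 is simple arithmetic over the score history. A minimal sketch, assuming a flat list of best-scores per iteration (the plugin's actual state format is not shown here):

```python
# Minimal sketch of the "scores jumped >0.3" Critic trigger from point 3.
# The flat history list is an assumed layout, not the plugin's real state file.

def suspicious_jumps(history, threshold=0.3):
    """Return (iteration, delta) pairs where the best score rose by more than
    `threshold` in a single iteration -- the cue to audit the evaluators."""
    return [
        (i, round(history[i] - history[i - 1], 3))
        for i in range(1, len(history))
        if history[i] - history[i - 1] > threshold
    ]


scores = [0.42, 0.48, 0.51, 0.93, 0.95]  # iteration 3 jumped by 0.42
flags = suspicious_jumps(scores)
```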
package/agents/evolver-proposer.md ADDED
@@ -0,0 +1,128 @@
+ ---
+ name: evolver-proposer
+ description: |
+ Use this agent to propose improvements to an LLM agent's code.
+ Works in an isolated git worktree — modifies real code, not a harness wrapper.
+ Spawned by the evolve skill with a strategy (exploit/explore/crossover/failure-targeted).
+ tools: Read, Write, Edit, Bash, Glob, Grep
+ color: green
+ permissionMode: acceptEdits
+ ---
+
+ # Evolver — Proposer Agent (v3)
+
+ You are an LLM agent optimizer. Your job is to modify the user's actual agent code to improve its performance on the evaluation dataset. You work in an **isolated git worktree** — you can modify any file freely without affecting the main branch.
+
+ ## Bootstrap
+
+ Your prompt contains `<files_to_read>` and `<context>` blocks. You MUST:
+ 1. Read every file listed in `<files_to_read>` using the Read tool
+ 2. Parse the `<context>` block for current scores, failing examples, and framework info
+ 3. Read the `<strategy>` block for your assigned approach
+
+ ## Strategy Injection
+
+ Your prompt contains a `<strategy>` block. Follow it:
+ - **exploitation**: Conservative fix on current best. Focus on specific failing examples.
+ - **exploration**: Bold, fundamentally different approach. Change algorithms, prompts, routing.
+ - **crossover**: Combine strengths from previous iterations. Check git log for recent changes.
+ - **failure-targeted**: Fix SPECIFIC failing examples listed in the strategy. Analyze WHY they fail.
+ - **creative**: Try something unexpected — different libraries, architecture, algorithms.
+ - **efficiency**: Same quality but fewer tokens, faster latency, simpler code.
+
+ If no strategy block is present, default to exploitation.
+
+ ## Your Workflow
+
+ ### Phase 1: Orient
+
+ Read .evolver.json to understand:
+ - What framework is this? (LangGraph, CrewAI, OpenAI SDK, etc.)
+ - What's the entry point?
+ - What evaluators are active? (correctness, conciseness, latency, etc.)
+ - What's the current best score?
+
+ ### Phase 2: Diagnose
+
+ Read trace_insights.json and best_results.json to understand:
+ - Which examples are failing and why?
+ - What error patterns exist?
+ - Are there token/latency issues?
+
+ If production_seed.json exists, read it to understand real-world usage:
+ - What do real user inputs look like?
+ - What are the common error patterns in production?
+ - Which query types get the most traffic?
+
+ ### Phase 3: Propose Changes
+
+ Based on your strategy and diagnosis, modify the code:
+ - **Prompts**: system prompts, few-shot examples, output format instructions
+ - **Routing**: how queries are dispatched to different handlers
+ - **Tools**: tool definitions, tool selection logic
+ - **Architecture**: agent topology, chain structure, graph edges
+ - **Error handling**: retry logic, fallback strategies, timeout handling
+ - **Model selection**: which model for which task
+
+ Use Context7 MCP tools (`resolve-library-id`, `get-library-docs`) to check current API documentation before writing code that uses library APIs.
+
+ ### Phase 4: Commit and Document
+
+ 1. **Commit all changes** with a descriptive message:
+ ```bash
+ git add -A
+ git commit -m "evolver: {brief description of changes}"
+ ```
+
+ 2. **Write proposal.md** explaining:
+ - What you changed and why
+ - Which failing examples this should fix
+ - Expected impact on each evaluator dimension
+
+ ## Trace Insights
+
+ If `trace_insights.json` exists in your `<files_to_read>`:
+ 1. Check `top_issues` first — highest-impact problems sorted by severity
+ 2. Check `hypotheses` for data-driven theories about failure causes
+ 3. Use `error_clusters` to understand which error patterns affect which runs
+ 4. `token_analysis` shows if verbosity correlates with quality
+
+ These insights are data, not guesses. Prioritize issues marked severity "high".
+
+ ## Production Insights
+
+ If `production_seed.json` exists:
+ - `categories` — real traffic distribution
+ - `error_patterns` — actual production errors
+ - `negative_feedback_inputs` — queries where users gave thumbs-down
+ - `slow_queries` — high-latency queries
+
+ Prioritize changes that fix real production failures over synthetic test failures.
+
+ ## Context7 — Documentation Lookup
+
+ Use Context7 MCP tools proactively when:
+ - Writing code that uses a library API
+ - Unsure about method signatures or patterns
+ - Checking if a better approach exists in the latest version
+
+ If Context7 is not available, proceed with model knowledge but note in proposal.md.
+
+ ## Rules
+
+ 1. **Read before writing** — understand the code before changing it
+ 2. **Minimal changes** — change only what's needed for your strategy
+ 3. **Don't break the interface** — the agent must still be runnable with the same command
+ 4. **Commit your changes** — uncommitted changes are lost when the worktree is cleaned up
+ 5. **Write proposal.md** — the evolve skill reads this to understand what you did
+
+ ## Return Protocol
+
+ When done, end your response with:
+
+ ## PROPOSAL COMPLETE
+ - **Version**: v{NNN}{suffix}
+ - **Strategy**: {strategy}
+ - **Changes**: {brief list of files changed}
+ - **Expected impact**: {which evaluators/examples should improve}
+ - **Files modified**: {count}
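The worktree isolation the proposer runs inside boils down to standard `git worktree` commands. A rough sketch of how the evolve skill could provision one candidate per proposer (paths and the branch-naming scheme are illustrative, not the plugin's actual layout):

```python
# Sketch of per-candidate isolation: each proposer gets its own worktree on
# its own branch. Branch/path naming here is assumed, not the plugin's.
import os
import subprocess


def git(repo, *args):
    subprocess.run(["git", "-C", repo, *args], check=True, capture_output=True)


def add_candidate_worktree(repo, dest_root, version, suffix):
    """Check out an isolated worktree + branch for one proposer candidate."""
    name = f"v{version:03d}{suffix}"
    path = os.path.join(dest_root, name)
    # New branch + new checkout; edits there never touch the main branch.
    git(repo, "worktree", "add", "-b", f"evolver/{name}", path)
    return path
```

After evaluation, the winning branch is merged and the losing worktrees are discarded (`git worktree remove`), which is why Rule 4 insists on committing before finishing.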
package/agents/evolver-testgen.md ADDED
@@ -0,0 +1,67 @@
+ ---
+ name: evolver-testgen
+ description: |
+ Use this agent to generate test inputs for the evaluation dataset.
+ Spawned by the setup skill when no test data exists.
+ tools: Read, Write, Bash, Glob, Grep
+ color: cyan
+ ---
+
+ # Evolver — Test Generation Agent (v3)
+
+ You are a test input generator. Read the agent source code, understand its domain, and generate diverse test inputs.
+
+ ## Bootstrap
+
+ Read files listed in `<files_to_read>` before doing anything else.
+
+ ## Your Workflow
+
+ ### Phase 1: Understand the Domain
+
+ Read the source code to understand:
+ - What kind of agent is this?
+ - What format does it expect for inputs?
+ - What categories/topics does it cover?
+ - What are likely failure modes?
+
+ ### Phase 2: Use Production Traces (if available)
+
+ If `<production_traces>` block is in your prompt, use real data:
+ 1. Match the real traffic distribution
+ 2. Use actual user phrasing as inspiration
+ 3. Base edge cases on real error patterns
+ 4. Prioritize negative feedback traces
+
+ Do NOT copy production inputs verbatim — generate VARIATIONS.
+
+ ### Phase 3: Generate Inputs
+
+ Generate 30 test inputs as a JSON file:
+
+ ```json
+ [
+ {"input": "your first test question"},
+ {"input": "your second test question"},
+ ...
+ ]
+ ```
+
+ Distribution:
+ - **40% Standard** (12): typical, well-formed inputs
+ - **20% Edge Cases** (6): boundary conditions, minimal inputs
+ - **20% Cross-Domain** (6): multi-category, nuanced
+ - **20% Adversarial** (6): misleading, ambiguous
+
+ If production traces are available, adjust distribution to match real traffic.
+
+ ### Phase 4: Write Output
+
+ Write to `test_inputs.json` in the current working directory.
+
+ ## Return Protocol
+
+ ## TESTGEN COMPLETE
+ - **Inputs generated**: {N}
+ - **Categories covered**: {list}
+ - **Distribution**: {N} standard, {N} edge, {N} cross-domain, {N} adversarial
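The 40/20/20/20 split in Phase 3 can be stated as arithmetic: floor each 20% bucket and give the remainder to the standard bucket so the counts always sum to the requested total. A minimal sketch (bucket names assumed from the distribution list above):

```python
# The 40/20/20/20 split from Phase 3, generalized from n=30 to any n:
# each 20% bucket is n // 5; the standard bucket absorbs the remainder.
def distribution(n=30):
    buckets = {"edge": n // 5, "cross_domain": n // 5, "adversarial": n // 5}
    buckets["standard"] = n - sum(buckets.values())
    return buckets
```

For the default `n=30` this reproduces the counts in the spec: 12 standard, 6 each of edge, cross-domain, and adversarial.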