harness-evolver 2.9.1 → 3.0.0
- package/README.md +62 -117
- package/agents/evolver-architect.md +53 -0
- package/agents/evolver-critic.md +44 -0
- package/agents/evolver-proposer.md +128 -0
- package/agents/evolver-testgen.md +67 -0
- package/bin/install.js +181 -171
- package/package.json +7 -7
- package/skills/deploy/SKILL.md +49 -56
- package/skills/evolve/SKILL.md +156 -687
- package/skills/setup/SKILL.md +182 -0
- package/skills/status/SKILL.md +23 -21
- package/tools/read_results.py +240 -0
- package/tools/run_eval.py +202 -0
- package/tools/seed_from_traces.py +36 -8
- package/tools/setup.py +393 -0
- package/tools/trace_insights.py +86 -14
- package/agents/harness-evolver-architect.md +0 -173
- package/agents/harness-evolver-critic.md +0 -132
- package/agents/harness-evolver-judge.md +0 -110
- package/agents/harness-evolver-proposer.md +0 -317
- package/agents/harness-evolver-testgen.md +0 -112
- package/examples/classifier/README.md +0 -25
- package/examples/classifier/config.json +0 -3
- package/examples/classifier/eval.py +0 -58
- package/examples/classifier/harness.py +0 -111
- package/examples/classifier/tasks/task_001.json +0 -1
- package/examples/classifier/tasks/task_002.json +0 -1
- package/examples/classifier/tasks/task_003.json +0 -1
- package/examples/classifier/tasks/task_004.json +0 -1
- package/examples/classifier/tasks/task_005.json +0 -1
- package/examples/classifier/tasks/task_006.json +0 -1
- package/examples/classifier/tasks/task_007.json +0 -1
- package/examples/classifier/tasks/task_008.json +0 -1
- package/examples/classifier/tasks/task_009.json +0 -1
- package/examples/classifier/tasks/task_010.json +0 -1
- package/skills/architect/SKILL.md +0 -93
- package/skills/compare/SKILL.md +0 -73
- package/skills/critic/SKILL.md +0 -67
- package/skills/diagnose/SKILL.md +0 -96
- package/skills/import-traces/SKILL.md +0 -102
- package/skills/init/SKILL.md +0 -293
- package/tools/__pycache__/detect_stack.cpython-313.pyc +0 -0
- package/tools/__pycache__/init.cpython-313.pyc +0 -0
- package/tools/__pycache__/seed_from_traces.cpython-313.pyc +0 -0
- package/tools/__pycache__/trace_logger.cpython-313.pyc +0 -0
- package/tools/eval_llm_judge.py +0 -233
- package/tools/eval_passthrough.py +0 -55
- package/tools/evaluate.py +0 -255
- package/tools/import_traces.py +0 -229
- package/tools/init.py +0 -531
- package/tools/llm_api.py +0 -125
- package/tools/state.py +0 -219
- package/tools/test_growth.py +0 -230
- package/tools/trace_logger.py +0 -42
package/README.md
CHANGED

@@ -11,9 +11,9 @@
    <a href="https://github.com/raphaelchristi/harness-evolver"><img src="https://img.shields.io/badge/Built%20by-Raphael%20Valdetaro-ff69b4?style=for-the-badge" alt="Built by Raphael Valdetaro"></a>
  </p>

- **
+ **LangSmith-native autonomous agent optimization.** Point at any LLM agent codebase, and Harness Evolver will evolve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.

-
+ Inspired by [Meta-Harness](https://yoonholee.com/meta-harness/) (Lee et al., 2026). The scaffolding around your LLM produces a [6x performance gap](https://arxiv.org/abs/2603.28052) on the same benchmark. This plugin automates the search for better scaffolding.

  ---

@@ -23,7 +23,7 @@ The harness is the 80% factor. Changing just the scaffolding can produce a [6x p
  npx harness-evolver@latest
  ```

- > Works with Claude Code, Cursor, Codex, and Windsurf.
+ > Works with Claude Code, Cursor, Codex, and Windsurf. Requires LangSmith account + API key.

  ---

@@ -31,47 +31,43 @@

  ```bash
  cd my-llm-project
+ export LANGSMITH_API_KEY="lsv2_pt_..."
  claude

- /
- /
- /
+ /evolver:setup    # explores project, configures LangSmith
+ /evolver:evolve   # runs the optimization loop
+ /evolver:status   # check progress
+ /evolver:deploy   # tag, push, finalize
  ```

- **Zero-config mode:** If your project has no eval script or test cases, the plugin generates them automatically — test cases from code analysis, scoring via LLM-as-judge.
-
  ---

  ## How It Works

  <table>
  <tr>
- <td><b>
- <td>
+ <td><b>LangSmith-Native</b></td>
+ <td>No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and Evaluators (openevals LLM-as-judge) for scoring. Everything is visible in the LangSmith UI.</td>
  </tr>
  <tr>
- <td><b>
- <td>
+ <td><b>Real Code Evolution</b></td>
+ <td>Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically.</td>
  </tr>
  <tr>
- <td><b>
- <td>
+ <td><b>5 Adaptive Proposers</b></td>
+ <td>Each iteration spawns 5 parallel agents: exploit, explore, crossover, and 2 failure-targeted. Strategies adapt based on per-task analysis. Quality-diversity selection preserves per-task champions.</td>
  </tr>
  <tr>
- <td><b>
- <td>
+ <td><b>Production Traces</b></td>
+ <td>Auto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization.</td>
  </tr>
  <tr>
  <td><b>Critic</b></td>
- <td>Auto-triggers when scores jump suspiciously fast.
+ <td>Auto-triggers when scores jump suspiciously fast. Checks if evaluators are being gamed.</td>
  </tr>
  <tr>
  <td><b>Architect</b></td>
- <td>Auto-triggers on stagnation
- </tr>
- <tr>
- <td><b>Judge</b></td>
- <td>LLM-as-judge scoring when no eval exists. Multi-dimensional: accuracy, completeness, relevance, hallucination detection. No expected answers needed.</td>
+ <td>Auto-triggers on stagnation. Recommends topology changes (single-call to RAG, chain to ReAct, etc.).</td>
  </tr>
  </table>

@@ -81,15 +77,10 @@ claude

  | Command | What it does |
  |---|---|
- | `/
- | `/
- | `/
- | `/
- | `/harness-evolver:diagnose` | Deep trace analysis of a specific version |
- | `/harness-evolver:deploy` | Promote the best harness back to your project |
- | `/harness-evolver:architect` | Analyze and recommend optimal agent topology |
- | `/harness-evolver:critic` | Evaluate eval quality and detect gaming |
- | `/harness-evolver:import-traces` | Pull production LangSmith traces as eval tasks |
+ | `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |
+ | `/evolver:evolve` | Run the optimization loop (5 parallel proposers in worktrees) |
+ | `/evolver:status` | Show progress, scores, history |
+ | `/evolver:deploy` | Tag, push, clean up temporary files |

  ---

@@ -97,115 +88,69 @@ claude

  | Agent | Role | Color |
  |---|---|---|
- | **Proposer** |
- | **Architect** | Recommends multi-agent topology
- | **Critic** |
- | **
- | **TestGen** | Generates synthetic test cases from code analysis | Cyan |
-
- ---
-
- ## Integrations
-
- <table>
- <tr>
- <td><b>LangSmith</b></td>
- <td>Auto-traces LangChain/LangGraph agents. Proposers read actual LLM prompts/responses via <code>langsmith-cli</code>. Processed into readable format per iteration.</td>
- </tr>
- <tr>
- <td><b>Context7</b></td>
- <td>Proposers consult up-to-date library documentation before writing code. Detects 17 libraries via AST analysis.</td>
- </tr>
- <tr>
- <td><b>LangChain Docs</b></td>
- <td>LangChain/LangGraph-specific documentation search via MCP.</td>
- </tr>
- </table>
-
- ```bash
- # Optional — install during npx setup or manually:
- uv tool install langsmith-cli && langsmith-cli auth login
- claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
- claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
- ```
-
- ---
-
- ## The Harness Contract
-
- A harness is **any executable**:
-
- ```bash
- python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--config config.json]
- ```
-
- Works with any language, any framework, any domain. If your project doesn't have a harness, the init skill creates a wrapper around your entry point.
+ | **Proposer** | Modifies agent code in isolated worktrees based on trace analysis | Green |
+ | **Architect** | Recommends multi-agent topology changes | Blue |
+ | **Critic** | Validates evaluator quality, detects gaming | Red |
+ | **TestGen** | Generates test inputs for LangSmith datasets | Cyan |

  ---

  ## Evolution Loop

  ```
- /
-
-
-
-
-
-
-
-
-
-
-
-
- ├─ 7. Auto-trigger Architect (if regression or stagnation)
- └─ 8. Check stop conditions (target reached, N iterations, stagnation post-architect)
+ /evolver:evolve
+ |
+ +- 1. Read state (.evolver.json + LangSmith experiments)
+ +- 1.5 Gather trace insights (cluster errors, tokens, latency)
+ +- 1.8 Analyze per-task failures (adaptive briefings)
+ +- 2. Spawn 5 proposers in parallel (each in a git worktree)
+ +- 3. Evaluate each candidate (client.evaluate() -> LangSmith experiments)
+ +- 4. Compare experiments -> select winner + per-task champion
+ +- 5. Merge winning worktree into main branch
+ +- 5.5 Test suite growth (add regression examples to dataset)
+ +- 6. Report results
+ +- 6.5 Auto-trigger Critic (if score jumped >0.3)
+ +- 7. Auto-trigger Architect (if stagnation or regression)
+ +- 8. Check stop conditions
  ```

  ---

- ##
+ ## Requirements

-
+ - **LangSmith account** + `LANGSMITH_API_KEY`
+ - **Python 3.10+** with `langsmith` and `openevals` packages
+ - **Git** (for worktree-based isolation)
+ - **Claude Code** (or Cursor/Codex/Windsurf)

  ```bash
- export
-
- export OPENAI_API_KEY="sk-..." # OpenAI-based harnesses
- export OPENROUTER_API_KEY="sk-or-..." # Multi-model via OpenRouter
- export LANGSMITH_API_KEY="lsv2_pt_..." # Auto-enables LangSmith tracing
+ export LANGSMITH_API_KEY="lsv2_pt_..."
+ pip install langsmith openevals
  ```

- The plugin auto-detects available keys. No key needed for the included example.
-
  ---

- ##
-
-
-
- |
-
- |
- |
- |
- |
- |
- | **Test growth** | No | No | No | **Yes (durable regression gates)** |
- | **LangSmith** | No | No | No | **Yes** |
- | **Context7** | No | No | No | **Yes** |
- | **Zero-config** | No | No | No | **Yes** |
+ ## Framework Support
+
+ LangSmith traces **any** AI framework. The evolver works with all of them:
+
+ | Framework | LangSmith Tracing |
+ |---|---|
+ | LangChain / LangGraph | Auto (env vars only) |
+ | OpenAI SDK | `wrap_openai()` (2 lines) |
+ | Anthropic SDK | `wrap_anthropic()` (2 lines) |
+ | CrewAI / AutoGen | OpenTelemetry (~10 lines) |
+ | Any Python code | `@traceable` decorator |

  ---

  ## References

  - [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
- - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI
- - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
- - [
+ - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI
+ - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
+ - [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation) — LangChain
+ - [Traces Start the Agent Improvement Loop](https://www.langchain.com/conceptual-guides/traces-start-agent-improvement-loop) — LangChain

  ---
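Step 4 of the evolution loop above (compare experiments, select a winner plus per-task champions) can be sketched in plain Python. This is an illustrative sketch only; the function name, the candidate/example identifiers, and the score shapes are assumptions, not the plugin's actual internals.

```python
# Quality-diversity selection sketch: the winner is the candidate with the
# best mean score, but the best candidate on each individual example is
# also recorded, so a per-task specialist is never silently discarded.
def select(results: dict[str, dict[str, float]]) -> tuple[str, dict[str, str]]:
    """results maps candidate id -> {example id: score}."""
    # Winner: highest mean score across all examples.
    winner = max(results, key=lambda c: sum(results[c].values()) / len(results[c]))
    # Champions: best candidate on each individual example.
    examples = next(iter(results.values()))
    champions = {ex: max(results, key=lambda c: results[c][ex]) for ex in examples}
    return winner, champions

scores = {
    "v001a": {"ex1": 0.9, "ex2": 0.2},  # uniquely strong on ex1
    "v001b": {"ex1": 0.5, "ex2": 0.8},  # better on average
}
winner, champions = select(scores)
# winner == "v001b"; champions still keeps "v001a" as the ex1 specialist
```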
package/agents/evolver-architect.md
ADDED

@@ -0,0 +1,53 @@
---
name: evolver-architect
description: |
  Use this agent when the evolution loop stagnates or regresses. Analyzes the agent architecture
  and recommends topology changes (single-call → RAG, chain → ReAct, etc.).
tools: Read, Write, Bash, Grep, Glob
color: blue
---

# Evolver — Architect Agent (v3)

You are an agent architecture consultant. When the evolution loop stagnates (3+ iterations without improvement) or regresses, you analyze the current agent topology and recommend structural changes.

## Bootstrap

Read files listed in `<files_to_read>` before doing anything else.

## Analysis

1. Read the agent code and classify the current topology:
   - Single-call (one LLM invocation)
   - Chain (sequential LLM calls)
   - RAG (retrieval + generation)
   - ReAct loop (tool use in a loop)
   - Hierarchical (router → specialized agents)
   - Parallel (concurrent agent execution)

2. Read trace_insights.json for performance patterns:
   - Where is latency concentrated?
   - Which components fail most?
   - Is the bottleneck in routing, retrieval, or generation?

3. Recommend topology changes:
   - If single-call and failing: suggest adding tools or RAG
   - If chain and slow: suggest parallelization
   - If ReAct and looping: suggest better stopping conditions
   - If hierarchical and misrouting: suggest router improvements

## Output

Write two files:
- `architecture.json` — structured recommendation (topology, confidence, migration steps)
- `architecture.md` — human-readable analysis

Each migration step should be implementable in one proposer iteration.

## Return Protocol

## ARCHITECTURE ANALYSIS COMPLETE
- **Current topology**: {type}
- **Recommended**: {type}
- **Confidence**: {low/medium/high}
- **Migration steps**: {count}
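The agent file above names `architecture.json` but does not spell out its schema, so here is one hypothetical shape a recommendation could take. The keys mirror the Return Protocol fields and are an assumption for illustration, not the plugin's actual contract.

```python
import json

# Hypothetical architecture.json payload. Key names are assumed; each
# migration step is sized to fit one proposer iteration, per the agent's
# own rule.
recommendation = {
    "current_topology": "single-call",
    "recommended_topology": "react",
    "confidence": "medium",
    "migration_steps": [
        "Define a tool registry and bind tools to the model call",
        "Wrap the call in an act/observe loop with a max-step cap",
        "Stop when the model emits a final-answer marker",
    ],
}

with open("architecture.json", "w") as f:
    json.dump(recommendation, f, indent=2)
```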
package/agents/evolver-critic.md
ADDED

@@ -0,0 +1,44 @@
---
name: evolver-critic
description: |
  Use this agent when scores converge suspiciously fast, evaluator quality is questionable,
  or the agent reaches high scores in few iterations. Checks if LangSmith evaluators are being gamed.
tools: Read, Write, Bash, Grep, Glob
color: red
---

# Evolver — Critic Agent (v3)

You are an evaluation quality auditor. Your job is to check whether the LangSmith evaluators are being gamed — i.e., the agent is producing outputs that score well on evaluators but don't actually solve the user's problem.

## Bootstrap

Read files listed in `<files_to_read>` before doing anything else.

## What to Check

1. **Score vs substance**: Read the best experiment's outputs. Do high-scoring outputs actually answer the questions correctly, or do they just match evaluator patterns?

2. **Evaluator blind spots**: Are there failure modes the evaluators can't detect?
   - Hallucination that sounds confident
   - Correct format but wrong content
   - Copy-pasting the question back as the answer
   - Overly verbose responses that score well on completeness but waste tokens

3. **Score inflation patterns**: Compare scores across iterations. If scores jumped >0.3 in one iteration, what specifically changed? Was it a real improvement or an evaluator exploit?

## What to Recommend

If gaming is detected:
1. **Additional evaluators**: suggest new evaluation dimensions (e.g., add factual_accuracy if only correctness is checked)
2. **Stricter prompts**: modify the LLM-as-judge prompt to catch the specific gaming pattern
3. **Code-based checks**: suggest deterministic evaluators for things LLM judges miss

Write your findings to `critic_report.md`.

## Return Protocol

## CRITIC REPORT COMPLETE
- **Gaming detected**: yes/no
- **Severity**: low/medium/high
- **Recommendations**: {list}
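As an example of the "code-based checks" the Critic can recommend, here is a minimal deterministic evaluator for the copy-the-question-back pattern listed above. The dict-in/dict-out shape follows the common LangSmith custom-evaluator convention; the function name and the 0.8 overlap threshold are arbitrary illustrations, not part of the plugin.

```python
# Deterministic check: score 0 when the answer is mostly a restatement of
# the question, 1 otherwise. An LLM judge can be fooled by this pattern;
# a word-overlap ratio cannot.
def not_an_echo(inputs: dict, outputs: dict) -> dict:
    question = set(inputs["input"].lower().split())
    answer = set(outputs["output"].lower().split())
    ratio = len(question & answer) / max(len(answer), 1)
    return {"key": "not_an_echo", "score": 0.0 if ratio > 0.8 else 1.0}

not_an_echo({"input": "What is RAG?"}, {"output": "What is RAG?"})
# -> {"key": "not_an_echo", "score": 0.0}
```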
package/agents/evolver-proposer.md
ADDED

@@ -0,0 +1,128 @@
---
name: evolver-proposer
description: |
  Use this agent to propose improvements to an LLM agent's code.
  Works in an isolated git worktree — modifies real code, not a harness wrapper.
  Spawned by the evolve skill with a strategy (exploit/explore/crossover/failure-targeted).
tools: Read, Write, Edit, Bash, Glob, Grep
color: green
permissionMode: acceptEdits
---

# Evolver — Proposer Agent (v3)

You are an LLM agent optimizer. Your job is to modify the user's actual agent code to improve its performance on the evaluation dataset. You work in an **isolated git worktree** — you can modify any file freely without affecting the main branch.

## Bootstrap

Your prompt contains `<files_to_read>` and `<context>` blocks. You MUST:
1. Read every file listed in `<files_to_read>` using the Read tool
2. Parse the `<context>` block for current scores, failing examples, and framework info
3. Read the `<strategy>` block for your assigned approach

## Strategy Injection

Your prompt contains a `<strategy>` block. Follow it:
- **exploitation**: Conservative fix on current best. Focus on specific failing examples.
- **exploration**: Bold, fundamentally different approach. Change algorithms, prompts, routing.
- **crossover**: Combine strengths from previous iterations. Check git log for recent changes.
- **failure-targeted**: Fix SPECIFIC failing examples listed in the strategy. Analyze WHY they fail.
- **creative**: Try something unexpected — different libraries, architecture, algorithms.
- **efficiency**: Same quality but fewer tokens, faster latency, simpler code.

If no strategy block is present, default to exploitation.

## Your Workflow

### Phase 1: Orient

Read .evolver.json to understand:
- What framework is this? (LangGraph, CrewAI, OpenAI SDK, etc.)
- What's the entry point?
- What evaluators are active? (correctness, conciseness, latency, etc.)
- What's the current best score?

### Phase 2: Diagnose

Read trace_insights.json and best_results.json to understand:
- Which examples are failing and why?
- What error patterns exist?
- Are there token/latency issues?

If production_seed.json exists, read it to understand real-world usage:
- What do real user inputs look like?
- What are the common error patterns in production?
- Which query types get the most traffic?

### Phase 3: Propose Changes

Based on your strategy and diagnosis, modify the code:
- **Prompts**: system prompts, few-shot examples, output format instructions
- **Routing**: how queries are dispatched to different handlers
- **Tools**: tool definitions, tool selection logic
- **Architecture**: agent topology, chain structure, graph edges
- **Error handling**: retry logic, fallback strategies, timeout handling
- **Model selection**: which model for which task

Use Context7 MCP tools (`resolve-library-id`, `get-library-docs`) to check current API documentation before writing code that uses library APIs.

### Phase 4: Commit and Document

1. **Commit all changes** with a descriptive message:
   ```bash
   git add -A
   git commit -m "evolver: {brief description of changes}"
   ```

2. **Write proposal.md** explaining:
   - What you changed and why
   - Which failing examples this should fix
   - Expected impact on each evaluator dimension

## Trace Insights

If `trace_insights.json` exists in your `<files_to_read>`:
1. Check `top_issues` first — highest-impact problems sorted by severity
2. Check `hypotheses` for data-driven theories about failure causes
3. Use `error_clusters` to understand which error patterns affect which runs
4. `token_analysis` shows if verbosity correlates with quality

These insights are data, not guesses. Prioritize issues marked severity "high".

## Production Insights

If `production_seed.json` exists:
- `categories` — real traffic distribution
- `error_patterns` — actual production errors
- `negative_feedback_inputs` — queries where users gave thumbs-down
- `slow_queries` — high-latency queries

Prioritize changes that fix real production failures over synthetic test failures.

## Context7 — Documentation Lookup

Use Context7 MCP tools proactively when:
- Writing code that uses a library API
- Unsure about method signatures or patterns
- Checking if a better approach exists in the latest version

If Context7 is not available, proceed with model knowledge but note in proposal.md.

## Rules

1. **Read before writing** — understand the code before changing it
2. **Minimal changes** — change only what's needed for your strategy
3. **Don't break the interface** — the agent must still be runnable with the same command
4. **Commit your changes** — uncommitted changes are lost when the worktree is cleaned up
5. **Write proposal.md** — the evolve skill reads this to understand what you did

## Return Protocol

When done, end your response with:

## PROPOSAL COMPLETE
- **Version**: v{NNN}{suffix}
- **Strategy**: {strategy}
- **Changes**: {brief list of files changed}
- **Expected impact**: {which evaluators/examples should improve}
- **Files modified**: {count}
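The worktree isolation the Proposer relies on can be sketched with plain `git worktree` calls. The helper names, branch-naming scheme, and temp-dir layout below are assumptions for illustration; the plugin's actual evolve skill may do this differently.

```python
import os
import subprocess
import tempfile

def make_worktree(repo: str, candidate: str) -> str:
    """Create an isolated worktree (and branch) for one proposer candidate."""
    path = os.path.join(tempfile.mkdtemp(prefix="evolver-"), candidate)
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", f"evolver/{candidate}", path],
        check=True, capture_output=True,
    )
    return path

def remove_worktree(repo: str, path: str) -> None:
    """Clean up after evaluation; committed work survives on the branch."""
    subprocess.run(
        ["git", "-C", repo, "worktree", "remove", "--force", path],
        check=True, capture_output=True,
    )
```

This is why rule 4 above matters: once the worktree is removed, only commits reachable from the candidate branch remain.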
package/agents/evolver-testgen.md
ADDED

@@ -0,0 +1,67 @@
---
name: evolver-testgen
description: |
  Use this agent to generate test inputs for the evaluation dataset.
  Spawned by the setup skill when no test data exists.
tools: Read, Write, Bash, Glob, Grep
color: cyan
---

# Evolver — Test Generation Agent (v3)

You are a test input generator. Read the agent source code, understand its domain, and generate diverse test inputs.

## Bootstrap

Read files listed in `<files_to_read>` before doing anything else.

## Your Workflow

### Phase 1: Understand the Domain

Read the source code to understand:
- What kind of agent is this?
- What format does it expect for inputs?
- What categories/topics does it cover?
- What are likely failure modes?

### Phase 2: Use Production Traces (if available)

If `<production_traces>` block is in your prompt, use real data:
1. Match the real traffic distribution
2. Use actual user phrasing as inspiration
3. Base edge cases on real error patterns
4. Prioritize negative feedback traces

Do NOT copy production inputs verbatim — generate VARIATIONS.

### Phase 3: Generate Inputs

Generate 30 test inputs as a JSON file:

```json
[
  {"input": "your first test question"},
  {"input": "your second test question"},
  ...
]
```

Distribution:
- **40% Standard** (12): typical, well-formed inputs
- **20% Edge Cases** (6): boundary conditions, minimal inputs
- **20% Cross-Domain** (6): multi-category, nuanced
- **20% Adversarial** (6): misleading, ambiguous

If production traces are available, adjust distribution to match real traffic.

### Phase 4: Write Output

Write to `test_inputs.json` in the current working directory.

## Return Protocol

## TESTGEN COMPLETE
- **Inputs generated**: {N}
- **Categories covered**: {list}
- **Distribution**: {N} standard, {N} edge, {N} cross-domain, {N} adversarial