harness-evolver 1.7.0 → 1.9.0
- package/agents/harness-evolver-proposer.md +10 -0
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +166 -37
- package/tools/evaluate.py +6 -1
package/agents/harness-evolver-proposer.md
CHANGED
@@ -13,6 +13,16 @@ permissionMode: acceptEdits
 If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
 every file listed there before performing any other actions. These files are your context.
 
+## Strategy Injection
+
+Your prompt may contain a `<strategy>` block defining your evolutionary role:
+- **exploitation**: Make targeted, conservative fixes to the current best
+- **exploration**: Try fundamentally different approaches, be bold
+- **crossover**: Combine strengths from two parent versions
+
+Follow the strategy. It determines your risk tolerance and parent selection.
+If no strategy block is present, default to exploitation (conservative improvement).
+
 ## Context7 — Enrich Your Knowledge
 
 You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
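The added section describes how a proposer should interpret an optional `<strategy>` block. The proposer is an agent following written instructions, so the package contains no parser for this; the following is only a minimal sketch of the same rule in code. The helper name `detect_strategy` is hypothetical, and the `APPROACH:` line format is taken from the prompts added to `SKILL.md` in this release.

```python
import re

# The three roles named in the "Strategy Injection" section.
VALID_STRATEGIES = {"exploitation", "exploration", "crossover"}


def detect_strategy(prompt: str) -> str:
    """Return the strategy named in a <strategy> block.

    Falls back to "exploitation" when the block is missing or names an
    unknown approach, mirroring the documented default.
    """
    block = re.search(r"<strategy>(.*?)</strategy>", prompt, flags=re.DOTALL)
    if block is None:
        return "exploitation"
    approach = re.search(r"APPROACH:\s*(\w+)", block.group(1))
    if approach and approach.group(1).lower() in VALID_STRATEGIES:
        return approach.group(1).lower()
    return "exploitation"


example_prompt = "<strategy>\nAPPROACH: exploration\nBe bold.\n</strategy>"
print(detect_strategy(example_prompt))       # exploration
print(detect_strategy("no strategy block"))  # exploitation
```

Treating an unrecognized approach the same as a missing block is a design choice of this sketch; the source only specifies the default for the missing-block case.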
package/package.json
CHANGED
package/skills/evolve/SKILL.md
CHANGED
@@ -36,97 +36,215 @@ python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); pri
 
 ### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
 
-**Run these commands unconditionally after EVERY evaluation** (including baseline).
+**Run these commands unconditionally after EVERY evaluation** (including baseline). Do NOT guess project names — discover them.
+
+**Step 1: Find the actual LangSmith project name**
 
 ```bash
-langsmith-cli --json
+langsmith-cli --json projects list --name-pattern "harness-evolver*" --limit 10 2>/dev/null
+```
+
+This returns all projects matching the prefix. Pick the most recently updated one, or the one matching the current version. Save the project name:
 
-
+```bash
+LS_PROJECT=$(langsmith-cli --json projects list --name-pattern "harness-evolver*" --limit 1 2>/dev/null | python3 -c "import sys,json; data=json.load(sys.stdin); print(data[0]['name'] if data else '')" 2>/dev/null || echo "")
 ```
 
-
+If `LS_PROJECT` is empty, langsmith-cli is not available or no projects exist — skip to step 2.
+
+**Step 2: Gather traces from the discovered project**
+
+```bash
+if [ -n "$LS_PROJECT" ]; then
+langsmith-cli --json runs list --project "$LS_PROJECT" --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
+langsmith-cli --json runs stats --project "$LS_PROJECT" > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
+echo "$LS_PROJECT" > .harness-evolver/langsmith_project.txt
+else
+echo "[]" > .harness-evolver/langsmith_diagnosis.json
+echo "{}" > .harness-evolver/langsmith_stats.json
+fi
+```
 
 These files are included in the proposer's `<files_to_read>` so it has real trace data for diagnosis.
 
-### 2. Propose
+### 2. Propose (3 parallel candidates)
 
-
+Spawn 3 proposer agents IN PARALLEL, each with a different evolutionary strategy.
+This follows the DGM/AlphaEvolve pattern: exploit + explore + crossover.
 
-First, read the proposer agent definition
+First, read the proposer agent definition:
 ```bash
 cat ~/.claude/agents/harness-evolver-proposer.md
 ```
 
-Then
+Then determine parents for each strategy:
+- **Exploiter parent**: current best version (from summary.json `best.version`)
+- **Explorer parent**: a non-best version with low offspring count (read summary.json history, pick one that scored >0 but is NOT the best and has NOT been parent to many children)
+- **Crossover parents**: best version + a different high-scorer from a different lineage
+
+Spawn all 3 using the Agent tool. The first 2 use `run_in_background: true`, the 3rd blocks:
 
+**Candidate A (Exploiter)** — `run_in_background: true`:
 ```
 Agent(
-description: "
+description: "Proposer A (exploit): targeted fix for {version}",
+run_in_background: true,
 prompt: |
 <agent_instructions>
-{
+{FULL content of harness-evolver-proposer.md}
 </agent_instructions>
 
+<strategy>
+APPROACH: exploitation
+You are the EXPLOITER. Make the SMALLEST, most targeted change that fixes
+the highest-impact failing tasks. Base your work on the current best version.
+Do NOT restructure the code. Do NOT change the architecture.
+Focus on: prompt tweaks, parameter tuning, fixing specific failure modes.
+</strategy>
+
 <objective>
-Propose harness version {version} that improves on
+Propose harness version {version}a that improves on {best_score}.
 </objective>
 
 <files_to_read>
 - .harness-evolver/summary.json
 - .harness-evolver/PROPOSER_HISTORY.md
 - .harness-evolver/config.json
-- .harness-evolver/baseline/harness.py
 - .harness-evolver/harnesses/{best_version}/harness.py
 - .harness-evolver/harnesses/{best_version}/scores.json
 - .harness-evolver/harnesses/{best_version}/proposal.md
-- .harness-evolver/langsmith_diagnosis.json (if exists
-- .harness-evolver/langsmith_stats.json (if exists
-- .harness-evolver/architecture.json (if exists
+- .harness-evolver/langsmith_diagnosis.json (if exists)
+- .harness-evolver/langsmith_stats.json (if exists)
+- .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
 <output>
-Create directory .harness-evolver/harnesses/{version}/ containing:
-- harness.py
-- config.json (parameters, copy from parent if unchanged)
-- proposal.md (reasoning, must start with "Based on v{PARENT}")
+Create directory .harness-evolver/harnesses/{version}a/ containing:
+- harness.py, config.json, proposal.md
 </output>
+)
+```
 
-
-
-
-
-
-
+**Candidate B (Explorer)** — `run_in_background: true`:
+```
+Agent(
+description: "Proposer B (explore): bold change from {explorer_parent}",
+run_in_background: true,
+prompt: |
+<agent_instructions>
+{FULL content of harness-evolver-proposer.md}
+</agent_instructions>
+
+<strategy>
+APPROACH: exploration
+You are the EXPLORER. Try a FUNDAMENTALLY DIFFERENT approach.
+Base your work on {explorer_parent} (NOT the current best — intentionally diverging).
+Consider: different retrieval strategy, different prompt structure,
+different output parsing, different error handling philosophy.
+Be bold. A creative failure teaches more than a timid success.
+</strategy>
+
+<objective>
+Propose harness version {version}b that takes a different approach.
+</objective>
+
+<files_to_read>
+- .harness-evolver/summary.json
+- .harness-evolver/PROPOSER_HISTORY.md
+- .harness-evolver/config.json
+- .harness-evolver/baseline/harness.py
+- .harness-evolver/harnesses/{explorer_parent}/harness.py
+- .harness-evolver/harnesses/{explorer_parent}/scores.json
+- .harness-evolver/langsmith_diagnosis.json (if exists)
+- .harness-evolver/architecture.json (if exists)
+</files_to_read>
+
+<output>
+Create directory .harness-evolver/harnesses/{version}b/ containing:
+- harness.py, config.json, proposal.md
+</output>
 )
 ```
 
-
+**Candidate C (Crossover)** — blocks (last one):
+```
+Agent(
+description: "Proposer C (crossover): combine {parent_a} + {parent_b}",
+prompt: |
+<agent_instructions>
+{FULL content of harness-evolver-proposer.md}
+</agent_instructions>
+
+<strategy>
+APPROACH: crossover
+You are the CROSSOVER agent. Combine the STRENGTHS of two different versions:
+- {parent_a} (score: {score_a}): {summary of what it does well}
+- {parent_b} (score: {score_b}): {summary of what it does well}
+Take the best elements from each and merge them into a single harness.
+</strategy>
+
+<objective>
+Propose harness version {version}c that combines the best of {parent_a} and {parent_b}.
+</objective>
+
+<files_to_read>
+- .harness-evolver/summary.json
+- .harness-evolver/PROPOSER_HISTORY.md
+- .harness-evolver/config.json
+- .harness-evolver/harnesses/{parent_a}/harness.py
+- .harness-evolver/harnesses/{parent_a}/scores.json
+- .harness-evolver/harnesses/{parent_b}/harness.py
+- .harness-evolver/harnesses/{parent_b}/scores.json
+- .harness-evolver/langsmith_diagnosis.json (if exists)
+- .harness-evolver/architecture.json (if exists)
+</files_to_read>
+
+<output>
+Create directory .harness-evolver/harnesses/{version}c/ containing:
+- harness.py, config.json, proposal.md
+</output>
+)
+```
+
+Wait for all 3 to complete. The background agents will notify when done.
+
+**Special case — iteration 1**: Only the exploiter and explorer can run (no second parent for crossover yet). Spawn 2 agents: exploiter (from baseline) and explorer (also from baseline but with bold strategy). Skip crossover.
 
-
+**Special case — iteration 2+**: All 3 strategies. Explorer parent = fitness-weighted random from history excluding current best.
 
+### 3. Validate All Candidates
+
+For each candidate (a, b, c):
 ```bash
-python3 $TOOLS/evaluate.py validate
---harness .harness-evolver/harnesses/{version}/harness.py \
---config .harness-evolver/harnesses/{version}/config.json
+python3 $TOOLS/evaluate.py validate --harness .harness-evolver/harnesses/{version}{suffix}/harness.py --config .harness-evolver/harnesses/{version}{suffix}/config.json
 ```
 
-
+Remove any that fail validation.
 
-### 4. Evaluate
+### 4. Evaluate All Candidates
 
+For each valid candidate:
 ```bash
 python3 $TOOLS/evaluate.py run \
---harness .harness-evolver/harnesses/{version}/harness.py \
---config .harness-evolver/harnesses/{version}/config.json \
+--harness .harness-evolver/harnesses/{version}{suffix}/harness.py \
+--config .harness-evolver/harnesses/{version}{suffix}/config.json \
 --tasks-dir .harness-evolver/eval/tasks/ \
 --eval .harness-evolver/eval/eval.py \
---traces-dir .harness-evolver/harnesses/{version}/traces/ \
---scores .harness-evolver/harnesses/{version}/scores.json \
+--traces-dir .harness-evolver/harnesses/{version}{suffix}/traces/ \
+--scores .harness-evolver/harnesses/{version}{suffix}/scores.json \
 --timeout 60
 ```
 
-### 5. Update State
+### 5. Select Winner + Update State
 
+Compare scores of all evaluated candidates. The winner is the one with highest combined_score.
+
+Rename the winner directory to the official version name:
+```bash
+mv .harness-evolver/harnesses/{version}{winning_suffix} .harness-evolver/harnesses/{version}
+```
+
+Update state with the winner:
 ```bash
 python3 $TOOLS/state.py update \
 --base-dir .harness-evolver \
@@ -135,6 +253,17 @@ python3 $TOOLS/state.py update \
 --proposal .harness-evolver/harnesses/{version}/proposal.md
 ```
 
+Report ALL candidates:
+```
+Iteration {i}/{N} — 3 candidates evaluated:
+{version}a (exploit): {score_a} — {1-line summary from proposal.md}
+{version}b (explore): {score_b} — {1-line summary}
+{version}c (cross): {score_c} — {1-line summary}
+Winner: {version}{suffix} ({score}) ← promoted to {version}
+```
+
+Keep losing candidates in their directories (they're part of the archive — never discard, per DGM).
+
 ### 6. Report
 
 Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
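The new `SKILL.md` text describes two selection rules in prose: the explorer parent is a fitness-weighted random pick from history excluding the current best, and the winner is the candidate with the highest `combined_score`. The skill leaves both to the orchestrating agent, so the following is only a rough sketch of what they could look like in code. The record fields `version` and `score`, the sample values, and both helper names are assumptions made for illustration, not the actual `summary.json` schema.

```python
import random


def pick_explorer_parent(history, best_version, rng=random):
    """Fitness-weighted random choice among scored, non-best versions."""
    pool = [h for h in history if h["version"] != best_version and h["score"] > 0]
    if not pool:
        return None  # nothing eligible; the caller would fall back to the baseline
    weights = [h["score"] for h in pool]
    return rng.choices(pool, weights=weights, k=1)[0]["version"]


def pick_winner(scores_by_suffix):
    """Return the suffix with the highest combined_score, or None if empty."""
    if not scores_by_suffix:
        return None
    return max(scores_by_suffix, key=scores_by_suffix.get)


# Hypothetical history records and candidate scores.
history = [
    {"version": "v001", "score": 0.40},
    {"version": "v002", "score": 0.70},  # current best, excluded
    {"version": "v003", "score": 0.0},   # excluded: did not score above 0
]
explorer_parent = pick_explorer_parent(history, "v002", rng=random.Random(0))
winner = pick_winner({"a": 0.72, "b": 0.55, "c": 0.81})
print(explorer_parent, winner)  # v001 c
```

Candidates that failed validation would simply be left out of the mapping passed to `pick_winner`. On a tie, `max` keeps the first suffix encountered, which is a property of this sketch and not something the source specifies.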
package/tools/evaluate.py
CHANGED
@@ -118,12 +118,17 @@ def cmd_run(args):
     api_key = os.environ.get(ls.get("api_key_env", "LANGSMITH_API_KEY"), "")
     if api_key:
         version = os.path.basename(os.path.dirname(traces_dir))
+        ls_project = f"{ls.get('project_prefix', 'harness-evolver')}-{version}"
         langsmith_env = {
             **os.environ,
             "LANGCHAIN_TRACING_V2": "true",
             "LANGCHAIN_API_KEY": api_key,
-            "LANGCHAIN_PROJECT":
+            "LANGCHAIN_PROJECT": ls_project,
         }
+        # Write the project name so the evolve skill knows where to find traces
+        ls_project_file = os.path.join(os.path.dirname(os.path.dirname(traces_dir)), "langsmith_project.txt")
+        with open(ls_project_file, "w") as f:
+            f.write(ls_project)
 
     for task_file in task_files:
         task_path = os.path.join(tasks_dir, task_file)
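The added lines derive both the LangSmith project name and the location of `langsmith_project.txt` from `traces_dir` using `os.path.dirname` and `os.path.basename`. As a minimal sketch, the following shows how those calls resolve for a path written with and without a trailing slash. The `v003a` directory and the helper `derive_names` are hypothetical. The sketch also assumes `traces_dir` reaches this code without prior normalization, which the hunk does not show.

```python
import posixpath  # POSIX-style paths, matching the commands in SKILL.md


def derive_names(traces_dir: str, project_prefix: str = "harness-evolver"):
    """Mirror the path arithmetic from the evaluate.py hunk (illustrative helper)."""
    version = posixpath.basename(posixpath.dirname(traces_dir))
    ls_project = f"{project_prefix}-{version}"
    ls_project_file = posixpath.join(
        posixpath.dirname(posixpath.dirname(traces_dir)), "langsmith_project.txt"
    )
    return ls_project, ls_project_file


# Hypothetical candidate directory, without and with a trailing slash.
no_slash = derive_names(".harness-evolver/harnesses/v003a/traces")
with_slash = derive_names(".harness-evolver/harnesses/v003a/traces/")

print(no_slash)
# ('harness-evolver-v003a', '.harness-evolver/harnesses/langsmith_project.txt')
print(with_slash)
# ('harness-evolver-traces', '.harness-evolver/harnesses/v003a/langsmith_project.txt')
```

With a trailing slash, `dirname` only strips the empty final component, so the derived version becomes `traces` and the file lands one directory deeper. Applying `os.path.normpath` to `traces_dir` first would make both spellings resolve identically; whether `evaluate.py` already does so lies outside this hunk.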