harness-evolver 1.7.0 → 1.9.0

@@ -13,6 +13,16 @@ permissionMode: acceptEdits
13
13
  If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
14
14
  every file listed there before performing any other actions. These files are your context.
15
15
 
16
+ ## Strategy Injection
17
+
18
+ Your prompt may contain a `<strategy>` block defining your evolutionary role:
19
+ - **exploitation**: Make targeted, conservative fixes to the current best
20
+ - **exploration**: Try fundamentally different approaches, be bold
21
+ - **crossover**: Combine strengths from two parent versions
22
+
23
+ Follow the strategy. It determines your risk tolerance and parent selection.
24
+ If no strategy block is present, default to exploitation (conservative improvement).
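The defaulting rule above can be sketched in a few lines of Python. This is only an illustration, not part of the package: the function name `resolve_strategy` is hypothetical, the `APPROACH:` line is taken from the proposer prompts later in this diff, and falling back to exploitation for an unrecognized strategy name is an assumption.

```python
import re

# Strategy names listed in the "Strategy Injection" section.
VALID_STRATEGIES = {"exploitation", "exploration", "crossover"}


def resolve_strategy(prompt: str) -> str:
    """Return the strategy named in a <strategy> block, or 'exploitation'.

    Hypothetical helper: it reads the 'APPROACH: <name>' line that the
    proposer prompts in this diff place inside the block.
    """
    block = re.search(r"<strategy>(.*?)</strategy>", prompt, re.DOTALL)
    if block is None:
        # No strategy block present: default to conservative improvement.
        return "exploitation"
    approach = re.search(r"APPROACH:\s*(\w+)", block.group(1))
    name = approach.group(1).lower() if approach else ""
    # Assumption: an unknown or missing name is treated like a missing block.
    return name if name in VALID_STRATEGIES else "exploitation"
```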
25
+
16
26
  ## Context7 — Enrich Your Knowledge
17
27
 
18
28
  You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "harness-evolver",
3
- "version": "1.7.0",
3
+ "version": "1.9.0",
4
4
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
5
5
  "author": "Raphael Valdetaro",
6
6
  "license": "MIT",
@@ -36,97 +36,215 @@ python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); pri
36
36
 
37
37
  ### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
38
38
 
39
- **Run these commands unconditionally after EVERY evaluation** (including baseline). If langsmith-cli is not installed or there are no runs, the commands fail silently; that's fine. But you MUST attempt them.
39
+ **Run these commands unconditionally after EVERY evaluation** (including baseline). Do NOT guess project names; discover them.
40
+
41
+ **Step 1: Find the actual LangSmith project name**
40
42
 
41
43
  ```bash
42
- langsmith-cli --json runs list --project harness-evolver-{last_evaluated_version} --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
44
+ langsmith-cli --json projects list --name-pattern "harness-evolver*" --limit 10 2>/dev/null
45
+ ```
46
+
47
+ This returns all projects matching the prefix. Pick the most recently updated one, or the one matching the current version. Save the project name:
43
48
 
44
- langsmith-cli --json runs stats --project harness-evolver-{last_evaluated_version} > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
49
+ ```bash
50
+ LS_PROJECT=$(langsmith-cli --json projects list --name-pattern "harness-evolver*" --limit 1 2>/dev/null | python3 -c "import sys,json; data=json.load(sys.stdin); print(data[0]['name'] if data else '')" 2>/dev/null || echo "")
45
51
  ```
46
52
 
47
- For the first iteration, use `baseline` as the version. For subsequent iterations, use the latest evaluated version.
53
+ If `LS_PROJECT` is empty, langsmith-cli is not available or no projects exist; skip to step 2.
54
+
55
+ **Step 2: Gather traces from the discovered project**
56
+
57
+ ```bash
58
+ if [ -n "$LS_PROJECT" ]; then
59
+ langsmith-cli --json runs list --project "$LS_PROJECT" --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
60
+ langsmith-cli --json runs stats --project "$LS_PROJECT" > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
61
+ echo "$LS_PROJECT" > .harness-evolver/langsmith_project.txt
62
+ else
63
+ echo "[]" > .harness-evolver/langsmith_diagnosis.json
64
+ echo "{}" > .harness-evolver/langsmith_stats.json
65
+ fi
66
+ ```
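For readers who prefer Python to the shell one-liner, the discovery step might look roughly like the sketch below. It reuses exactly the `langsmith-cli` arguments shown in Step 1 and, like the inline `python3 -c` snippet, assumes the CLI prints a JSON list of objects with a `name` key. The function names are hypothetical.

```python
import json
import subprocess


def first_project_name(cli_stdout: str) -> str:
    """Return the first project's name from the CLI's JSON output, or ''."""
    try:
        data = json.loads(cli_stdout)
    except json.JSONDecodeError:
        return ""  # empty or non-JSON output: treat as "no project found"
    if isinstance(data, list) and data and isinstance(data[0], dict):
        return str(data[0].get("name", ""))
    return ""


def discover_project(pattern: str = "harness-evolver*") -> str:
    """Run the same command as Step 1; return '' if the CLI is unavailable."""
    cmd = ["langsmith-cli", "--json", "projects", "list",
           "--name-pattern", pattern, "--limit", "1"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return ""  # CLI not installed, not executable, or hung
    return first_project_name(result.stdout)
```

An empty return value corresponds to the `else` branch of Step 2, where the two placeholder files are written.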
48
67
 
49
68
  These files are included in the proposer's `<files_to_read>` so it has real trace data for diagnosis.
50
69
 
51
- ### 2. Propose
70
+ ### 2. Propose (3 parallel candidates)
52
71
 
53
- Dispatch a subagent using the **Agent tool**.
72
+ Spawn 3 proposer agents IN PARALLEL, each with a different evolutionary strategy.
73
+ This follows the DGM/AlphaEvolve pattern: exploit + explore + crossover.
54
74
 
55
- First, read the proposer agent definition to include in the prompt:
75
+ First, read the proposer agent definition:
56
76
  ```bash
57
77
  cat ~/.claude/agents/harness-evolver-proposer.md
58
78
  ```
59
79
 
60
- Then dispatch the Agent with the agent definition + structured task:
80
+ Then determine parents for each strategy:
81
+ - **Exploiter parent**: current best version (from summary.json `best.version`)
82
+ - **Explorer parent**: a non-best version with low offspring count (read summary.json history, pick one that scored >0 but is NOT the best and has NOT been parent to many children)
83
+ - **Crossover parents**: best version + a different high-scorer from a different lineage
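One way to read the explorer-parent rule is sketched below. The diff does not show the layout of `summary.json`, so the shape used here is an assumption: a `best.version` field (named in the text), plus a `history` list whose entries carry hypothetical `version`, `combined_score`, and `parent` keys.

```python
from collections import Counter
from typing import Optional


def pick_explorer_parent(summary: dict) -> Optional[str]:
    """Pick a non-best version that scored > 0 and has the fewest children.

    The 'history' entry keys are assumed for illustration only.
    """
    best = summary["best"]["version"]
    history = summary.get("history", [])
    # Count how many times each version was used as a parent.
    offspring = Counter(entry.get("parent") for entry in history)
    candidates = [
        entry for entry in history
        if entry["version"] != best and entry.get("combined_score", 0) > 0
    ]
    if not candidates:
        return None
    # Fewest children first; ties go to the higher score.
    chosen = min(
        candidates,
        key=lambda e: (offspring[e["version"]], -e.get("combined_score", 0)),
    )
    return chosen["version"]
```

Returning `None` when no candidate qualifies leaves the fallback decision to the caller, for example reusing the baseline.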
84
+
85
+ Spawn all 3 using the Agent tool. The first 2 use `run_in_background: true`, the 3rd blocks:
61
86
 
87
+ **Candidate A (Exploiter)** — `run_in_background: true`:
62
88
  ```
63
89
  Agent(
64
- description: "Propose harness {version}",
90
+ description: "Proposer A (exploit): targeted fix for {version}",
91
+ run_in_background: true,
65
92
  prompt: |
66
93
  <agent_instructions>
67
- {paste the FULL content of harness-evolver-proposer.md here}
94
+ {FULL content of harness-evolver-proposer.md}
68
95
  </agent_instructions>
69
96
 
97
+ <strategy>
98
+ APPROACH: exploitation
99
+ You are the EXPLOITER. Make the SMALLEST, most targeted change that fixes
100
+ the highest-impact failing tasks. Base your work on the current best version.
101
+ Do NOT restructure the code. Do NOT change the architecture.
102
+ Focus on: prompt tweaks, parameter tuning, fixing specific failure modes.
103
+ </strategy>
104
+
70
105
  <objective>
71
- Propose harness version {version} that improves on the current best score of {best_score}.
106
+ Propose harness version {version}a that improves on {best_score}.
72
107
  </objective>
73
108
 
74
109
  <files_to_read>
75
110
  - .harness-evolver/summary.json
76
111
  - .harness-evolver/PROPOSER_HISTORY.md
77
112
  - .harness-evolver/config.json
78
- - .harness-evolver/baseline/harness.py
79
113
  - .harness-evolver/harnesses/{best_version}/harness.py
80
114
  - .harness-evolver/harnesses/{best_version}/scores.json
81
115
  - .harness-evolver/harnesses/{best_version}/proposal.md
82
- - .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
83
- - .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
84
- - .harness-evolver/architecture.json (if exists — architect topology recommendation)
116
+ - .harness-evolver/langsmith_diagnosis.json (if exists)
117
+ - .harness-evolver/langsmith_stats.json (if exists)
118
+ - .harness-evolver/architecture.json (if exists)
85
119
  </files_to_read>
86
120
 
87
121
  <output>
88
- Create directory .harness-evolver/harnesses/{version}/ containing:
89
- - harness.py (the improved harness)
90
- - config.json (parameters, copy from parent if unchanged)
91
- - proposal.md (reasoning, must start with "Based on v{PARENT}")
122
+ Create directory .harness-evolver/harnesses/{version}a/ containing:
123
+ - harness.py, config.json, proposal.md
92
124
  </output>
125
+ )
126
+ ```
93
127
 
94
- <success_criteria>
95
- - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
96
- - proposal.md documents evidence-based reasoning
97
- - If proposing API changes, MUST use Context7 (resolve-library-id + get-library-docs) to verify current docs
98
- - Changes motivated by LangSmith trace data (in langsmith_diagnosis.json) when available
99
- </success_criteria>
128
+ **Candidate B (Explorer)** — `run_in_background: true`:
129
+ ```
130
+ Agent(
131
+ description: "Proposer B (explore): bold change from {explorer_parent}",
132
+ run_in_background: true,
133
+ prompt: |
134
+ <agent_instructions>
135
+ {FULL content of harness-evolver-proposer.md}
136
+ </agent_instructions>
137
+
138
+ <strategy>
139
+ APPROACH: exploration
140
+ You are the EXPLORER. Try a FUNDAMENTALLY DIFFERENT approach.
141
+ Base your work on {explorer_parent} (NOT the current best — intentionally diverging).
142
+ Consider: different retrieval strategy, different prompt structure,
143
+ different output parsing, different error handling philosophy.
144
+ Be bold. A creative failure teaches more than a timid success.
145
+ </strategy>
146
+
147
+ <objective>
148
+ Propose harness version {version}b that takes a different approach.
149
+ </objective>
150
+
151
+ <files_to_read>
152
+ - .harness-evolver/summary.json
153
+ - .harness-evolver/PROPOSER_HISTORY.md
154
+ - .harness-evolver/config.json
155
+ - .harness-evolver/baseline/harness.py
156
+ - .harness-evolver/harnesses/{explorer_parent}/harness.py
157
+ - .harness-evolver/harnesses/{explorer_parent}/scores.json
158
+ - .harness-evolver/langsmith_diagnosis.json (if exists)
159
+ - .harness-evolver/architecture.json (if exists)
160
+ </files_to_read>
161
+
162
+ <output>
163
+ Create directory .harness-evolver/harnesses/{version}b/ containing:
164
+ - harness.py, config.json, proposal.md
165
+ </output>
100
166
  )
101
167
  ```
102
168
 
103
- Wait for `## PROPOSAL COMPLETE` in the response.
169
+ **Candidate C (Crossover)**, blocking (spawned last):
170
+ ```
171
+ Agent(
172
+ description: "Proposer C (crossover): combine {parent_a} + {parent_b}",
173
+ prompt: |
174
+ <agent_instructions>
175
+ {FULL content of harness-evolver-proposer.md}
176
+ </agent_instructions>
177
+
178
+ <strategy>
179
+ APPROACH: crossover
180
+ You are the CROSSOVER agent. Combine the STRENGTHS of two different versions:
181
+ - {parent_a} (score: {score_a}): {summary of what it does well}
182
+ - {parent_b} (score: {score_b}): {summary of what it does well}
183
+ Take the best elements from each and merge them into a single harness.
184
+ </strategy>
185
+
186
+ <objective>
187
+ Propose harness version {version}c that combines the best of {parent_a} and {parent_b}.
188
+ </objective>
189
+
190
+ <files_to_read>
191
+ - .harness-evolver/summary.json
192
+ - .harness-evolver/PROPOSER_HISTORY.md
193
+ - .harness-evolver/config.json
194
+ - .harness-evolver/harnesses/{parent_a}/harness.py
195
+ - .harness-evolver/harnesses/{parent_a}/scores.json
196
+ - .harness-evolver/harnesses/{parent_b}/harness.py
197
+ - .harness-evolver/harnesses/{parent_b}/scores.json
198
+ - .harness-evolver/langsmith_diagnosis.json (if exists)
199
+ - .harness-evolver/architecture.json (if exists)
200
+ </files_to_read>
201
+
202
+ <output>
203
+ Create directory .harness-evolver/harnesses/{version}c/ containing:
204
+ - harness.py, config.json, proposal.md
205
+ </output>
206
+ )
207
+ ```
208
+
209
+ Wait for all 3 to complete. The background agents will notify when done.
210
+
211
+ **Special case — iteration 1**: Only the exploiter and explorer can run (no second parent for crossover yet). Spawn 2 agents: exploiter (from baseline) and explorer (also from baseline but with bold strategy). Skip crossover.
104
212
 
105
- ### 3. Validate
213
+ **Special case — iteration 2+**: All 3 strategies. Explorer parent = fitness-weighted random from history excluding current best.
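"Fitness-weighted random" can be sketched with `random.choices`, which draws in proportion to the supplied weights. As before, the `history` entry keys (`version`, `combined_score`) are assumed rather than taken from the package, and zero-score entries are dropped because `random.choices` needs a positive total weight.

```python
import random
from typing import Optional


def fitness_weighted_parent(history, best_version: str,
                            rng: Optional[random.Random] = None) -> Optional[str]:
    """Sample a parent with probability proportional to its score.

    Excludes the current best and any entry with a score of 0.
    """
    rng = rng or random.Random()
    pool = [
        e for e in history
        if e["version"] != best_version and e["combined_score"] > 0
    ]
    if not pool:
        return None
    weights = [e["combined_score"] for e in pool]
    return rng.choices(pool, weights=weights, k=1)[0]["version"]
```

Passing a seeded `random.Random` makes a run reproducible, which can help when comparing iterations.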
106
214
 
215
+ ### 3. Validate All Candidates
216
+
217
+ For each candidate (a, b, c):
107
218
  ```bash
108
- python3 $TOOLS/evaluate.py validate \
109
- --harness .harness-evolver/harnesses/{version}/harness.py \
110
- --config .harness-evolver/harnesses/{version}/config.json
219
+ python3 $TOOLS/evaluate.py validate --harness .harness-evolver/harnesses/{version}{suffix}/harness.py --config .harness-evolver/harnesses/{version}{suffix}/config.json
111
220
  ```
112
221
 
113
- If fails: one retry via proposer. If still fails: score 0.0, continue.
222
+ Remove any that fail validation.
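The filtering step could be scripted along these lines. The `validate` subcommand and its `--harness` and `--config` flags come from the command above. Treating a non-zero exit code as a validation failure is an assumption about `evaluate.py`, and both function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_validate_cmd(tools_dir, base, version, suffix):
    """Build the validate command for one candidate directory."""
    cand = Path(base) / "harnesses" / f"{version}{suffix}"
    return [
        "python3", str(Path(tools_dir) / "evaluate.py"), "validate",
        "--harness", str(cand / "harness.py"),
        "--config", str(cand / "config.json"),
    ]


def valid_suffixes(tools_dir, base, version,
                   suffixes=("a", "b", "c"), run=subprocess.run):
    """Return the suffixes whose validation command exits with status 0.

    `run` is injectable so the logic can be exercised without the real tool.
    """
    kept = []
    for suffix in suffixes:
        result = run(build_validate_cmd(tools_dir, base, version, suffix))
        if result.returncode == 0:  # assumption: non-zero means invalid
            kept.append(suffix)
    return kept
```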
114
223
 
115
- ### 4. Evaluate
224
+ ### 4. Evaluate All Candidates
116
225
 
226
+ For each valid candidate:
117
227
  ```bash
118
228
  python3 $TOOLS/evaluate.py run \
119
- --harness .harness-evolver/harnesses/{version}/harness.py \
120
- --config .harness-evolver/harnesses/{version}/config.json \
229
+ --harness .harness-evolver/harnesses/{version}{suffix}/harness.py \
230
+ --config .harness-evolver/harnesses/{version}{suffix}/config.json \
121
231
  --tasks-dir .harness-evolver/eval/tasks/ \
122
232
  --eval .harness-evolver/eval/eval.py \
123
- --traces-dir .harness-evolver/harnesses/{version}/traces/ \
124
- --scores .harness-evolver/harnesses/{version}/scores.json \
233
+ --traces-dir .harness-evolver/harnesses/{version}{suffix}/traces/ \
234
+ --scores .harness-evolver/harnesses/{version}{suffix}/scores.json \
125
235
  --timeout 60
126
236
  ```
127
237
 
128
- ### 5. Update State
238
+ ### 5. Select Winner + Update State
129
239
 
240
+ Compare scores of all evaluated candidates. The winner is the one with highest combined_score.
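A minimal sketch of the comparison follows, assuming each candidate's `scores.json` exposes a top-level `combined_score` number. The diff names that field but does not show the file layout, so the key location and both helper names are assumptions.

```python
import json
from pathlib import Path


def load_scores(base, version, suffixes=("a", "b", "c")):
    """Map suffix -> combined_score for candidates that have a scores.json."""
    scores = {}
    for suffix in suffixes:
        path = Path(base) / "harnesses" / f"{version}{suffix}" / "scores.json"
        if path.is_file():
            # Assumption: combined_score sits at the top level of the file.
            scores[suffix] = json.loads(path.read_text()).get("combined_score", 0.0)
    return scores


def pick_winner(scores):
    """Return the suffix with the highest score, or None if there are none.

    Ties go to the alphabetically first suffix so the result is deterministic.
    """
    if not scores:
        return None
    return max(sorted(scores), key=lambda suffix: scores[suffix])
```

The tie-break rule is a design choice of this sketch; the skill text does not say how ties should be resolved.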
241
+
242
+ Rename the winner directory to the official version name:
243
+ ```bash
244
+ mv .harness-evolver/harnesses/{version}{winning_suffix} .harness-evolver/harnesses/{version}
245
+ ```
246
+
247
+ Update state with the winner:
130
248
  ```bash
131
249
  python3 $TOOLS/state.py update \
132
250
  --base-dir .harness-evolver \
@@ -135,6 +253,17 @@ python3 $TOOLS/state.py update \
135
253
  --proposal .harness-evolver/harnesses/{version}/proposal.md
136
254
  ```
137
255
 
256
+ Report ALL candidates:
257
+ ```
258
+ Iteration {i}/{N} — 3 candidates evaluated:
259
+ {version}a (exploit): {score_a} — {1-line summary from proposal.md}
260
+ {version}b (explore): {score_b} — {1-line summary}
261
+ {version}c (cross): {score_c} — {1-line summary}
262
+ Winner: {version}{suffix} ({score}) ← promoted to {version}
263
+ ```
264
+
265
+ Keep losing candidates in their directories (they're part of the archive — never discard, per DGM).
266
+
138
267
  ### 6. Report
139
268
 
140
269
  Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
package/tools/evaluate.py CHANGED
@@ -118,12 +118,17 @@ def cmd_run(args):
118
118
  api_key = os.environ.get(ls.get("api_key_env", "LANGSMITH_API_KEY"), "")
119
119
  if api_key:
120
120
  version = os.path.basename(os.path.dirname(traces_dir))
121
+ ls_project = f"{ls.get('project_prefix', 'harness-evolver')}-{version}"
121
122
  langsmith_env = {
122
123
  **os.environ,
123
124
  "LANGCHAIN_TRACING_V2": "true",
124
125
  "LANGCHAIN_API_KEY": api_key,
125
- "LANGCHAIN_PROJECT": f"{ls.get('project_prefix', 'harness-evolver')}-{version}",
126
+ "LANGCHAIN_PROJECT": ls_project,
126
127
  }
128
+ # Write the project name so the evolve skill knows where to find traces
129
+ ls_project_file = os.path.join(os.path.dirname(os.path.dirname(traces_dir)), "langsmith_project.txt")
130
+ with open(ls_project_file, "w") as f:
131
+ f.write(ls_project)
127
132
 
128
133
  for task_file in task_files:
129
134
  task_path = os.path.join(tasks_dir, task_file)
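Because the new code derives both `version` and the project-file location from `traces_dir` with nested `dirname` calls, the result depends on whether the path ends in a slash. The sketch below reproduces only that path arithmetic, using `posixpath` so it behaves the same on every platform. Whether `evaluate.py` normalizes `traces_dir` before this point is not visible in the hunk, so treat this as an illustration of the path functions rather than a statement about the tool's actual behavior.

```python
import posixpath


def version_for(traces_dir: str) -> str:
    """Same expression as the diff: basename(dirname(traces_dir))."""
    return posixpath.basename(posixpath.dirname(traces_dir))


def project_file_for(traces_dir: str) -> str:
    """Same expression as the diff: dirname(dirname(traces_dir)) + file name."""
    parent = posixpath.dirname(posixpath.dirname(traces_dir))
    return posixpath.join(parent, "langsmith_project.txt")


# Without a trailing slash, the version is the directory above traces/.
no_slash = ".harness-evolver/harnesses/v3/traces"
# With a trailing slash (as written in the skill's --traces-dir argument),
# the first dirname() only strips the slash.
with_slash = ".harness-evolver/harnesses/v3/traces/"

for path in (no_slash, with_slash):
    print(path, "->", version_for(path), "|", project_file_for(path))
```

In both cases the computed file path differs from `.harness-evolver/langsmith_project.txt`, the location the skill's Step 2 writes to, so it may be worth confirming which location downstream readers expect.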