harness-evolver 1.7.0 → 1.8.0

@@ -13,6 +13,16 @@ permissionMode: acceptEdits
  If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
  every file listed there before performing any other actions. These files are your context.
 
+ ## Strategy Injection
+
+ Your prompt may contain a `<strategy>` block defining your evolutionary role:
+ - **exploitation**: Make targeted, conservative fixes to the current best
+ - **exploration**: Try fundamentally different approaches, be bold
+ - **crossover**: Combine strengths from two parent versions
+
+ Follow the strategy. It determines your risk tolerance and parent selection.
+ If no strategy block is present, default to exploitation (conservative improvement).
+
  ## Context7 — Enrich Your Knowledge
 
  You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
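Note on the hunk above: the added "Strategy Injection" section is addressed to an LLM agent, but the fallback rule it states (no `<strategy>` block means exploitation) can be illustrated in code. The following is a minimal sketch only, not code from the package. The function name `resolve_strategy` is hypothetical, the `APPROACH:` line format is taken from the prompt templates later in this diff, and treating an unrecognized approach as exploitation is an assumption.

```python
import re

# The three roles named in the added section.
VALID_STRATEGIES = {"exploitation", "exploration", "crossover"}
DEFAULT_STRATEGY = "exploitation"


def resolve_strategy(prompt: str) -> str:
    """Return the strategy named in a <strategy> block, or the default.

    Hypothetical helper: falls back to exploitation when the block is
    missing, has no APPROACH line, or names an unknown approach
    (the last two fallbacks are assumptions, not stated in the diff).
    """
    block = re.search(r"<strategy>(.*?)</strategy>", prompt, re.DOTALL)
    if block is None:
        return DEFAULT_STRATEGY
    approach = re.search(
        r"^\s*APPROACH:\s*(\w+)\s*$", block.group(1), re.MULTILINE
    )
    if approach is None:
        return DEFAULT_STRATEGY
    name = approach.group(1).lower()
    return name if name in VALID_STRATEGIES else DEFAULT_STRATEGY
```

A prompt without the block resolves to `exploitation`, which matches the default stated in the added text.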
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "1.7.0",
+ "version": "1.8.0",
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
  "author": "Raphael Valdetaro",
  "license": "MIT",
@@ -48,85 +48,184 @@ For the first iteration, use `baseline` as the version. For subsequent iteration
 
  These files are included in the proposer's `<files_to_read>` so it has real trace data for diagnosis.
 
- ### 2. Propose
+ ### 2. Propose (3 parallel candidates)
 
- Dispatch a subagent using the **Agent tool**.
+ Spawn 3 proposer agents IN PARALLEL, each with a different evolutionary strategy.
+ This follows the DGM/AlphaEvolve pattern: exploit + explore + crossover.
 
- First, read the proposer agent definition to include in the prompt:
+ First, read the proposer agent definition:
  ```bash
  cat ~/.claude/agents/harness-evolver-proposer.md
  ```
 
- Then dispatch the Agent with the agent definition + structured task:
+ Then determine parents for each strategy:
+ - **Exploiter parent**: current best version (from summary.json `best.version`)
+ - **Explorer parent**: a non-best version with low offspring count (read summary.json history, pick one that scored >0 but is NOT the best and has NOT been parent to many children)
+ - **Crossover parents**: best version + a different high-scorer from a different lineage
 
+ Spawn all 3 using the Agent tool. The first 2 use `run_in_background: true`, the 3rd blocks:
+
+ **Candidate A (Exploiter)** — `run_in_background: true`:
  ```
  Agent(
- description: "Propose harness {version}",
+ description: "Proposer A (exploit): targeted fix for {version}",
+ run_in_background: true,
  prompt: |
  <agent_instructions>
- {paste the FULL content of harness-evolver-proposer.md here}
+ {FULL content of harness-evolver-proposer.md}
  </agent_instructions>
 
+ <strategy>
+ APPROACH: exploitation
+ You are the EXPLOITER. Make the SMALLEST, most targeted change that fixes
+ the highest-impact failing tasks. Base your work on the current best version.
+ Do NOT restructure the code. Do NOT change the architecture.
+ Focus on: prompt tweaks, parameter tuning, fixing specific failure modes.
+ </strategy>
+
  <objective>
- Propose harness version {version} that improves on the current best score of {best_score}.
+ Propose harness version {version}a that improves on {best_score}.
  </objective>
 
  <files_to_read>
  - .harness-evolver/summary.json
  - .harness-evolver/PROPOSER_HISTORY.md
  - .harness-evolver/config.json
- - .harness-evolver/baseline/harness.py
  - .harness-evolver/harnesses/{best_version}/harness.py
  - .harness-evolver/harnesses/{best_version}/scores.json
  - .harness-evolver/harnesses/{best_version}/proposal.md
- - .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
- - .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
- - .harness-evolver/architecture.json (if exists — architect topology recommendation)
+ - .harness-evolver/langsmith_diagnosis.json (if exists)
+ - .harness-evolver/langsmith_stats.json (if exists)
+ - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
  <output>
- Create directory .harness-evolver/harnesses/{version}/ containing:
- - harness.py (the improved harness)
- - config.json (parameters, copy from parent if unchanged)
- - proposal.md (reasoning, must start with "Based on v{PARENT}")
+ Create directory .harness-evolver/harnesses/{version}a/ containing:
+ - harness.py, config.json, proposal.md
  </output>
+ )
+ ```
 
- <success_criteria>
- - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
- - proposal.md documents evidence-based reasoning
- - If proposing API changes, MUST use Context7 (resolve-library-id + get-library-docs) to verify current docs
- - Changes motivated by LangSmith trace data (in langsmith_diagnosis.json) when available
- </success_criteria>
+ **Candidate B (Explorer)** — `run_in_background: true`:
+ ```
+ Agent(
+ description: "Proposer B (explore): bold change from {explorer_parent}",
+ run_in_background: true,
+ prompt: |
+ <agent_instructions>
+ {FULL content of harness-evolver-proposer.md}
+ </agent_instructions>
+
+ <strategy>
+ APPROACH: exploration
+ You are the EXPLORER. Try a FUNDAMENTALLY DIFFERENT approach.
+ Base your work on {explorer_parent} (NOT the current best — intentionally diverging).
+ Consider: different retrieval strategy, different prompt structure,
+ different output parsing, different error handling philosophy.
+ Be bold. A creative failure teaches more than a timid success.
+ </strategy>
+
+ <objective>
+ Propose harness version {version}b that takes a different approach.
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/summary.json
+ - .harness-evolver/PROPOSER_HISTORY.md
+ - .harness-evolver/config.json
+ - .harness-evolver/baseline/harness.py
+ - .harness-evolver/harnesses/{explorer_parent}/harness.py
+ - .harness-evolver/harnesses/{explorer_parent}/scores.json
+ - .harness-evolver/langsmith_diagnosis.json (if exists)
+ - .harness-evolver/architecture.json (if exists)
+ </files_to_read>
+
+ <output>
+ Create directory .harness-evolver/harnesses/{version}b/ containing:
+ - harness.py, config.json, proposal.md
+ </output>
  )
  ```
 
- Wait for `## PROPOSAL COMPLETE` in the response.
+ **Candidate C (Crossover)** blocks (last one):
+ ```
+ Agent(
+ description: "Proposer C (crossover): combine {parent_a} + {parent_b}",
+ prompt: |
+ <agent_instructions>
+ {FULL content of harness-evolver-proposer.md}
+ </agent_instructions>
+
+ <strategy>
+ APPROACH: crossover
+ You are the CROSSOVER agent. Combine the STRENGTHS of two different versions:
+ - {parent_a} (score: {score_a}): {summary of what it does well}
+ - {parent_b} (score: {score_b}): {summary of what it does well}
+ Take the best elements from each and merge them into a single harness.
+ </strategy>
+
+ <objective>
+ Propose harness version {version}c that combines the best of {parent_a} and {parent_b}.
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/summary.json
+ - .harness-evolver/PROPOSER_HISTORY.md
+ - .harness-evolver/config.json
+ - .harness-evolver/harnesses/{parent_a}/harness.py
+ - .harness-evolver/harnesses/{parent_a}/scores.json
+ - .harness-evolver/harnesses/{parent_b}/harness.py
+ - .harness-evolver/harnesses/{parent_b}/scores.json
+ - .harness-evolver/langsmith_diagnosis.json (if exists)
+ - .harness-evolver/architecture.json (if exists)
+ </files_to_read>
+
+ <output>
+ Create directory .harness-evolver/harnesses/{version}c/ containing:
+ - harness.py, config.json, proposal.md
+ </output>
+ )
+ ```
+
+ Wait for all 3 to complete. The background agents will notify when done.
+
+ **Special case — iteration 1**: Only the exploiter and explorer can run (no second parent for crossover yet). Spawn 2 agents: exploiter (from baseline) and explorer (also from baseline but with bold strategy). Skip crossover.
+
+ **Special case — iteration 2+**: All 3 strategies. Explorer parent = fitness-weighted random from history excluding current best.
 
- ### 3. Validate
+ ### 3. Validate All Candidates
 
+ For each candidate (a, b, c):
  ```bash
- python3 $TOOLS/evaluate.py validate \
- --harness .harness-evolver/harnesses/{version}/harness.py \
- --config .harness-evolver/harnesses/{version}/config.json
+ python3 $TOOLS/evaluate.py validate --harness .harness-evolver/harnesses/{version}{suffix}/harness.py --config .harness-evolver/harnesses/{version}{suffix}/config.json
  ```
 
- If fails: one retry via proposer. If still fails: score 0.0, continue.
+ Remove any that fail validation.
 
- ### 4. Evaluate
+ ### 4. Evaluate All Candidates
 
+ For each valid candidate:
  ```bash
  python3 $TOOLS/evaluate.py run \
- --harness .harness-evolver/harnesses/{version}/harness.py \
- --config .harness-evolver/harnesses/{version}/config.json \
+ --harness .harness-evolver/harnesses/{version}{suffix}/harness.py \
+ --config .harness-evolver/harnesses/{version}{suffix}/config.json \
  --tasks-dir .harness-evolver/eval/tasks/ \
  --eval .harness-evolver/eval/eval.py \
- --traces-dir .harness-evolver/harnesses/{version}/traces/ \
- --scores .harness-evolver/harnesses/{version}/scores.json \
+ --traces-dir .harness-evolver/harnesses/{version}{suffix}/traces/ \
+ --scores .harness-evolver/harnesses/{version}{suffix}/scores.json \
  --timeout 60
  ```
 
- ### 5. Update State
+ ### 5. Select Winner + Update State
+
+ Compare scores of all evaluated candidates. The winner is the one with highest combined_score.
 
+ Rename the winner directory to the official version name:
+ ```bash
+ mv .harness-evolver/harnesses/{version}{winning_suffix} .harness-evolver/harnesses/{version}
+ ```
+
+ Update state with the winner:
  ```bash
  python3 $TOOLS/state.py update \
  --base-dir .harness-evolver \
@@ -135,6 +234,17 @@ python3 $TOOLS/state.py update \
  --proposal .harness-evolver/harnesses/{version}/proposal.md
  ```
 
+ Report ALL candidates:
+ ```
+ Iteration {i}/{N} — 3 candidates evaluated:
+ {version}a (exploit): {score_a} — {1-line summary from proposal.md}
+ {version}b (explore): {score_b} — {1-line summary}
+ {version}c (cross): {score_c} — {1-line summary}
+ Winner: {version}{suffix} ({score}) ← promoted to {version}
+ ```
+
+ Keep losing candidates in their directories (they're part of the archive — never discard, per DGM).
+
  ### 6. Report
 
  Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
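
Note on step 5 of the diff above: the skill describes winner selection in prose (highest `combined_score`, then rename the winning directory) and leaves the mechanics to the orchestrating agent. The sketch below shows one way those two actions could look in Python. It is illustrative only: `select_winner` and `promote` are hypothetical names, and it assumes each candidate's `scores.json` has a top-level numeric `combined_score` key, which the diff does not confirm. Candidates without a `scores.json` (for example, ones removed after failed validation, or crossover in iteration 1) are skipped.

```python
import json
import shutil
from pathlib import Path


def select_winner(harness_root: Path, version: str, suffixes=("a", "b", "c")):
    """Return (suffix, score) of the best-scoring candidate, or None.

    Assumes scores.json holds a top-level "combined_score" number
    (an assumption; the real file layout is not shown in the diff).
    """
    scored = []
    for suffix in suffixes:
        scores_file = harness_root / f"{version}{suffix}" / "scores.json"
        if not scores_file.is_file():
            continue  # candidate was never evaluated
        data = json.loads(scores_file.read_text())
        scored.append((float(data["combined_score"]), suffix))
    if not scored:
        return None
    # On a tie, max() keeps the earliest suffix in `suffixes`.
    best_score, best_suffix = max(scored, key=lambda pair: pair[0])
    return best_suffix, best_score


def promote(harness_root: Path, version: str, suffix: str) -> Path:
    """Rename the winning candidate directory to the official version name."""
    target = harness_root / version
    shutil.move(str(harness_root / f"{version}{suffix}"), str(target))
    return target
```

Only the winner is moved; losing candidate directories stay in place, in line with the "never discard" rule stated in the added text.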