harness-evolver 1.6.0 → 1.8.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -13,6 +13,16 @@ permissionMode: acceptEdits
  If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
  every file listed there before performing any other actions. These files are your context.
 
+ ## Strategy Injection
+
+ Your prompt may contain a `<strategy>` block defining your evolutionary role:
+ - **exploitation**: Make targeted, conservative fixes to the current best
+ - **exploration**: Try fundamentally different approaches, be bold
+ - **crossover**: Combine strengths from two parent versions
+
+ Follow the strategy. It determines your risk tolerance and parent selection.
+ If no strategy block is present, default to exploitation (conservative improvement).
+
  ## Context7 — Enrich Your Knowledge
 
  You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "1.6.0",
+ "version": "1.8.0",
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
  "author": "Raphael Valdetaro",
  "license": "MIT",
@@ -48,13 +48,21 @@ python3 $TOOLS/analyze_architecture.py \
  -o .harness-evolver/architecture_signals.json
  ```
 
- 3. Dispatch subagent using the **Agent tool** with `subagent_type: "harness-evolver-architect"`:
+ 3. Read the architect agent definition:
+ ```bash
+ cat ~/.claude/agents/harness-evolver-architect.md
+ ```
+
+ 4. Dispatch using the Agent tool — include the agent definition in the prompt:
 
  ```
  Agent(
- subagent_type: "harness-evolver-architect",
  description: "Architect: topology analysis",
  prompt: |
+ <agent_instructions>
+ {paste the FULL content of harness-evolver-architect.md here}
+ </agent_instructions>
+
  <objective>
  Analyze the harness architecture and recommend the optimal multi-agent topology.
  {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
@@ -22,13 +22,21 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo
 
  1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).
 
- 2. Dispatch subagent using the **Agent tool** with `subagent_type: "harness-evolver-critic"`:
+ 2. Read the critic agent definition:
+ ```bash
+ cat ~/.claude/agents/harness-evolver-critic.md
+ ```
+
+ 3. Dispatch using the Agent tool — include the agent definition in the prompt:
 
  ```
  Agent(
- subagent_type: "harness-evolver-critic",
  description: "Critic: analyze eval quality",
  prompt: |
+ <agent_instructions>
+ {paste the FULL content of harness-evolver-critic.md here}
+ </agent_instructions>
+
  <objective>
  Analyze eval quality for this harness evolution project.
  The best version is {version} with score {score} achieved in {iterations} iteration(s).
@@ -48,78 +48,184 @@ For the first iteration, use `baseline` as the version. For subsequent iteration
 
  These files are included in the proposer's `<files_to_read>` so it has real trace data for diagnosis.
 
- ### 2. Propose
+ ### 2. Propose (3 parallel candidates)
 
- Dispatch a subagent using the **Agent tool** with `subagent_type: "harness-evolver-proposer"`.
+ Spawn 3 proposer agents IN PARALLEL, each with a different evolutionary strategy.
+ This follows the DGM/AlphaEvolve pattern: exploit + explore + crossover.
 
- The prompt MUST be the full XML block below. The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered in step 1.5.
+ First, read the proposer agent definition:
+ ```bash
+ cat ~/.claude/agents/harness-evolver-proposer.md
+ ```
+
+ Then determine parents for each strategy:
+ - **Exploiter parent**: current best version (from summary.json `best.version`)
+ - **Explorer parent**: a non-best version with low offspring count (read summary.json history, pick one that scored >0 but is NOT the best and has NOT been parent to many children)
+ - **Crossover parents**: best version + a different high-scorer from a different lineage
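The parent rules above can be sketched against a toy history. This is illustrative only: the `best.version` and `history[].{version, score, parent}` fields are assumptions about the summary.json schema, not confirmed by the package.

```python
from collections import Counter

# Hypothetical summary.json contents (schema assumed, see lead-in)
summary = {
    "best": {"version": "v4"},
    "history": [
        {"version": "v1", "score": 0.50, "parent": "baseline"},
        {"version": "v2", "score": 0.61, "parent": "v1"},
        {"version": "v3", "score": 0.75, "parent": "v1"},
        {"version": "v4", "score": 0.82, "parent": "v2"},
        {"version": "v5", "score": 0.58, "parent": "v3"},
    ],
}

best = summary["best"]["version"]  # exploiter parent: current best
offspring = Counter(h["parent"] for h in summary["history"])
pool = [h for h in summary["history"] if h["score"] > 0 and h["version"] != best]

# Explorer parent: non-best, scored > 0, fewest children (ties broken by score)
explorer = min(pool, key=lambda h: (offspring[h["version"]], -h["score"]))["version"]

# Crossover partner: top scorer outside the best version's direct lineage
parents = {h["version"]: h["parent"] for h in summary["history"]}
best_line, cur = set(), best
while cur in parents:
    best_line.add(cur)
    cur = parents[cur]
best_line.add(cur)
cross = max((h for h in pool if h["version"] not in best_line),
            key=lambda h: h["score"])["version"]
```

With this toy history, the exploiter gets `v4`, the explorer gets `v5` (zero children so far), and the crossover pairs `v4` with `v3` (best scorer off the `v4` lineage).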
 
+ Spawn all 3 using the Agent tool. The first 2 use `run_in_background: true`, the 3rd blocks:
+
+ **Candidate A (Exploiter)** — `run_in_background: true`:
  ```
  Agent(
- subagent_type: "harness-evolver-proposer",
- description: "Propose harness {version}",
+ description: "Proposer A (exploit): targeted fix for {version}",
+ run_in_background: true,
  prompt: |
+ <agent_instructions>
+ {FULL content of harness-evolver-proposer.md}
+ </agent_instructions>
+
+ <strategy>
+ APPROACH: exploitation
+ You are the EXPLOITER. Make the SMALLEST, most targeted change that fixes
+ the highest-impact failing tasks. Base your work on the current best version.
+ Do NOT restructure the code. Do NOT change the architecture.
+ Focus on: prompt tweaks, parameter tuning, fixing specific failure modes.
+ </strategy>
+
  <objective>
- Propose harness version {version} that improves on the current best score of {best_score}.
+ Propose harness version {version}a that improves on {best_score}.
  </objective>
 
  <files_to_read>
  - .harness-evolver/summary.json
  - .harness-evolver/PROPOSER_HISTORY.md
  - .harness-evolver/config.json
- - .harness-evolver/baseline/harness.py
  - .harness-evolver/harnesses/{best_version}/harness.py
  - .harness-evolver/harnesses/{best_version}/scores.json
  - .harness-evolver/harnesses/{best_version}/proposal.md
- - .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
- - .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
- - .harness-evolver/context7_docs.md (if exists — current library documentation)
- - .harness-evolver/architecture.json (if exists — architect topology recommendation)
+ - .harness-evolver/langsmith_diagnosis.json (if exists)
+ - .harness-evolver/langsmith_stats.json (if exists)
+ - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
  <output>
- Create directory .harness-evolver/harnesses/{version}/ containing:
- - harness.py (the improved harness)
- - config.json (parameters, copy from parent if unchanged)
- - proposal.md (reasoning, must start with "Based on v{PARENT}")
+ Create directory .harness-evolver/harnesses/{version}a/ containing:
+ - harness.py, config.json, proposal.md
  </output>
+ )
+ ```
 
- <success_criteria>
- - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
- - proposal.md documents evidence-based reasoning
- - Changes are motivated by trace analysis (LangSmith data if available), not guesswork
- - If context7_docs.md was provided, API usage must match current documentation
- </success_criteria>
+ **Candidate B (Explorer)** — `run_in_background: true`:
+ ```
+ Agent(
+ description: "Proposer B (explore): bold change from {explorer_parent}",
+ run_in_background: true,
+ prompt: |
+ <agent_instructions>
+ {FULL content of harness-evolver-proposer.md}
+ </agent_instructions>
+
+ <strategy>
+ APPROACH: exploration
+ You are the EXPLORER. Try a FUNDAMENTALLY DIFFERENT approach.
+ Base your work on {explorer_parent} (NOT the current best — intentionally diverging).
+ Consider: different retrieval strategy, different prompt structure,
+ different output parsing, different error handling philosophy.
+ Be bold. A creative failure teaches more than a timid success.
+ </strategy>
+
+ <objective>
+ Propose harness version {version}b that takes a different approach.
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/summary.json
+ - .harness-evolver/PROPOSER_HISTORY.md
+ - .harness-evolver/config.json
+ - .harness-evolver/baseline/harness.py
+ - .harness-evolver/harnesses/{explorer_parent}/harness.py
+ - .harness-evolver/harnesses/{explorer_parent}/scores.json
+ - .harness-evolver/langsmith_diagnosis.json (if exists)
+ - .harness-evolver/architecture.json (if exists)
+ </files_to_read>
+
+ <output>
+ Create directory .harness-evolver/harnesses/{version}b/ containing:
+ - harness.py, config.json, proposal.md
+ </output>
+ )
+ ```
+
+ **Candidate C (Crossover)** — blocks (last one):
+ ```
+ Agent(
+ description: "Proposer C (crossover): combine {parent_a} + {parent_b}",
+ prompt: |
+ <agent_instructions>
+ {FULL content of harness-evolver-proposer.md}
+ </agent_instructions>
+
+ <strategy>
+ APPROACH: crossover
+ You are the CROSSOVER agent. Combine the STRENGTHS of two different versions:
+ - {parent_a} (score: {score_a}): {summary of what it does well}
+ - {parent_b} (score: {score_b}): {summary of what it does well}
+ Take the best elements from each and merge them into a single harness.
+ </strategy>
+
+ <objective>
+ Propose harness version {version}c that combines the best of {parent_a} and {parent_b}.
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/summary.json
+ - .harness-evolver/PROPOSER_HISTORY.md
+ - .harness-evolver/config.json
+ - .harness-evolver/harnesses/{parent_a}/harness.py
+ - .harness-evolver/harnesses/{parent_a}/scores.json
+ - .harness-evolver/harnesses/{parent_b}/harness.py
+ - .harness-evolver/harnesses/{parent_b}/scores.json
+ - .harness-evolver/langsmith_diagnosis.json (if exists)
+ - .harness-evolver/architecture.json (if exists)
+ </files_to_read>
+
+ <output>
+ Create directory .harness-evolver/harnesses/{version}c/ containing:
+ - harness.py, config.json, proposal.md
+ </output>
  )
  ```
 
- Wait for `## PROPOSAL COMPLETE` in the response. The subagent gets a fresh context window and loads the agent definition from `~/.claude/agents/harness-evolver-proposer.md`.
+ Wait for all 3 to complete. The background agents will notify when done.
+
+ **Special case — iteration 1**: Only the exploiter and explorer can run (no second parent for crossover yet). Spawn 2 agents: exploiter (from baseline) and explorer (also from baseline but with bold strategy). Skip crossover.
 
- ### 3. Validate
+ **Special case — iteration 2+**: All 3 strategies. Explorer parent = fitness-weighted random from history excluding current best.
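A fitness-weighted random draw can be sketched as follows. The `history` entries and their field names are assumptions mirroring the summary.json schema used elsewhere, not confirmed by the package.

```python
import random

best = "v4"  # hypothetical current best, excluded from the draw
history = [
    {"version": "v1", "score": 0.50},
    {"version": "v2", "score": 0.61},
    {"version": "v3", "score": 0.75},
    {"version": "v4", "score": 0.82},
]

# Exclude the current best; weight the remaining versions by their score (fitness)
pool = [h for h in history if h["version"] != best and h["score"] > 0]
explorer_parent = random.choices(
    pool, weights=[h["score"] for h in pool], k=1
)[0]["version"]
```

Higher-scoring non-best versions are drawn more often, but any positive scorer can be picked, which keeps the explorer from collapsing onto a single lineage.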
 
+ ### 3. Validate All Candidates
+
+ For each candidate (a, b, c):
  ```bash
- python3 $TOOLS/evaluate.py validate \
- --harness .harness-evolver/harnesses/{version}/harness.py \
- --config .harness-evolver/harnesses/{version}/config.json
+ python3 $TOOLS/evaluate.py validate --harness .harness-evolver/harnesses/{version}{suffix}/harness.py --config .harness-evolver/harnesses/{version}{suffix}/config.json
  ```
 
- If fails: one retry via proposer. If still fails: score 0.0, continue.
+ Remove any that fail validation.
 
- ### 4. Evaluate
+ ### 4. Evaluate All Candidates
 
+ For each valid candidate:
  ```bash
  python3 $TOOLS/evaluate.py run \
- --harness .harness-evolver/harnesses/{version}/harness.py \
- --config .harness-evolver/harnesses/{version}/config.json \
+ --harness .harness-evolver/harnesses/{version}{suffix}/harness.py \
+ --config .harness-evolver/harnesses/{version}{suffix}/config.json \
  --tasks-dir .harness-evolver/eval/tasks/ \
  --eval .harness-evolver/eval/eval.py \
- --traces-dir .harness-evolver/harnesses/{version}/traces/ \
- --scores .harness-evolver/harnesses/{version}/scores.json \
+ --traces-dir .harness-evolver/harnesses/{version}{suffix}/traces/ \
+ --scores .harness-evolver/harnesses/{version}{suffix}/scores.json \
  --timeout 60
  ```
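Steps 3 and 4 amount to a small loop over the candidate suffixes. A sketch in Python (the `TOOLS` path and `version` value are hypothetical stand-ins for the `$TOOLS` variable and the current iteration; the real runs would go through `subprocess.run`):

```python
from pathlib import Path

TOOLS = ".harness-evolver/tools"  # assumed resolution of $TOOLS
version = "v7"                     # hypothetical iteration version

def candidate_dir(suffix: str) -> Path:
    return Path(".harness-evolver/harnesses") / f"{version}{suffix}"

# Step 3: validate each candidate; keep only the ones that pass
valid = []
for suffix in ("a", "b", "c"):
    d = candidate_dir(suffix)
    validate_cmd = ["python3", f"{TOOLS}/evaluate.py", "validate",
                    "--harness", str(d / "harness.py"),
                    "--config", str(d / "config.json")]
    # Real run: append suffix only if subprocess.run(validate_cmd) exits 0
    valid.append(suffix)

# Step 4: full evaluation for each survivor (same flags as the command above)
for suffix in valid:
    d = candidate_dir(suffix)
    run_cmd = ["python3", f"{TOOLS}/evaluate.py", "run",
               "--harness", str(d / "harness.py"),
               "--config", str(d / "config.json"),
               "--tasks-dir", ".harness-evolver/eval/tasks/",
               "--eval", ".harness-evolver/eval/eval.py",
               "--traces-dir", str(d / "traces"),
               "--scores", str(d / "scores.json"),
               "--timeout", "60"]
    # Real run: subprocess.run(run_cmd)
```

Each candidate writes its own `scores.json` under its suffixed directory, so the winner selection in step 5 can compare them without any shared state.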
 
- ### 5. Update State
+ ### 5. Select Winner + Update State
 
+ Compare the scores of all evaluated candidates. The winner is the one with the highest combined_score.
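The selection itself is a one-liner once each surviving candidate's scores.json is loaded. A sketch with hypothetical score values (the `combined_score` field is the one named by the selection rule):

```python
# Assume each surviving candidate's scores.json was loaded into this dict;
# candidates removed at validation simply have no entry
candidate_scores = {
    "a": {"combined_score": 0.74},
    "b": {"combined_score": 0.81},
    "c": {"combined_score": 0.79},
}

# Winner: the suffix with the highest combined_score
winner = max(candidate_scores, key=lambda s: candidate_scores[s]["combined_score"])
```

Here `winner` is `"b"`, so `.harness-evolver/harnesses/{version}b` would be the directory promoted by the `mv` below.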
+
+ Rename the winner directory to the official version name:
+ ```bash
+ mv .harness-evolver/harnesses/{version}{winning_suffix} .harness-evolver/harnesses/{version}
+ ```
+
+ Update state with the winner:
  ```bash
  python3 $TOOLS/state.py update \
  --base-dir .harness-evolver
@@ -128,6 +234,17 @@ python3 $TOOLS/state.py update \
  --proposal .harness-evolver/harnesses/{version}/proposal.md
  ```
 
+ Report ALL candidates:
+ ```
+ Iteration {i}/{N} — 3 candidates evaluated:
+ {version}a (exploit): {score_a} — {1-line summary from proposal.md}
+ {version}b (explore): {score_b} — {1-line summary}
+ {version}c (cross): {score_c} — {1-line summary}
+ Winner: {version}{suffix} ({score}) ← promoted to {version}
+ ```
+
+ Keep losing candidates in their directories (they're part of the archive — never discard, per DGM).
+
  ### 6. Report
 
  Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
@@ -150,13 +267,21 @@ python3 $TOOLS/evaluate.py run \
  --timeout 60
  ```
 
- Dispatch critic subagent using the **Agent tool** with `subagent_type: "harness-evolver-critic"`:
+ First read the critic agent definition:
+ ```bash
+ cat ~/.claude/agents/harness-evolver-critic.md
+ ```
+
+ Then dispatch:
 
  ```
  Agent(
- subagent_type: "harness-evolver-critic",
  description: "Critic: analyze eval quality",
  prompt: |
+ <agent_instructions>
+ {paste the FULL content of harness-evolver-critic.md here}
+ </agent_instructions>
+
  <objective>
  EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
  Analyze the eval quality and propose a stricter eval.
@@ -219,13 +344,21 @@ python3 $TOOLS/analyze_architecture.py \
  -o .harness-evolver/architecture_signals.json
  ```
 
- Dispatch architect subagent using the **Agent tool** with `subagent_type: "harness-evolver-architect"`:
+ First read the architect agent definition:
+ ```bash
+ cat ~/.claude/agents/harness-evolver-architect.md
+ ```
+
+ Then dispatch:
 
  ```
  Agent(
- subagent_type: "harness-evolver-architect",
  description: "Architect: analyze topology after {stagnation/regression}",
  prompt: |
+ <agent_instructions>
+ {paste the FULL content of harness-evolver-architect.md here}
+ </agent_instructions>
+
  <objective>
  The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
  Analyze the harness architecture and recommend a topology change.