harness-evolver 1.4.0 → 1.5.0

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "1.4.0",
+ "version": "1.5.0",
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
  "author": "Raphael Valdetaro",
  "license": "MIT",
@@ -48,35 +48,40 @@ python3 $TOOLS/analyze_architecture.py \
  -o .harness-evolver/architecture_signals.json
  ```
 
- 3. Spawn the `harness-evolver-architect` agent:
-
- ```xml
- <objective>
- Analyze the harness architecture and recommend the optimal multi-agent topology.
- {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
- {If called by user: "The user requested an architecture analysis."}
- </objective>
-
- <files_to_read>
- - .harness-evolver/architecture_signals.json
- - .harness-evolver/config.json
- - .harness-evolver/baseline/harness.py
- - .harness-evolver/summary.json (if exists)
- - .harness-evolver/PROPOSER_HISTORY.md (if exists)
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/architecture.json
- - .harness-evolver/architecture.md
- </output>
-
- <success_criteria>
- - Classifies current topology correctly
- - Recommendation includes migration path with concrete steps
- - Considers detected stack and API key availability
- - Confidence rating is honest (low/medium/high)
- </success_criteria>
+ 3. Dispatch subagent using the **Agent tool** with `subagent_type: "harness-evolver-architect"`:
+
+ ```
+ Agent(
+ subagent_type: "harness-evolver-architect",
+ description: "Architect: topology analysis",
+ prompt: |
+ <objective>
+ Analyze the harness architecture and recommend the optimal multi-agent topology.
+ {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
+ {If called by user: "The user requested an architecture analysis."}
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/architecture_signals.json
+ - .harness-evolver/config.json
+ - .harness-evolver/baseline/harness.py
+ - .harness-evolver/summary.json (if exists)
+ - .harness-evolver/PROPOSER_HISTORY.md (if exists)
+ </files_to_read>
+
+ <output>
+ Write:
+ - .harness-evolver/architecture.json
+ - .harness-evolver/architecture.md
+ </output>
+
+ <success_criteria>
+ - Classifies current topology correctly
+ - Recommendation includes migration path with concrete steps
+ - Considers detected stack and API key availability
+ - Confidence rating is honest (low/medium/high)
+ </success_criteria>
+ )
  ```
 
  4. Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
@@ -22,36 +22,42 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo
 
  1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).
 
- 2. Spawn the `harness-evolver-critic` agent:
-
- ```xml
- <objective>
- Analyze eval quality for this harness evolution project.
- The best version is {version} with score {score} achieved in {iterations} iteration(s).
- {Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
- </objective>
-
- <files_to_read>
- - .harness-evolver/eval/eval.py
- - .harness-evolver/summary.json
- - .harness-evolver/harnesses/{best_version}/scores.json
- - .harness-evolver/harnesses/{best_version}/harness.py
- - .harness-evolver/harnesses/{best_version}/proposal.md
- - .harness-evolver/config.json
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/critic_report.md (human-readable analysis)
- - .harness-evolver/eval/eval_improved.py (if weaknesses found)
- </output>
-
- <success_criteria>
- - Identifies specific weaknesses in eval.py with examples
- - If gaming detected, shows exact tasks/outputs that expose the weakness
- - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
- - Re-scores the best version with improved eval to quantify the difference
- </success_criteria>
+ 2. Dispatch subagent using the **Agent tool** with `subagent_type: "harness-evolver-critic"`:
+
+ ```
+ Agent(
+ subagent_type: "harness-evolver-critic",
+ description: "Critic: analyze eval quality",
+ prompt: |
+ <objective>
+ Analyze eval quality for this harness evolution project.
+ The best version is {version} with score {score} achieved in {iterations} iteration(s).
+ {Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/eval/eval.py
+ - .harness-evolver/summary.json
+ - .harness-evolver/harnesses/{best_version}/scores.json
+ - .harness-evolver/harnesses/{best_version}/harness.py
+ - .harness-evolver/harnesses/{best_version}/proposal.md
+ - .harness-evolver/config.json
+ - .harness-evolver/langsmith_stats.json (if exists)
+ </files_to_read>
+
+ <output>
+ Write:
+ - .harness-evolver/critic_report.md (human-readable analysis)
+ - .harness-evolver/eval/eval_improved.py (if weaknesses found)
+ </output>
+
+ <success_criteria>
+ - Identifies specific weaknesses in eval.py with examples
+ - If gaming detected, shows exact tasks/outputs that expose the weakness
+ - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+ - Re-scores the best version with improved eval to quantify the difference
+ </success_criteria>
+ )
  ```
 
  3. Wait for `## CRITIC REPORT COMPLETE`.
@@ -70,45 +70,50 @@ If Context7 MCP is not available, skip silently.
 
  ### 2. Propose
 
- Spawn the `harness-evolver-proposer` agent with a structured prompt.
-
- The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered:
-
- ```xml
- <objective>
- Propose harness version {version} that improves on the current best score of {best_score}.
- </objective>
-
- <files_to_read>
- - .harness-evolver/summary.json
- - .harness-evolver/PROPOSER_HISTORY.md
- - .harness-evolver/config.json
- - .harness-evolver/baseline/harness.py
- - .harness-evolver/harnesses/{best_version}/harness.py
- - .harness-evolver/harnesses/{best_version}/scores.json
- - .harness-evolver/harnesses/{best_version}/proposal.md
- - .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
- - .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
- - .harness-evolver/context7_docs.md (if exists — current library documentation)
- - .harness-evolver/architecture.json (if exists — architect topology recommendation)
- </files_to_read>
-
- <output>
- Create directory .harness-evolver/harnesses/{version}/ containing:
- - harness.py (the improved harness)
- - config.json (parameters, copy from parent if unchanged)
- - proposal.md (reasoning, must start with "Based on v{PARENT}")
- </output>
-
- <success_criteria>
- - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
- - proposal.md documents evidence-based reasoning
- - Changes are motivated by trace analysis (LangSmith data if available), not guesswork
- - If context7_docs.md was provided, API usage must match current documentation
- </success_criteria>
+ Dispatch a subagent using the **Agent tool** with `subagent_type: "harness-evolver-proposer"`.
+
+ The prompt MUST be the full XML block below. The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered in step 1.5.
+
+ ```
+ Agent(
+ subagent_type: "harness-evolver-proposer",
+ description: "Propose harness {version}",
+ prompt: |
+ <objective>
+ Propose harness version {version} that improves on the current best score of {best_score}.
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/summary.json
+ - .harness-evolver/PROPOSER_HISTORY.md
+ - .harness-evolver/config.json
+ - .harness-evolver/baseline/harness.py
+ - .harness-evolver/harnesses/{best_version}/harness.py
+ - .harness-evolver/harnesses/{best_version}/scores.json
+ - .harness-evolver/harnesses/{best_version}/proposal.md
+ - .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
+ - .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
+ - .harness-evolver/context7_docs.md (if exists — current library documentation)
+ - .harness-evolver/architecture.json (if exists — architect topology recommendation)
+ </files_to_read>
+
+ <output>
+ Create directory .harness-evolver/harnesses/{version}/ containing:
+ - harness.py (the improved harness)
+ - config.json (parameters, copy from parent if unchanged)
+ - proposal.md (reasoning, must start with "Based on v{PARENT}")
+ </output>
+
+ <success_criteria>
+ - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
+ - proposal.md documents evidence-based reasoning
+ - Changes are motivated by trace analysis (LangSmith data if available), not guesswork
+ - If context7_docs.md was provided, API usage must match current documentation
+ </success_criteria>
+ )
  ```
 
- Wait for the agent to complete. Look for `## PROPOSAL COMPLETE` in the response.
+ Wait for `## PROPOSAL COMPLETE` in the response. The subagent gets a fresh context window and loads the agent definition from `~/.claude/agents/harness-evolver-proposer.md`.
 
  ### 3. Validate
 
@@ -165,36 +170,41 @@ python3 $TOOLS/evaluate.py run \
  --timeout 60
  ```
 
- Spawn the `harness-evolver-critic` agent:
-
- ```xml
- <objective>
- EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
- Analyze the eval quality and propose a stricter eval.
- </objective>
-
- <files_to_read>
- - .harness-evolver/eval/eval.py
- - .harness-evolver/summary.json
- - .harness-evolver/harnesses/{version}/scores.json
- - .harness-evolver/harnesses/{version}/harness.py
- - .harness-evolver/harnesses/{version}/proposal.md
- - .harness-evolver/config.json
- - .harness-evolver/langsmith_stats.json (if exists)
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/critic_report.md
- - .harness-evolver/eval/eval_improved.py (if weaknesses found)
- </output>
-
- <success_criteria>
- - Identifies specific weaknesses in eval.py with task/output examples
- - If gaming detected, shows exact tasks that expose the weakness
- - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
- - Re-scores the best version with improved eval to show the difference
- </success_criteria>
+ Dispatch critic subagent using the **Agent tool** with `subagent_type: "harness-evolver-critic"`:
+
+ ```
+ Agent(
+ subagent_type: "harness-evolver-critic",
+ description: "Critic: analyze eval quality",
+ prompt: |
+ <objective>
+ EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
+ Analyze the eval quality and propose a stricter eval.
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/eval/eval.py
+ - .harness-evolver/summary.json
+ - .harness-evolver/harnesses/{version}/scores.json
+ - .harness-evolver/harnesses/{version}/harness.py
+ - .harness-evolver/harnesses/{version}/proposal.md
+ - .harness-evolver/config.json
+ - .harness-evolver/langsmith_stats.json (if exists)
+ </files_to_read>
+
+ <output>
+ Write:
+ - .harness-evolver/critic_report.md
+ - .harness-evolver/eval/eval_improved.py (if weaknesses found)
+ </output>
+
+ <success_criteria>
+ - Identifies specific weaknesses in eval.py with task/output examples
+ - If gaming detected, shows exact tasks that expose the weakness
+ - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+ - Re-scores the best version with improved eval to show the difference
+ </success_criteria>
+ )
  ```
 
  Wait for `## CRITIC REPORT COMPLETE`.
@@ -229,35 +239,40 @@ python3 $TOOLS/analyze_architecture.py \
  -o .harness-evolver/architecture_signals.json
  ```
 
- Spawn the `harness-evolver-architect` agent:
-
- ```xml
- <objective>
- The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
- Analyze the harness architecture and recommend a topology change.
- </objective>
-
- <files_to_read>
- - .harness-evolver/architecture_signals.json
- - .harness-evolver/summary.json
- - .harness-evolver/PROPOSER_HISTORY.md
- - .harness-evolver/config.json
- - .harness-evolver/harnesses/{best_version}/harness.py
- - .harness-evolver/harnesses/{best_version}/scores.json
- - .harness-evolver/context7_docs.md (if exists)
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/architecture.json (structured recommendation)
- - .harness-evolver/architecture.md (human-readable analysis)
- </output>
-
- <success_criteria>
- - Recommendation includes concrete migration steps
- - Each step is implementable in one proposer iteration
- - Considers detected stack and available API keys
- </success_criteria>
+ Dispatch architect subagent using the **Agent tool** with `subagent_type: "harness-evolver-architect"`:
+
+ ```
+ Agent(
+ subagent_type: "harness-evolver-architect",
+ description: "Architect: analyze topology after {stagnation/regression}",
+ prompt: |
+ <objective>
+ The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
+ Analyze the harness architecture and recommend a topology change.
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/architecture_signals.json
+ - .harness-evolver/summary.json
+ - .harness-evolver/PROPOSER_HISTORY.md
+ - .harness-evolver/config.json
+ - .harness-evolver/harnesses/{best_version}/harness.py
+ - .harness-evolver/harnesses/{best_version}/scores.json
+ - .harness-evolver/context7_docs.md (if exists)
+ </files_to_read>
+
+ <output>
+ Write:
+ - .harness-evolver/architecture.json (structured recommendation)
+ - .harness-evolver/architecture.md (human-readable analysis)
+ </output>
+
+ <success_criteria>
+ - Recommendation includes concrete migration steps
+ - Each step is implementable in one proposer iteration
+ - Considers detected stack and available API keys
+ </success_criteria>
+ )
  ```
 
  Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
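
Reviewer note: the prompts in these hunks share two conventions, `(if exists ...)` entries in `<files_to_read>` and a `## ... COMPLETE` marker that the caller waits for. The Python sketch below illustrates one way those conventions could be handled. It is not part of the package: the function names `resolve_files_to_read`, `render_files_block`, and `is_complete` are hypothetical, and only the file paths and marker strings are taken from the diff.

```python
from pathlib import Path


def resolve_files_to_read(entries, root="."):
    """Return the paths to list, dropping optional entries whose file is missing.

    Each entry is a line as written in the prompts, for example
    ".harness-evolver/summary.json" or
    ".harness-evolver/langsmith_stats.json (if exists)".
    Assumption: paths contain no spaces, so the path is the first token.
    """
    kept = []
    for entry in entries:
        path = entry.split(" ", 1)[0]
        optional = "(if exists" in entry
        if optional and not (Path(root) / path).exists():
            continue  # optional file was never gathered, so leave it out
        kept.append(path)
    return kept


def render_files_block(paths):
    """Render the <files_to_read> section in the format used by the prompts."""
    lines = ["<files_to_read>"] + [f"- {p}" for p in paths] + ["</files_to_read>"]
    return "\n".join(lines)


def is_complete(response, marker):
    """True when the marker (e.g. '## PROPOSAL COMPLETE') is a line of its own."""
    return any(line.strip() == marker for line in response.splitlines())


if __name__ == "__main__":
    entries = [
        ".harness-evolver/summary.json",
        ".harness-evolver/langsmith_stats.json (if exists)",
    ]
    print(render_files_block(resolve_files_to_read(entries)))
    print(is_complete("...\n## PROPOSAL COMPLETE\n", "## PROPOSAL COMPLETE"))
```

Matching the marker as a whole line, rather than as a substring, avoids a false positive when the subagent merely quotes the marker inside a sentence; whether the real orchestrator needs that strictness is an open question.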