harness-evolver 1.4.0 → 1.5.0
- package/package.json +1 -1
- package/skills/architect/SKILL.md +34 -29
- package/skills/critic/SKILL.md +36 -30
- package/skills/evolve/SKILL.md +111 -96
package/package.json
CHANGED
package/skills/architect/SKILL.md
CHANGED
@@ -48,35 +48,40 @@ python3 $TOOLS/analyze_architecture.py \
 -o .harness-evolver/architecture_signals.json
 ```
 
-3.
-
-```
-
-
-
-
-
-
-
-
-
-
-
-- .harness-evolver/
-
-
-
-
-
-
-
-
-
--
-
-
-
-
+3. Dispatch subagent using the **Agent tool** with `subagent_type: "harness-evolver-architect"`:
+
+```
+Agent(
+subagent_type: "harness-evolver-architect",
+description: "Architect: topology analysis",
+prompt: |
+<objective>
+Analyze the harness architecture and recommend the optimal multi-agent topology.
+{If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
+{If called by user: "The user requested an architecture analysis."}
+</objective>
+
+<files_to_read>
+- .harness-evolver/architecture_signals.json
+- .harness-evolver/config.json
+- .harness-evolver/baseline/harness.py
+- .harness-evolver/summary.json (if exists)
+- .harness-evolver/PROPOSER_HISTORY.md (if exists)
+</files_to_read>
+
+<output>
+Write:
+- .harness-evolver/architecture.json
+- .harness-evolver/architecture.md
+</output>
+
+<success_criteria>
+- Classifies current topology correctly
+- Recommendation includes migration path with concrete steps
+- Considers detected stack and API key availability
+- Confidence rating is honest (low/medium/high)
+</success_criteria>
+)
 ```
 
 4. Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
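The `<files_to_read>` block above tags some entries with `(if exists)`. A minimal sketch of how a caller might resolve such a list before building the prompt is shown below. The helper name `resolve_files_to_read`, the `exists` parameter, and the exact-suffix matching are assumptions for illustration; the package itself may handle optional files differently.

```python
import os

# Assumed convention: optional entries end with exactly this tag.
OPTIONAL_TAG = " (if exists)"


def resolve_files_to_read(entries, exists=os.path.isfile):
    """Return the paths to list in a subagent prompt.

    Required entries are always kept. Entries tagged "(if exists)" are
    kept only when `exists(path)` is true. Hypothetical helper, not part
    of harness-evolver.
    """
    resolved = []
    for entry in entries:
        if entry.endswith(OPTIONAL_TAG):
            path = entry[: -len(OPTIONAL_TAG)]
            if exists(path):
                resolved.append(path)
        else:
            resolved.append(entry)
    return resolved
```

Passing `exists` as a parameter keeps the sketch testable without touching the filesystem; by default it falls back to `os.path.isfile`.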
package/skills/critic/SKILL.md
CHANGED
@@ -22,36 +22,42 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo
 
 1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).
 
-2.
-
-```
-
-
-
-
-
-
-
-
-
-
-
-- .harness-evolver/
-- .harness-evolver/
-
-
-
-
-- .harness-evolver/
-
-
-
-
--
--
-
-
-
+2. Dispatch subagent using the **Agent tool** with `subagent_type: "harness-evolver-critic"`:
+
+```
+Agent(
+subagent_type: "harness-evolver-critic",
+description: "Critic: analyze eval quality",
+prompt: |
+<objective>
+Analyze eval quality for this harness evolution project.
+The best version is {version} with score {score} achieved in {iterations} iteration(s).
+{Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
+</objective>
+
+<files_to_read>
+- .harness-evolver/eval/eval.py
+- .harness-evolver/summary.json
+- .harness-evolver/harnesses/{best_version}/scores.json
+- .harness-evolver/harnesses/{best_version}/harness.py
+- .harness-evolver/harnesses/{best_version}/proposal.md
+- .harness-evolver/config.json
+- .harness-evolver/langsmith_stats.json (if exists)
+</files_to_read>
+
+<output>
+Write:
+- .harness-evolver/critic_report.md (human-readable analysis)
+- .harness-evolver/eval/eval_improved.py (if weaknesses found)
+</output>
+
+<success_criteria>
+- Identifies specific weaknesses in eval.py with examples
+- If gaming detected, shows exact tasks/outputs that expose the weakness
+- Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+- Re-scores the best version with improved eval to quantify the difference
+</success_criteria>
+)
 ```
 
 3. Wait for `## CRITIC REPORT COMPLETE`.
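One success criterion above requires the improved eval to preserve the `--results-dir/--tasks-dir/--scores` interface. The sketch below shows what an argument parser exposing those three flags could look like. Only the flag names come from the diff; the function name, the help texts, and the choice to make every flag required are assumptions.

```python
import argparse


def build_eval_parser():
    """Build a parser with the three flags named in the success criteria.

    The meaning attached to each flag in the help text is an assumption.
    """
    parser = argparse.ArgumentParser(description="Hypothetical eval entry point")
    parser.add_argument("--results-dir", required=True,
                        help="directory holding harness outputs (assumed meaning)")
    parser.add_argument("--tasks-dir", required=True,
                        help="directory holding task definitions (assumed meaning)")
    parser.add_argument("--scores", required=True,
                        help="path for the scores file (assumed meaning)")
    return parser
```

An `eval_improved.py` built on a parser like this would accept the same command line as the existing `eval.py`, which is what lets the critic re-score the best version with both scripts.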
package/skills/evolve/SKILL.md
CHANGED
@@ -70,45 +70,50 @@ If Context7 MCP is not available, skip silently.
 
 ### 2. Propose
 
-
-
-The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered
-
-```
-
-
-
-
-<
-
-
-
-
-- .harness-evolver/
-- .harness-evolver/
-- .harness-evolver/
-- .harness-evolver/
-- .harness-evolver/
-- .harness-evolver/
-- .harness-evolver/
-
-
-
-
-
-
-
-
-
-
--
-
-
-
-
+Dispatch a subagent using the **Agent tool** with `subagent_type: "harness-evolver-proposer"`.
+
+The prompt MUST be the full XML block below. The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered in step 1.5.
+
+```
+Agent(
+subagent_type: "harness-evolver-proposer",
+description: "Propose harness {version}",
+prompt: |
+<objective>
+Propose harness version {version} that improves on the current best score of {best_score}.
+</objective>
+
+<files_to_read>
+- .harness-evolver/summary.json
+- .harness-evolver/PROPOSER_HISTORY.md
+- .harness-evolver/config.json
+- .harness-evolver/baseline/harness.py
+- .harness-evolver/harnesses/{best_version}/harness.py
+- .harness-evolver/harnesses/{best_version}/scores.json
+- .harness-evolver/harnesses/{best_version}/proposal.md
+- .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
+- .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
+- .harness-evolver/context7_docs.md (if exists — current library documentation)
+- .harness-evolver/architecture.json (if exists — architect topology recommendation)
+</files_to_read>
+
+<output>
+Create directory .harness-evolver/harnesses/{version}/ containing:
+- harness.py (the improved harness)
+- config.json (parameters, copy from parent if unchanged)
+- proposal.md (reasoning, must start with "Based on v{PARENT}")
+</output>
+
+<success_criteria>
+- harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
+- proposal.md documents evidence-based reasoning
+- Changes are motivated by trace analysis (LangSmith data if available), not guesswork
+- If context7_docs.md was provided, API usage must match current documentation
+</success_criteria>
+)
 ```
 
-Wait for
+Wait for `## PROPOSAL COMPLETE` in the response. The subagent gets a fresh context window and loads the agent definition from `~/.claude/agents/harness-evolver-proposer.md`.
 
 ### 3. Validate
 
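The `<output>` block above says `proposal.md` must start with "Based on v{PARENT}". A minimal sketch of checking that rule and extracting the parent version follows. The function name `parent_version` and the assumption that the parent identifier consists of letters, digits, and underscores are hypothetical; the diff does not specify the identifier format.

```python
import re

# Assumed format: "Based on v" followed by a word-like parent identifier.
_PARENT_RE = re.compile(r"Based on v([A-Za-z0-9_]+)")


def parent_version(proposal_text):
    """Return the parent identifier if the text starts with "Based on v...".

    Returns None when the required prefix is missing. re.match anchors at
    the start of the string, so the phrase must be the very first text.
    """
    match = _PARENT_RE.match(proposal_text)
    return match.group(1) if match else None
```

A validation step could reject a proposal when this returns `None`, or compare the returned identifier against the known best version.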
@@ -165,36 +170,41 @@ python3 $TOOLS/evaluate.py run \
 --timeout 60
 ```
 
-
-
-```
-
-
-
-
-
-
-
-
-
-
-- .harness-evolver/
-- .harness-evolver/
-- .harness-evolver/
-
-
-
-
-
-
-
-
-
--
-
-
-
-
+Dispatch critic subagent using the **Agent tool** with `subagent_type: "harness-evolver-critic"`:
+
+```
+Agent(
+subagent_type: "harness-evolver-critic",
+description: "Critic: analyze eval quality",
+prompt: |
+<objective>
+EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
+Analyze the eval quality and propose a stricter eval.
+</objective>
+
+<files_to_read>
+- .harness-evolver/eval/eval.py
+- .harness-evolver/summary.json
+- .harness-evolver/harnesses/{version}/scores.json
+- .harness-evolver/harnesses/{version}/harness.py
+- .harness-evolver/harnesses/{version}/proposal.md
+- .harness-evolver/config.json
+- .harness-evolver/langsmith_stats.json (if exists)
+</files_to_read>
+
+<output>
+Write:
+- .harness-evolver/critic_report.md
+- .harness-evolver/eval/eval_improved.py (if weaknesses found)
+</output>
+
+<success_criteria>
+- Identifies specific weaknesses in eval.py with task/output examples
+- If gaming detected, shows exact tasks that expose the weakness
+- Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+- Re-scores the best version with improved eval to show the difference
+</success_criteria>
+)
 ```
 
 Wait for `## CRITIC REPORT COMPLETE`.
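The critic prompt above is dispatched when the score jumps sharply in one iteration. The hunk does not show how that jump is detected, so the sketch below is only one plausible check. The function names and the `max_gain` threshold of 0.3 are assumptions, not values taken from the package.

```python
def is_suspicious_jump(parent_score, score, max_gain=0.3):
    """Flag a single-iteration gain larger than `max_gain`.

    The 0.3 default is a hypothetical threshold chosen for illustration.
    """
    return (score - parent_score) > max_gain


def gaming_objective(parent_score, score):
    """Fill the first objective line of the critic prompt shown in the diff."""
    return (
        f"EVAL GAMING DETECTED: Score jumped from {parent_score} "
        f"to {score} in one iteration."
    )
```

For example, a move from 0.4 to 0.95 would be flagged under this threshold, while a move from 0.6 to 0.7 would not.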
@@ -229,35 +239,40 @@ python3 $TOOLS/analyze_architecture.py \
 -o .harness-evolver/architecture_signals.json
 ```
 
-
-
-```
-
-
-
-
-
-
-
-
-
-
-- .harness-evolver/
-- .harness-evolver/
-- .harness-evolver/
-
-
-
-
-
-
-
-
-
--
-
-
-
+Dispatch architect subagent using the **Agent tool** with `subagent_type: "harness-evolver-architect"`:
+
+```
+Agent(
+subagent_type: "harness-evolver-architect",
+description: "Architect: analyze topology after {stagnation/regression}",
+prompt: |
+<objective>
+The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
+Analyze the harness architecture and recommend a topology change.
+</objective>
+
+<files_to_read>
+- .harness-evolver/architecture_signals.json
+- .harness-evolver/summary.json
+- .harness-evolver/PROPOSER_HISTORY.md
+- .harness-evolver/config.json
+- .harness-evolver/harnesses/{best_version}/harness.py
+- .harness-evolver/harnesses/{best_version}/scores.json
+- .harness-evolver/context7_docs.md (if exists)
+</files_to_read>
+
+<output>
+Write:
+- .harness-evolver/architecture.json (structured recommendation)
+- .harness-evolver/architecture.md (human-readable analysis)
+</output>
+
+<success_criteria>
+- Recommendation includes concrete migration steps
+- Each step is implementable in one proposer iteration
+- Considers detected stack and available API keys
+</success_criteria>
+)
 ```
 
 Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
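Each dispatch in this diff ends with a "Wait for" step naming a completion marker (`## PROPOSAL COMPLETE`, `## CRITIC REPORT COMPLETE`, `## ARCHITECTURE ANALYSIS COMPLETE`). A minimal sketch of testing a subagent response for such a marker is given below. The function name is hypothetical, and the requirement that the marker sit on its own line is an assumption; the diff only says to wait for the marker in the response.

```python
def has_completion_marker(response, marker):
    """Return True when `marker` appears as a whole line in `response`.

    Surrounding whitespace on the line is ignored. Matching whole lines
    (rather than substrings) is an assumed convention.
    """
    return any(line.strip() == marker for line in response.splitlines())
```

Matching whole lines avoids a false positive when the response merely mentions the marker text inside a longer sentence.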