harness-evolver 1.3.0 → 1.5.0
- package/package.json +1 -1
- package/skills/architect/SKILL.md +34 -29
- package/skills/critic/SKILL.md +36 -30
- package/skills/evolve/SKILL.md +173 -73
package/package.json
CHANGED
package/skills/architect/SKILL.md
CHANGED
@@ -48,35 +48,40 @@ python3 $TOOLS/analyze_architecture.py \
 -o .harness-evolver/architecture_signals.json
 ```
 
-3.
-
-```
-
-
-
-
-
-
-
-
-
-
-
-- .harness-evolver/
-
-
-
-
-
-
-
-
-
--
-
-
-
-
+3. Dispatch subagent using the **Agent tool** with `subagent_type: "harness-evolver-architect"`:
+
+```
+Agent(
+subagent_type: "harness-evolver-architect",
+description: "Architect: topology analysis",
+prompt: |
+<objective>
+Analyze the harness architecture and recommend the optimal multi-agent topology.
+{If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
+{If called by user: "The user requested an architecture analysis."}
+</objective>
+
+<files_to_read>
+- .harness-evolver/architecture_signals.json
+- .harness-evolver/config.json
+- .harness-evolver/baseline/harness.py
+- .harness-evolver/summary.json (if exists)
+- .harness-evolver/PROPOSER_HISTORY.md (if exists)
+</files_to_read>
+
+<output>
+Write:
+- .harness-evolver/architecture.json
+- .harness-evolver/architecture.md
+</output>
+
+<success_criteria>
+- Classifies current topology correctly
+- Recommendation includes migration path with concrete steps
+- Considers detected stack and API key availability
+- Confidence rating is honest (low/medium/high)
+</success_criteria>
+)
 ```
 
 4. Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
package/skills/critic/SKILL.md
CHANGED
@@ -22,36 +22,42 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo
 
 1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).
 
-2.
-
-```
-
-
-
-
-
-
-
-
-
-
-
-- .harness-evolver/
-- .harness-evolver/
-
-
-
-
-- .harness-evolver/
-
-
-
-
--
--
-
-
-
+2. Dispatch subagent using the **Agent tool** with `subagent_type: "harness-evolver-critic"`:
+
+```
+Agent(
+subagent_type: "harness-evolver-critic",
+description: "Critic: analyze eval quality",
+prompt: |
+<objective>
+Analyze eval quality for this harness evolution project.
+The best version is {version} with score {score} achieved in {iterations} iteration(s).
+{Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
+</objective>
+
+<files_to_read>
+- .harness-evolver/eval/eval.py
+- .harness-evolver/summary.json
+- .harness-evolver/harnesses/{best_version}/scores.json
+- .harness-evolver/harnesses/{best_version}/harness.py
+- .harness-evolver/harnesses/{best_version}/proposal.md
+- .harness-evolver/config.json
+- .harness-evolver/langsmith_stats.json (if exists)
+</files_to_read>
+
+<output>
+Write:
+- .harness-evolver/critic_report.md (human-readable analysis)
+- .harness-evolver/eval/eval_improved.py (if weaknesses found)
+</output>
+
+<success_criteria>
+- Identifies specific weaknesses in eval.py with examples
+- If gaming detected, shows exact tasks/outputs that expose the weakness
+- Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+- Re-scores the best version with improved eval to quantify the difference
+</success_criteria>
+)
 ```
 
 3. Wait for `## CRITIC REPORT COMPLETE`.
package/skills/evolve/SKILL.md
CHANGED
@@ -5,7 +5,7 @@ argument-hint: "[--iterations N]"
 allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
 ---
 
-# /harness-evolve
+# /harness-evolver:evolve
 
 Run the autonomous propose-evaluate-iterate loop.
 
@@ -34,40 +34,86 @@ For each iteration:
 python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
 ```
 
-###
+### 1.5. Gather Diagnostic Context (LangSmith + Context7)
+
+**This step is MANDATORY before every propose.** The orchestrator gathers data so the proposer receives it as files.
+
+**LangSmith (if enabled):**
+
+Check if LangSmith is enabled and langsmith-cli is available:
+```bash
+cat .harness-evolver/config.json | python3 -c "import sys,json; print(json.load(sys.stdin).get('eval',{}).get('langsmith',{}).get('enabled',False))"
+which langsmith-cli 2>/dev/null
+```
+
+If BOTH are true AND at least one iteration has run, gather LangSmith data:
+```bash
+langsmith-cli --json runs list --project harness-evolver-{best_version} --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
+
+langsmith-cli --json runs stats --project harness-evolver-{best_version} > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
+```
 
-
+**Context7 (if available):**
 
-
-<objective>
-Propose harness version {version} that improves on the current best score of {best_score}.
-</objective>
+Check `config.json` field `stack.detected`. For each detected library, use the Context7 MCP tools to fetch relevant documentation:
 
-
-
--
--
-
-
-- .harness-evolver/harnesses/{best_version}/scores.json
-- .harness-evolver/harnesses/{best_version}/proposal.md
-</files_to_read>
+```
+For each library in stack.detected:
+1. resolve-library-id with the context7_id
+2. get-library-docs with a query relevant to the current failure modes
+3. Save output to .harness-evolver/context7_docs.md (append each library's docs)
+```
 
-
-Create directory .harness-evolver/harnesses/{version}/ containing:
-- harness.py (the improved harness)
-- config.json (parameters, copy from parent if unchanged)
-- proposal.md (reasoning, must start with "Based on v{PARENT}")
-</output>
+This runs ONCE per iteration, not per library. Focus on the library most relevant to the current failures.
 
-
-
-
-
-
+If Context7 MCP is not available, skip silently.
+
+### 2. Propose
+
+Dispatch a subagent using the **Agent tool** with `subagent_type: "harness-evolver-proposer"`.
+
+The prompt MUST be the full XML block below. The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered in step 1.5.
+
+```
+Agent(
+subagent_type: "harness-evolver-proposer",
+description: "Propose harness {version}",
+prompt: |
+<objective>
+Propose harness version {version} that improves on the current best score of {best_score}.
+</objective>
+
+<files_to_read>
+- .harness-evolver/summary.json
+- .harness-evolver/PROPOSER_HISTORY.md
+- .harness-evolver/config.json
+- .harness-evolver/baseline/harness.py
+- .harness-evolver/harnesses/{best_version}/harness.py
+- .harness-evolver/harnesses/{best_version}/scores.json
+- .harness-evolver/harnesses/{best_version}/proposal.md
+- .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
+- .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
+- .harness-evolver/context7_docs.md (if exists — current library documentation)
+- .harness-evolver/architecture.json (if exists — architect topology recommendation)
+</files_to_read>
+
+<output>
+Create directory .harness-evolver/harnesses/{version}/ containing:
+- harness.py (the improved harness)
+- config.json (parameters, copy from parent if unchanged)
+- proposal.md (reasoning, must start with "Based on v{PARENT}")
+</output>
+
+<success_criteria>
+- harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
+- proposal.md documents evidence-based reasoning
+- Changes are motivated by trace analysis (LangSmith data if available), not guesswork
+- If context7_docs.md was provided, API usage must match current documentation
+</success_criteria>
+)
 ```
 
-Wait for
+Wait for `## PROPOSAL COMPLETE` in the response. The subagent gets a fresh context window and loads the agent definition from `~/.claude/agents/harness-evolver-proposer.md`.
 
 ### 3. Validate
 
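The gating logic that step 1.5 adds in the hunk above (LangSmith enabled in `config.json`, `langsmith-cli` on the PATH, at least one iteration already run) can be sketched in Python. This is a minimal illustration, not code from the package: the function names `langsmith_enabled`, `should_gather_langsmith`, `next_version`, and `load_json` are hypothetical, and only the `eval.langsmith.enabled` lookup and the `v{iterations+1:03d}` label are taken from the diff itself.

```python
import json
import shutil
from pathlib import Path


def load_json(path: Path, default):
    """Read a JSON file, falling back to `default` if it is missing or malformed."""
    try:
        return json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return default


def langsmith_enabled(config: dict) -> bool:
    """Same lookup as the shell one-liner: eval.langsmith.enabled, default False."""
    return bool(config.get("eval", {}).get("langsmith", {}).get("enabled", False))


def should_gather_langsmith(config: dict, iterations: int,
                            cli: str = "langsmith-cli") -> bool:
    """All three conditions named in step 1.5 must hold (hypothetical helper)."""
    cli_available = shutil.which(cli) is not None
    return langsmith_enabled(config) and cli_available and iterations >= 1


def next_version(summary: dict) -> str:
    """Version label as computed by the python3 one-liner in the skill."""
    return f"v{summary['iterations'] + 1:03d}"


if __name__ == "__main__":
    config = load_json(Path(".harness-evolver/config.json"), {})
    summary = load_json(Path(".harness-evolver/summary.json"), {"iterations": 0})
    print(next_version(summary), should_gather_langsmith(config, summary["iterations"]))
```

Keeping the three conditions in one predicate makes the "skip silently" path explicit: when it returns false, the orchestrator would simply not create the LangSmith files, and the proposer's "(if exists)" entries handle their absence.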
@@ -106,26 +152,78 @@ python3 $TOOLS/state.py update \
 
 Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
 
-### 6.5.
+### 6.5. Auto-trigger Critic (on eval gaming)
 
-
+Read `summary.json` and check:
 - Did the score jump >0.3 from parent version?
 - Did we reach 1.0 in fewer than 3 total iterations?
 
-If
+If EITHER is true, **AUTO-SPAWN the critic agent** (do not just suggest — actually spawn it):
 
-
-
+```bash
+python3 $TOOLS/evaluate.py run \
+--harness .harness-evolver/harnesses/{version}/harness.py \
+--tasks-dir .harness-evolver/eval/tasks/ \
+--eval .harness-evolver/eval/eval.py \
+--traces-dir /tmp/critic-check/ \
+--scores /tmp/critic-check-scores.json \
+--timeout 60
+```
+
+Dispatch critic subagent using the **Agent tool** with `subagent_type: "harness-evolver-critic"`:
+
+```
+Agent(
+subagent_type: "harness-evolver-critic",
+description: "Critic: analyze eval quality",
+prompt: |
+<objective>
+EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
+Analyze the eval quality and propose a stricter eval.
+</objective>
+
+<files_to_read>
+- .harness-evolver/eval/eval.py
+- .harness-evolver/summary.json
+- .harness-evolver/harnesses/{version}/scores.json
+- .harness-evolver/harnesses/{version}/harness.py
+- .harness-evolver/harnesses/{version}/proposal.md
+- .harness-evolver/config.json
+- .harness-evolver/langsmith_stats.json (if exists)
+</files_to_read>
+
+<output>
+Write:
+- .harness-evolver/critic_report.md
+- .harness-evolver/eval/eval_improved.py (if weaknesses found)
+</output>
+
+<success_criteria>
+- Identifies specific weaknesses in eval.py with task/output examples
+- If gaming detected, shows exact tasks that expose the weakness
+- Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+- Re-scores the best version with improved eval to show the difference
+</success_criteria>
+)
+```
+
+Wait for `## CRITIC REPORT COMPLETE`.
 
-If
+If critic wrote `eval_improved.py`:
+- Re-score the best harness with the improved eval
+- Show the score difference (e.g., "Current eval: 1.0. Improved eval: 0.45")
+- **AUTO-ADOPT the improved eval**: copy `eval_improved.py` to `eval/eval.py`
+- Re-run baseline with new eval and update `summary.json`
+- Print: "Eval upgraded. Resuming evolution with stricter eval."
+- **Continue the loop** with the new eval
 
-
-
-
+If critic did NOT write `eval_improved.py` (eval is fine):
+- Print the critic's assessment
+- Continue the loop normally
 
 ### 7. Auto-trigger Architect (on stagnation or regression)
 
-Check if the architect should be auto-spawned
+Check if the architect should be auto-spawned:
 - **Stagnation**: 3 consecutive iterations within 1% of each other
 - **Regression**: score dropped below parent score (even once)
 
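The two auto-trigger rules in this hunk (critic on a score jump above 0.3 or a 1.0 score in fewer than 3 iterations; architect on stagnation or regression) reduce to small predicates. The sketch below is illustrative only. The function names are hypothetical, and "within 1% of each other" is interpreted here as an absolute spread of at most 0.01 on a 0 to 1 score scale, which is an assumption the skill text does not settle.

```python
from typing import Sequence


def critic_should_run(parent_score: float, score: float, total_iterations: int) -> bool:
    """Step 6.5: either suspicious pattern triggers the critic."""
    jumped = (score - parent_score) > 0.3
    fast_perfect = score >= 1.0 and total_iterations < 3
    return jumped or fast_perfect


def architect_should_run(scores: Sequence[float], parent_score: float,
                         tolerance: float = 0.01) -> bool:
    """Step 7: stagnation (last 3 scores within `tolerance`) or regression.

    `scores` is the per-iteration score history, newest last. Treating the
    1% band as an absolute spread is an assumption of this sketch.
    """
    regressed = bool(scores) and scores[-1] < parent_score
    last3 = list(scores[-3:])
    stagnated = len(last3) == 3 and (max(last3) - min(last3)) <= tolerance
    return regressed or stagnated


if __name__ == "__main__":
    print(critic_should_run(0.4, 0.8, 5))                    # jump of 0.4
    print(architect_should_run([0.70, 0.702, 0.705], 0.70))  # three flat scores
```

If the project instead means a relative 1% band, only the `stagnated` expression would change, for example by dividing the spread by the largest of the three scores.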
@@ -141,57 +239,59 @@ python3 $TOOLS/analyze_architecture.py \
 -o .harness-evolver/architecture_signals.json
 ```
 
-
-
-```xml
-<objective>
-The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
-Analyze the harness architecture and recommend a topology change.
-</objective>
-
-<files_to_read>
-- .harness-evolver/architecture_signals.json
-- .harness-evolver/summary.json
-- .harness-evolver/PROPOSER_HISTORY.md
-- .harness-evolver/config.json
-- .harness-evolver/harnesses/{best_version}/harness.py
-- .harness-evolver/harnesses/{best_version}/scores.json
-</files_to_read>
+Dispatch architect subagent using the **Agent tool** with `subagent_type: "harness-evolver-architect"`:
 
-
-
-
-
-
-
-
-
-
-
-
+```
+Agent(
+subagent_type: "harness-evolver-architect",
+description: "Architect: analyze topology after {stagnation/regression}",
+prompt: |
+<objective>
+The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
+Analyze the harness architecture and recommend a topology change.
+</objective>
+
+<files_to_read>
+- .harness-evolver/architecture_signals.json
+- .harness-evolver/summary.json
+- .harness-evolver/PROPOSER_HISTORY.md
+- .harness-evolver/config.json
+- .harness-evolver/harnesses/{best_version}/harness.py
+- .harness-evolver/harnesses/{best_version}/scores.json
+- .harness-evolver/context7_docs.md (if exists)
+</files_to_read>
+
+<output>
+Write:
+- .harness-evolver/architecture.json (structured recommendation)
+- .harness-evolver/architecture.md (human-readable analysis)
+</output>
+
+<success_criteria>
+- Recommendation includes concrete migration steps
+- Each step is implementable in one proposer iteration
+- Considers detected stack and available API keys
+</success_criteria>
+)
 ```
 
 Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
 
-
-
-> Architect recommends: {current} → {recommended} ({confidence} confidence)
-> Migration path: {N} steps. Continuing evolution with architecture guidance.
-
-Then **continue the loop** — the proposer will read `architecture.json` in the next iteration.
+Report: `Architect recommends: {current} → {recommended} ({confidence} confidence)`
 
-
+Then **continue the loop** — the proposer reads `architecture.json` in the next iteration.
 
 ### 8. Check Stop Conditions
 
 - **Target**: `combined_score >= target_score` → stop
 - **N reached**: done
-- **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop
+- **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop
 
 ## When Loop Ends — Final Report
 
 - Best version and score
 - Improvement over baseline (absolute and %)
 - Total iterations run
+- Whether critic was triggered and eval was upgraded
 - Whether architect was triggered and what it recommended
 - Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."
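The stop conditions in step 8 and the "improvement over baseline (absolute and %)" line of the final report can be expressed as two small functions. This is a sketch under stated assumptions: `stop_reason` and `improvement` are hypothetical names, the check order follows the list order in the skill, and `stale_since_architect` (iterations without improvement since the architect ran, or `None` if it never ran) is a bookkeeping value invented for illustration.

```python
from typing import Optional, Tuple


def stop_reason(combined_score: float, target_score: float, iteration: int,
                max_iterations: int,
                stale_since_architect: Optional[int]) -> Optional[str]:
    """Return which stop condition fired, or None to keep looping."""
    if combined_score >= target_score:
        return "target"
    if iteration >= max_iterations:
        return "n_reached"
    if stale_since_architect is not None and stale_since_architect >= 3:
        return "stagnation_post_architect"
    return None


def improvement(baseline: float, best: float) -> Tuple[float, Optional[float]]:
    """Absolute and percentage gain; the percentage is None for a zero baseline."""
    absolute = best - baseline
    percent = None if baseline == 0 else 100.0 * absolute / baseline
    return absolute, percent


if __name__ == "__main__":
    print(stop_reason(0.95, 0.9, 2, 10, None))
    print(improvement(0.5, 0.75))
```

Returning `None` for the percentage when the baseline is zero avoids a division error in the report; how the real skill words that case is not specified in the diff.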