harness-evolver 1.3.0 → 1.4.0
This diff shows the changes between two publicly released versions of the package, as they appear in their respective public registries. It is provided for informational purposes only.
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +107 -22
package/package.json: CHANGED
package/skills/evolve/SKILL.md: CHANGED
@@ -5,7 +5,7 @@ argument-hint: "[--iterations N]"
 allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
 ---
 
-# /harness-evolve
+# /harness-evolver:evolve
 
 Run the autonomous propose-evaluate-iterate loop.
 
@@ -34,9 +34,45 @@ For each iteration:
 python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
 ```
 
+### 1.5. Gather Diagnostic Context (LangSmith + Context7)
+
+**This step is MANDATORY before every propose.** The orchestrator gathers data so the proposer receives it as files.
+
+**LangSmith (if enabled):**
+
+Check if LangSmith is enabled and langsmith-cli is available:
+```bash
+cat .harness-evolver/config.json | python3 -c "import sys,json; print(json.load(sys.stdin).get('eval',{}).get('langsmith',{}).get('enabled',False))"
+which langsmith-cli 2>/dev/null
+```
+
+If BOTH are true AND at least one iteration has run, gather LangSmith data:
+```bash
+langsmith-cli --json runs list --project harness-evolver-{best_version} --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
+
+langsmith-cli --json runs stats --project harness-evolver-{best_version} > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
+```
+
+**Context7 (if available):**
+
+Check `config.json` field `stack.detected`. For each detected library, use the Context7 MCP tools to fetch relevant documentation:
+
+```
+For each library in stack.detected:
+1. resolve-library-id with the context7_id
+2. get-library-docs with a query relevant to the current failure modes
+3. Save output to .harness-evolver/context7_docs.md (append each library's docs)
+```
+
+This runs ONCE per iteration, not per library. Focus on the library most relevant to the current failures.
+
+If Context7 MCP is not available, skip silently.
+
 ### 2. Propose
 
-Spawn the `harness-evolver-proposer` agent with a structured prompt
+Spawn the `harness-evolver-proposer` agent with a structured prompt.
+
+The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered:
 
 ```xml
 <objective>
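The LangSmith gate in step 1.5 (config flag AND CLI on PATH AND at least one completed iteration) can be sketched in Python. This is a minimal sketch, assuming the field names used elsewhere in this skill (`eval.langsmith.enabled` in `config.json`, `iterations` in `summary.json`); `shutil.which` stands in for the `which` shell check.

```python
import json
import shutil

def langsmith_gate(config_path: str, summary_path: str) -> bool:
    """Return True only when LangSmith diagnostics should be gathered:
    the config enables it, langsmith-cli is on PATH, and at least one
    iteration has already run."""
    with open(config_path) as f:
        config = json.load(f)
    enabled = config.get("eval", {}).get("langsmith", {}).get("enabled", False)
    cli_present = shutil.which("langsmith-cli") is not None
    with open(summary_path) as f:
        iterations = json.load(f).get("iterations", 0)
    return bool(enabled) and cli_present and iterations >= 1
```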
@@ -51,6 +87,10 @@ Propose harness version {version} that improves on the current best score of {be
 - .harness-evolver/harnesses/{best_version}/harness.py
 - .harness-evolver/harnesses/{best_version}/scores.json
 - .harness-evolver/harnesses/{best_version}/proposal.md
+- .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
+- .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
+- .harness-evolver/context7_docs.md (if exists — current library documentation)
+- .harness-evolver/architecture.json (if exists — architect topology recommendation)
 </files_to_read>
 
 <output>
@@ -63,7 +103,8 @@ Create directory .harness-evolver/harnesses/{version}/ containing:
 <success_criteria>
 - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
 - proposal.md documents evidence-based reasoning
-- Changes are motivated by trace analysis, not guesswork
+- Changes are motivated by trace analysis (LangSmith data if available), not guesswork
+- If context7_docs.md was provided, API usage must match current documentation
 </success_criteria>
 ```
 
@@ -106,26 +147,73 @@ python3 $TOOLS/state.py update \
 
 Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
 
-### 6.5.
+### 6.5. Auto-trigger Critic (on eval gaming)
 
-
+Read `summary.json` and check:
 - Did the score jump >0.3 from parent version?
 - Did we reach 1.0 in fewer than 3 total iterations?
 
-If
+If EITHER is true, **AUTO-SPAWN the critic agent** (do not just suggest — actually spawn it):
+
+```bash
+python3 $TOOLS/evaluate.py run \
+  --harness .harness-evolver/harnesses/{version}/harness.py \
+  --tasks-dir .harness-evolver/eval/tasks/ \
+  --eval .harness-evolver/eval/eval.py \
+  --traces-dir /tmp/critic-check/ \
+  --scores /tmp/critic-check-scores.json \
+  --timeout 60
+```
+
+Spawn the `harness-evolver-critic` agent:
 
-
->
+```xml
+<objective>
+EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
+Analyze the eval quality and propose a stricter eval.
+</objective>
 
-
+<files_to_read>
+- .harness-evolver/eval/eval.py
+- .harness-evolver/summary.json
+- .harness-evolver/harnesses/{version}/scores.json
+- .harness-evolver/harnesses/{version}/harness.py
+- .harness-evolver/harnesses/{version}/proposal.md
+- .harness-evolver/config.json
+- .harness-evolver/langsmith_stats.json (if exists)
+</files_to_read>
 
->
-
-
+<output>
+Write:
+- .harness-evolver/critic_report.md
+- .harness-evolver/eval/eval_improved.py (if weaknesses found)
+</output>
+
+<success_criteria>
+- Identifies specific weaknesses in eval.py with task/output examples
+- If gaming detected, shows exact tasks that expose the weakness
+- Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+- Re-scores the best version with improved eval to show the difference
+</success_criteria>
+```
+
+Wait for `## CRITIC REPORT COMPLETE`.
+
+If critic wrote `eval_improved.py`:
+- Re-score the best harness with the improved eval
+- Show the score difference (e.g., "Current eval: 1.0. Improved eval: 0.45")
+- **AUTO-ADOPT the improved eval**: copy `eval_improved.py` to `eval/eval.py`
+- Re-run baseline with new eval and update `summary.json`
+- Print: "Eval upgraded. Resuming evolution with stricter eval."
+- **Continue the loop** with the new eval
+
+If critic did NOT write `eval_improved.py` (eval is fine):
+- Print the critic's assessment
+- Continue the loop normally
 
 ### 7. Auto-trigger Architect (on stagnation or regression)
 
-Check if the architect should be auto-spawned
+Check if the architect should be auto-spawned:
 - **Stagnation**: 3 consecutive iterations within 1% of each other
 - **Regression**: score dropped below parent score (even once)
 
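The two gaming heuristics added in step 6.5 can be expressed as a small predicate. A sketch under the thresholds stated in the diff (score jump > 0.3, or a 1.0 in fewer than 3 total iterations); the parameter names are illustrative.

```python
def should_trigger_critic(score: float, parent_score: float,
                          total_iterations: int) -> bool:
    """Auto-spawn the critic when either gaming heuristic fires:
    a score jump of more than 0.3 over the parent version, or a
    perfect 1.0 reached in fewer than 3 total iterations."""
    jumped = (score - parent_score) > 0.3
    too_fast = score >= 1.0 and total_iterations < 3
    return jumped or too_fast
```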
@@ -141,7 +229,7 @@ python3 $TOOLS/analyze_architecture.py \
   -o .harness-evolver/architecture_signals.json
 ```
 
-
+Spawn the `harness-evolver-architect` agent:
 
 ```xml
 <objective>
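The stagnation and regression conditions that gate this spawn (step 7: three consecutive iterations within 1% of each other, or any score below the parent's) can be sketched as a predicate. Names and the relative-spread interpretation of "within 1%" are assumptions.

```python
def should_trigger_architect(history: list[float], parent_score: float) -> bool:
    """history: chronological combined scores, newest last.
    Stagnation: the last 3 scores all lie within 1% of each other.
    Regression: the newest score dropped below its parent's score."""
    regression = bool(history) and history[-1] < parent_score
    stagnation = False
    if len(history) >= 3:
        recent = history[-3:]
        # relative spread of the last three scores
        stagnation = (max(recent) - min(recent)) <= 0.01 * max(recent)
    return regression or stagnation
```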
@@ -156,6 +244,7 @@ Analyze the harness architecture and recommend a topology change.
 - .harness-evolver/config.json
 - .harness-evolver/harnesses/{best_version}/harness.py
 - .harness-evolver/harnesses/{best_version}/scores.json
+- .harness-evolver/context7_docs.md (if exists)
 </files_to_read>
 
 <output>
@@ -173,25 +262,21 @@ Write:
 
 Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
 
-
-
-> Architect recommends: {current} → {recommended} ({confidence} confidence)
-> Migration path: {N} steps. Continuing evolution with architecture guidance.
-
-Then **continue the loop** — the proposer will read `architecture.json` in the next iteration.
+Report: `Architect recommends: {current} → {recommended} ({confidence} confidence)`
 
-
+Then **continue the loop** — the proposer reads `architecture.json` in the next iteration.
 
 ### 8. Check Stop Conditions
 
 - **Target**: `combined_score >= target_score` → stop
 - **N reached**: done
-- **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop
+- **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop
 
 ## When Loop Ends — Final Report
 
 - Best version and score
 - Improvement over baseline (absolute and %)
 - Total iterations run
+- Whether critic was triggered and eval was upgraded
+- Whether architect was triggered and what it recommended
 - Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."
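The three stop conditions in step 8 can be folded into one check that returns the reason for stopping, or None to keep iterating. A sketch; parameter names are illustrative, and "post-architect stagnation" is modeled as a counter of non-improving iterations since the architect ran.

```python
def should_stop(best_score: float, target_score: float,
                iteration: int, max_iterations: int,
                architect_ran: bool,
                iters_since_architect_improvement: int):
    """Return the stop reason, or None if the loop should continue."""
    if best_score >= target_score:
        return "target"          # target_score reached
    if iteration >= max_iterations:
        return "budget"          # N iterations reached
    if architect_ran and iters_since_architect_improvement >= 3:
        return "post-architect stagnation"
    return None
```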
|