harness-evolver 2.1.0 → 2.3.0
- package/package.json +1 -1
- package/skills/architect/SKILL.md +2 -10
- package/skills/critic/SKILL.md +2 -10
- package/skills/evolve/SKILL.md +80 -54
- package/skills/init/SKILL.md +2 -9
package/skills/architect/SKILL.md
CHANGED
@@ -48,21 +48,13 @@ python3 $TOOLS/analyze_architecture.py \
   -o .harness-evolver/architecture_signals.json
 ```
 
-3.
-```bash
-cat ~/.claude/agents/harness-evolver-architect.md
-```
-
-4. Dispatch using the Agent tool — include the agent definition in the prompt:
+3. Dispatch using the Agent tool with `subagent_type: "harness-evolver-architect"`:
 
 ```
 Agent(
+  subagent_type: "harness-evolver-architect",
   description: "Architect: topology analysis",
   prompt: |
-    <agent_instructions>
-    {paste the FULL content of harness-evolver-architect.md here}
-    </agent_instructions>
-
     <objective>
     Analyze the harness architecture and recommend the optimal multi-agent topology.
     {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
package/skills/critic/SKILL.md
CHANGED
@@ -22,21 +22,13 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo
 
 1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).
 
-2.
-```bash
-cat ~/.claude/agents/harness-evolver-critic.md
-```
-
-3. Dispatch using the Agent tool — include the agent definition in the prompt:
+2. Dispatch using the Agent tool with `subagent_type: "harness-evolver-critic"`:
 
 ```
 Agent(
+  subagent_type: "harness-evolver-critic",
   description: "Critic: analyze eval quality",
   prompt: |
-    <agent_instructions>
-    {paste the FULL content of harness-evolver-critic.md here}
-    </agent_instructions>
-
     <objective>
     Analyze eval quality for this harness evolution project.
     The best version is {version} with score {score} achieved in {iterations} iteration(s).
package/skills/evolve/SKILL.md
CHANGED
@@ -52,49 +52,105 @@ LS_PROJECT=$(langsmith-cli --json projects list --name-pattern "harness-evolver*
 
 If `LS_PROJECT` is empty, langsmith-cli is not available or no projects exist — skip to step 2.
 
-**Step 2: Gather traces from the discovered project**
+**Step 2: Gather raw traces from the discovered project**
 
 ```bash
 if [ -n "$LS_PROJECT" ]; then
-  langsmith-cli --json runs list --project "$LS_PROJECT" --
-  langsmith-cli --json runs list --project "$LS_PROJECT" --fields id,name,inputs,outputs,latency_ms,total_tokens --limit 20 > .harness-evolver/langsmith_runs.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_runs.json
+  langsmith-cli --json runs list --project "$LS_PROJECT" --recent --fields id,name,inputs,outputs,error,total_tokens --limit 30 > /tmp/langsmith_raw.json 2>/dev/null || echo "[]" > /tmp/langsmith_raw.json
   langsmith-cli --json runs stats --project "$LS_PROJECT" > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
   echo "$LS_PROJECT" > .harness-evolver/langsmith_project.txt
 else
-  echo "[]" >
+  echo "[]" > /tmp/langsmith_raw.json
   echo "{}" > .harness-evolver/langsmith_stats.json
 fi
 ```
 
-
+**Step 3: Process raw LangSmith data into a readable format for proposers**
+
+The raw langsmith data has LangChain-serialized messages that are hard to read. Process it into a clean summary:
+
+```bash
+python3 -c "
+import json, sys
+
+raw = json.load(open('/tmp/langsmith_raw.json'))
+if not raw:
+    json.dump([], open('.harness-evolver/langsmith_runs.json', 'w'))
+    sys.exit(0)
+
+clean = []
+for r in raw:
+    entry = {'name': r.get('name', '?'), 'tokens': r.get('total_tokens', 0), 'error': r.get('error')}
+
+    # Extract readable prompt from LangChain serialized inputs
+    inputs = r.get('inputs', {})
+    if isinstance(inputs, dict) and 'messages' in inputs:
+        msgs = inputs['messages']
+        for msg_group in (msgs if isinstance(msgs, list) else [msgs]):
+            for msg in (msg_group if isinstance(msg_group, list) else [msg_group]):
+                if isinstance(msg, dict):
+                    kwargs = msg.get('kwargs', msg)
+                    content = kwargs.get('content', '')
+                    msg_type = msg.get('id', ['','','',''])[3] if isinstance(msg.get('id'), list) else 'unknown'
+                    if 'Human' in str(msg_type) or 'user' in str(msg_type).lower():
+                        entry['user_message'] = str(content)[:300]
+                    elif 'System' in str(msg_type):
+                        entry['system_prompt_preview'] = str(content)[:200]
+
+    # Extract readable output
+    outputs = r.get('outputs', {})
+    if isinstance(outputs, dict) and 'generations' in outputs:
+        gens = outputs['generations']
+        if gens and isinstance(gens, list) and gens[0]:
+            gen = gens[0][0] if isinstance(gens[0], list) else gens[0]
+            if isinstance(gen, dict):
+                msg = gen.get('message', gen)
+                if isinstance(msg, dict):
+                    kwargs = msg.get('kwargs', msg)
+                    entry['llm_response'] = str(kwargs.get('content', ''))[:300]
+
+    clean.append(entry)
+
+json.dump(clean, open('.harness-evolver/langsmith_runs.json', 'w'), indent=2, ensure_ascii=False)
+print(f'Processed {len(clean)} LangSmith runs into readable format')
+" 2>/dev/null || echo "[]" > .harness-evolver/langsmith_runs.json
+```
+
+The resulting `langsmith_runs.json` has clean, readable entries:
+```json
+[
+  {
+    "name": "ChatGoogleGenerativeAI",
+    "tokens": 1332,
+    "error": null,
+    "user_message": "Analise este texto: Bom dia pessoal...",
+    "system_prompt_preview": "Você é um moderador de conteúdo...",
+    "llm_response": "{\"categories\": [\"safe\"], \"severity\": \"safe\"...}"
+  }
+]
+```
+
+These files are included in the proposer's `<files_to_read>` so it has readable trace data for diagnosis.
 
 ### 2. Propose (3 parallel candidates)
 
 Spawn 3 proposer agents IN PARALLEL, each with a different evolutionary strategy.
 This follows the DGM/AlphaEvolve pattern: exploit + explore + crossover.
 
-
-```bash
-cat ~/.claude/agents/harness-evolver-proposer.md
-```
-
-Then determine parents for each strategy:
+Determine parents for each strategy:
 - **Exploiter parent**: current best version (from summary.json `best.version`)
 - **Explorer parent**: a non-best version with low offspring count (read summary.json history, pick one that scored >0 but is NOT the best and has NOT been parent to many children)
 - **Crossover parents**: best version + a different high-scorer from a different lineage
 
-Spawn all 3 using the Agent tool
+Spawn all 3 using the Agent tool with `subagent_type: "harness-evolver-proposer"`. The first 2 use `run_in_background: true`, the 3rd blocks:
 
 **Candidate A (Exploiter)** — `run_in_background: true`:
 ```
 Agent(
+  subagent_type: "harness-evolver-proposer",
   description: "Proposer A (exploit): targeted fix for {version}",
   run_in_background: true,
   prompt: |
-    <agent_instructions>
-    {FULL content of harness-evolver-proposer.md}
-    </agent_instructions>
-
     <strategy>
     APPROACH: exploitation
     You are the EXPLOITER. Make the SMALLEST, most targeted change that fixes
@@ -130,13 +186,10 @@ Agent(
 **Candidate B (Explorer)** — `run_in_background: true`:
 ```
 Agent(
+  subagent_type: "harness-evolver-proposer",
   description: "Proposer B (explore): bold change from {explorer_parent}",
   run_in_background: true,
   prompt: |
-    <agent_instructions>
-    {FULL content of harness-evolver-proposer.md}
-    </agent_instructions>
-
     <strategy>
     APPROACH: exploration
     You are the EXPLORER. Try a FUNDAMENTALLY DIFFERENT approach.
@@ -172,12 +225,9 @@ Agent(
 **Candidate C (Crossover)** — blocks (last one):
 ```
 Agent(
+  subagent_type: "harness-evolver-proposer",
   description: "Proposer C (crossover): combine {parent_a} + {parent_b}",
   prompt: |
-    <agent_instructions>
-    {FULL content of harness-evolver-proposer.md}
-    </agent_instructions>
-
     <strategy>
     APPROACH: crossover
     You are the CROSSOVER agent. Combine the STRENGTHS of two different versions:
@@ -269,21 +319,13 @@ python3 $TOOLS/evaluate.py run \
 
 For each evaluated candidate, read its scores.json. If `eval_type` is `"pending-judge"` (combined_score == -1), the eval was a passthrough and needs judge scoring.
 
-
-```bash
-cat ~/.claude/agents/harness-evolver-judge.md
-```
-
-Spawn judge subagent for EACH candidate that needs judging:
+Spawn judge subagent with `subagent_type: "harness-evolver-judge"` for EACH candidate that needs judging:
 
 ```
 Agent(
+  subagent_type: "harness-evolver-judge",
   description: "Judge: score {version}{suffix} outputs",
   prompt: |
-    <agent_instructions>
-    {FULL content of harness-evolver-judge.md}
-    </agent_instructions>
-
     <objective>
     Score the outputs of harness version {version}{suffix} across all {N} tasks.
     </objective>
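The pending-judge gating rule above can be sketched as a small check. The `eval_type` and `combined_score` field names come straight from the text; writing them to a temp file here is only for illustration:

```python
import json, os, tempfile

def needs_judge(scores_path):
    """True when a candidate's eval was a passthrough that still needs judge
    scoring: eval_type == "pending-judge" (combined_score == -1)."""
    with open(scores_path) as f:
        scores = json.load(f)
    return scores.get("eval_type") == "pending-judge" and scores.get("combined_score") == -1

# Hypothetical scores.json for a passthrough eval:
path = os.path.join(tempfile.mkdtemp(), "scores.json")
with open(path, "w") as f:
    json.dump({"eval_type": "pending-judge", "combined_score": -1}, f)
print(needs_judge(path))  # True
```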
@@ -354,21 +396,13 @@ python3 $TOOLS/evaluate.py run \
   --timeout 60
 ```
 
-
-```bash
-cat ~/.claude/agents/harness-evolver-critic.md
-```
-
-Then dispatch:
+Dispatch the critic agent:
 
 ```
 Agent(
+  subagent_type: "harness-evolver-critic",
   description: "Critic: analyze eval quality",
   prompt: |
-    <agent_instructions>
-    {paste the FULL content of harness-evolver-critic.md here}
-    </agent_instructions>
-
     <objective>
     EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
     Analyze the eval quality and propose a stricter eval.
@@ -431,21 +465,13 @@ python3 $TOOLS/analyze_architecture.py \
   -o .harness-evolver/architecture_signals.json
 ```
 
-
-```bash
-cat ~/.claude/agents/harness-evolver-architect.md
-```
-
-Then dispatch:
+Dispatch the architect agent:
 
 ```
 Agent(
+  subagent_type: "harness-evolver-architect",
   description: "Architect: analyze topology after {stagnation/regression}",
   prompt: |
-    <agent_instructions>
-    {paste the FULL content of harness-evolver-architect.md here}
-    </agent_instructions>
-
     <objective>
     The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
     Analyze the harness architecture and recommend a topology change.
package/skills/init/SKILL.md
CHANGED
@@ -49,19 +49,12 @@ If NO eval exists:
 **Tasks** (`tasks/`): If test tasks exist, use them.
 
 If NO tasks exist:
--
-  ```bash
-  cat ~/.claude/agents/harness-evolver-testgen.md
-  ```
-- Spawn testgen subagent:
+- Spawn testgen subagent with `subagent_type: "harness-evolver-testgen"`:
 
 ```
 Agent(
+  subagent_type: "harness-evolver-testgen",
   description: "TestGen: generate test cases for this project",
   prompt: |
-    <agent_instructions>
-    {FULL content of harness-evolver-testgen.md}
-    </agent_instructions>
-
     <objective>
     Generate 30 diverse test cases for this project. Write them to tasks/ directory.
     </objective>