harness-evolver 1.5.0 → 1.7.0
- package/agents/harness-evolver-architect.md +1 -0
- package/agents/harness-evolver-critic.md +1 -0
- package/agents/harness-evolver-proposer.md +19 -0
- package/package.json +1 -1
- package/skills/architect/SKILL.md +10 -2
- package/skills/critic/SKILL.md +10 -2
- package/skills/evolve/SKILL.md +40 -37
- package/tools/init.py +19 -3
package/agents/harness-evolver-proposer.md
CHANGED
@@ -4,6 +4,7 @@ description: |
 Use this agent when the evolve skill needs to propose a new harness candidate.
 Navigates the .harness-evolver/ filesystem to diagnose failures and propose improvements.
 tools: Read, Write, Edit, Bash, Glob, Grep
+color: green
 permissionMode: acceptEdits
 ---
 
@@ -12,6 +13,24 @@ permissionMode: acceptEdits
 If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
 every file listed there before performing any other actions. These files are your context.
 
+## Context7 — Enrich Your Knowledge
+
+You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
+
+**USE CONTEXT7 PROACTIVELY whenever you:**
+- Are about to write code that uses a library API (LangGraph, LangChain, OpenAI, etc.)
+- Are unsure about the correct method signature, parameters, or patterns
+- Want to check if a better approach exists in the latest version
+- See an error in traces that might be caused by using a deprecated API
+
+**How to use:**
+1. `resolve-library-id` with the library name (e.g., "langchain", "langgraph")
+2. `get-library-docs` with a specific query (e.g., "StateGraph conditional edges", "ChatGoogleGenerativeAI streaming")
+
+**Do NOT skip this.** Your training data may be outdated. Context7 gives you the current docs. Even if you're confident about an API, a quick check takes seconds and prevents proposing deprecated patterns.
+
+If Context7 is not available, proceed with model knowledge but note in `proposal.md`: "API not verified against current docs."
+
 ## Return Protocol
 
 When done, end your response with:
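The fallback rule added to the proposer instructions (record a note in `proposal.md` when Context7 is unavailable) is carried out by the agent itself, not by package code. As a rough illustration of the intended effect on the file, here is a minimal Python sketch. The function name `mark_api_unverified` and the append-once behavior are assumptions made for this example; only the note wording comes from the diff.

```python
from pathlib import Path

# Exact wording taken from the agent instructions in the diff.
UNVERIFIED_NOTE = "API not verified against current docs."


def mark_api_unverified(proposal_path: Path) -> bool:
    """Append the note to proposal.md once; return True if the file changed.

    Hypothetical helper: the package does not ship this function.
    """
    text = proposal_path.read_text(encoding="utf-8") if proposal_path.exists() else ""
    if UNVERIFIED_NOTE in text:
        return False  # already recorded, do not duplicate the note
    # Keep the note on its own line even if the file lacks a trailing newline.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    proposal_path.write_text(f"{text}{separator}{UNVERIFIED_NOTE}\n", encoding="utf-8")
    return True
```

Appending only once keeps repeated proposer runs from stacking identical notes in the same proposal.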
package/package.json
CHANGED
package/skills/architect/SKILL.md
CHANGED
@@ -48,13 +48,21 @@ python3 $TOOLS/analyze_architecture.py \
 -o .harness-evolver/architecture_signals.json
 ```
 
-3.
+3. Read the architect agent definition:
+```bash
+cat ~/.claude/agents/harness-evolver-architect.md
+```
+
+4. Dispatch using the Agent tool — include the agent definition in the prompt:
 
 ```
 Agent(
-subagent_type: "harness-evolver-architect",
 description: "Architect: topology analysis",
 prompt: |
+<agent_instructions>
+{paste the FULL content of harness-evolver-architect.md here}
+</agent_instructions>
+
 <objective>
 Analyze the harness architecture and recommend the optimal multi-agent topology.
 {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
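The hunk above replaces the `subagent_type` lookup with an inlined agent definition wrapped in `<agent_instructions>`. In the skill this is done by the orchestrating model, so the following Python is only a sketch of the resulting prompt layout. `build_agent_prompt` and its parameters are hypothetical names, and the exact blank-line spacing is an assumption.

```python
from pathlib import Path


def build_agent_prompt(agent_file: Path, task_body: str) -> str:
    """Wrap the full agent definition in <agent_instructions>, then append the task.

    Hypothetical helper illustrating the prompt layout shown in the diff.
    """
    definition = agent_file.read_text(encoding="utf-8").strip()
    return (
        "<agent_instructions>\n"
        f"{definition}\n"
        "</agent_instructions>\n\n"
        f"{task_body.strip()}\n"
    )
```

A caller would pass the agent file path (for example `~/.claude/agents/harness-evolver-architect.md`, expanded with `Path.expanduser()`) and the `<objective>` block as `task_body`.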
package/skills/critic/SKILL.md
CHANGED
@@ -22,13 +22,21 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo
 
 1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).
 
-2.
+2. Read the critic agent definition:
+```bash
+cat ~/.claude/agents/harness-evolver-critic.md
+```
+
+3. Dispatch using the Agent tool — include the agent definition in the prompt:
 
 ```
 Agent(
-subagent_type: "harness-evolver-critic",
 description: "Critic: analyze eval quality",
 prompt: |
+<agent_instructions>
+{paste the FULL content of harness-evolver-critic.md here}
+</agent_instructions>
+
 <objective>
 Analyze eval quality for this harness evolution project.
 The best version is {version} with score {score} achieved in {iterations} iteration(s).
package/skills/evolve/SKILL.md
CHANGED
@@ -34,51 +34,39 @@ For each iteration:
 python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
 ```
 
-### 1.5. Gather
+### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
 
-**
+**Run these commands unconditionally after EVERY evaluation** (including baseline). If langsmith-cli is not installed or there are no runs, the commands fail silently — that's fine. But you MUST attempt them.
 
-**LangSmith (if enabled):**
-
-Check if LangSmith is enabled and langsmith-cli is available:
-```bash
-cat .harness-evolver/config.json | python3 -c "import sys,json; print(json.load(sys.stdin).get('eval',{}).get('langsmith',{}).get('enabled',False))"
-which langsmith-cli 2>/dev/null
-```
-
-If BOTH are true AND at least one iteration has run, gather LangSmith data:
 ```bash
-langsmith-cli --json runs list --project harness-evolver-{
+langsmith-cli --json runs list --project harness-evolver-{last_evaluated_version} --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
 
-langsmith-cli --json runs stats --project harness-evolver-{
+langsmith-cli --json runs stats --project harness-evolver-{last_evaluated_version} > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
 ```
 
-
+For the first iteration, use `baseline` as the version. For subsequent iterations, use the latest evaluated version.
 
-
-
-```
-For each library in stack.detected:
-1. resolve-library-id with the context7_id
-2. get-library-docs with a query relevant to the current failure modes
-3. Save output to .harness-evolver/context7_docs.md (append each library's docs)
-```
-
-This runs ONCE per iteration, not per library. Focus on the library most relevant to the current failures.
-
-If Context7 MCP is not available, skip silently.
+These files are included in the proposer's `<files_to_read>` so it has real trace data for diagnosis.
 
 ### 2. Propose
 
-Dispatch a subagent using the **Agent tool
+Dispatch a subagent using the **Agent tool**.
 
-
+First, read the proposer agent definition to include in the prompt:
+```bash
+cat ~/.claude/agents/harness-evolver-proposer.md
+```
+
+Then dispatch the Agent with the agent definition + structured task:
 
 ```
 Agent(
-subagent_type: "harness-evolver-proposer",
 description: "Propose harness {version}",
 prompt: |
+<agent_instructions>
+{paste the FULL content of harness-evolver-proposer.md here}
+</agent_instructions>
+
 <objective>
 Propose harness version {version} that improves on the current best score of {best_score}.
 </objective>
@@ -93,7 +81,6 @@ Agent(
 - .harness-evolver/harnesses/{best_version}/proposal.md
 - .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
 - .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
-- .harness-evolver/context7_docs.md (if exists — current library documentation)
 - .harness-evolver/architecture.json (if exists — architect topology recommendation)
 </files_to_read>
 
@@ -107,13 +94,13 @@ Agent(
 <success_criteria>
 - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
 - proposal.md documents evidence-based reasoning
--
--
+- If proposing API changes, MUST use Context7 (resolve-library-id + get-library-docs) to verify current docs
+- Changes motivated by LangSmith trace data (in langsmith_diagnosis.json) when available
 </success_criteria>
 )
 ```
 
-Wait for `## PROPOSAL COMPLETE` in the response.
+Wait for `## PROPOSAL COMPLETE` in the response.
 
 ### 3. Validate
 
@@ -170,13 +157,21 @@ python3 $TOOLS/evaluate.py run \
 --timeout 60
 ```
 
-
+First read the critic agent definition:
+```bash
+cat ~/.claude/agents/harness-evolver-critic.md
+```
+
+Then dispatch:
 
 ```
 Agent(
-subagent_type: "harness-evolver-critic",
 description: "Critic: analyze eval quality",
 prompt: |
+<agent_instructions>
+{paste the FULL content of harness-evolver-critic.md here}
+</agent_instructions>
+
 <objective>
 EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
 Analyze the eval quality and propose a stricter eval.
@@ -239,13 +234,21 @@ python3 $TOOLS/analyze_architecture.py \
 -o .harness-evolver/architecture_signals.json
 ```
 
-
+First read the architect agent definition:
+```bash
+cat ~/.claude/agents/harness-evolver-architect.md
+```
+
+Then dispatch:
 
 ```
 Agent(
-subagent_type: "harness-evolver-architect",
 description: "Architect: analyze topology after {stagnation/regression}",
 prompt: |
+<agent_instructions>
+{paste the FULL content of harness-evolver-architect.md here}
+</agent_instructions>
+
 <objective>
 The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
 Analyze the harness architecture and recommend a topology change.
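The reworked step 1.5 in `evolve/SKILL.md` runs both `langsmith-cli` commands unconditionally and writes an empty JSON placeholder (`[]` or `{}`) when a command fails. A Python rendering of that "run, else write fallback" pattern might look like the sketch below. The CLI arguments are copied from the diff as-is; the helper names, the 60-second timeout, and the `base` default are assumptions for illustration.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def run_to_file(cmd: Sequence[str], out_path: Path, fallback: str) -> bool:
    """Run cmd and write its stdout to out_path; on any failure write fallback."""
    stdout = None
    try:
        # timeout=60 is an assumption; the shell version in the diff has none.
        result = subprocess.run(list(cmd), capture_output=True, text=True, timeout=60)
        if result.returncode == 0:
            stdout = result.stdout
    except (OSError, subprocess.TimeoutExpired):
        pass  # missing CLI or hung process: fall through to the fallback
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(fallback if stdout is None else stdout, encoding="utf-8")
    return stdout is not None


def gather_traces(version: str, base: Path = Path(".harness-evolver")) -> None:
    """Attempt both trace commands; use "baseline" as version on the first iteration."""
    project = f"harness-evolver-{version}"
    run_to_file(
        ["langsmith-cli", "--json", "runs", "list", "--project", project,
         "--failed", "--fields", "id,name,error,inputs", "--limit", "10"],
        base / "langsmith_diagnosis.json",
        "[]\n",
    )
    run_to_file(
        ["langsmith-cli", "--json", "runs", "stats", "--project", project],
        base / "langsmith_stats.json",
        "{}\n",
    )
```

Because both output files always exist afterwards, the proposer's `<files_to_read>` entries resolve even when LangSmith is not set up.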
package/tools/init.py
CHANGED
@@ -298,10 +298,26 @@ def main():
     print(" Recommendation: install langsmith-cli for rich trace analysis:")
     print(" uv tool install langsmith-cli && langsmith-cli auth login")
 
-    # Detect stack
-    stack = _detect_stack(args.harness)
+    # Detect stack — try original harness first, then baseline copy, then scan entire source dir
+    stack = _detect_stack(os.path.abspath(args.harness))
+    if not stack:
+        stack = _detect_stack(os.path.join(base, "baseline", "harness.py"))
+    if not stack:
+        # Scan the original directory for any .py files with known imports
+        harness_dir = os.path.dirname(os.path.abspath(args.harness))
+        detect_stack_py = os.path.join(os.path.dirname(__file__), "detect_stack.py")
+        if os.path.exists(detect_stack_py):
+            try:
+                r = subprocess.run(
+                    ["python3", detect_stack_py, harness_dir],
+                    capture_output=True, text=True, timeout=30,
+                )
+                if r.returncode == 0 and r.stdout.strip():
+                    stack = json.loads(r.stdout)
+            except Exception:
+                pass
     config["stack"] = {
-        "detected": stack,
+        "detected": stack if stack else {},
         "documentation_hint": "use context7",
         "auto_detected": True,
     }
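The `init.py` change is a three-step fallback chain: try the original harness, then the baseline copy, then a directory scan, and store `{}` if nothing is found. The same idea can be written generically, as in the sketch below. `first_non_empty` is a hypothetical helper, not part of the package. Unlike the diff, which only guards the subprocess step, this version guards every step against exceptions.

```python
from typing import Callable, Iterable, Optional


def first_non_empty(detectors: Iterable[Callable[[], Optional[dict]]]) -> dict:
    """Return the first truthy detector result, or {} if none produce one.

    Each detector is a zero-argument callable; a detector that raises is
    treated the same as one that found nothing.
    """
    for detect in detectors:
        try:
            result = detect()
        except Exception:
            continue  # a failing detector should not abort the chain
        if result:
            return result
    return {}
```

With this shape, the chain in the diff would correspond to three callables passed in order, and the final `stack if stack else {}` guard becomes the function's default return value.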