harness-evolver 1.2.0 → 1.4.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,12 +1,26 @@
 ---
 name: harness-evolver-architect
 description: |
-  Use this agent
-
-
-model: opus
+  Use this agent to analyze harness architecture and recommend optimal multi-agent topology.
+  Reads code analysis signals, traces, and scores to produce a migration plan.
+tools: Read, Write, Bash, Grep, Glob
 ---

+## Bootstrap
+
+If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
+every file listed there before performing any other actions.
+
+## Return Protocol
+
+When done, end your response with:
+
+## ARCHITECTURE ANALYSIS COMPLETE
+- **Current topology**: {topology}
+- **Recommended**: {topology}
+- **Confidence**: {low|medium|high}
+- **Migration steps**: {N}
+
 # Harness Evolver — Architect Agent

 You are the architect in a Meta-Harness optimization system. Your job is to analyze a harness's current agent topology, assess whether it matches the task complexity, and recommend the optimal topology with a concrete migration plan.
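The Return Protocol added above gives the orchestrating skill a fixed marker string to wait on. A minimal sketch of how a caller might detect that marker (the helper name is illustrative, not part of the package):

```python
import re

def analysis_complete(response: str) -> bool:
    """Return True once the agent's completion marker appears as its own line."""
    return bool(re.search(r"^## ARCHITECTURE ANALYSIS COMPLETE\s*$", response, re.MULTILINE))

reply = "Topology review...\n## ARCHITECTURE ANALYSIS COMPLETE\n- **Confidence**: high\n"
print(analysis_complete(reply))            # True
print(analysis_complete("still working"))  # False
```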
@@ -1,13 +1,27 @@
 ---
 name: harness-evolver-critic
 description: |
-  Use this agent
-
-
-  to cross-validate scores and identify eval weaknesses.
-model: opus
+  Use this agent to assess eval quality, detect eval gaming, and propose stricter evaluation.
+  Triggered when scores converge suspiciously fast or on user request.
+tools: Read, Write, Bash, Grep, Glob
 ---

+## Bootstrap
+
+If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
+every file listed there before performing any other actions.
+
+## Return Protocol
+
+When done, end your response with:
+
+## CRITIC REPORT COMPLETE
+- **Eval quality**: {weak|moderate|strong}
+- **Gaming detected**: {yes|no}
+- **Weaknesses found**: {N}
+- **Improved eval written**: {yes|no}
+- **Score with improved eval**: {score or N/A}
+
 # Harness Evolver — Critic Agent

 You are the critic in the Harness Evolver loop. Your job is to assess whether the eval
@@ -1,12 +1,27 @@
 ---
 name: harness-evolver-proposer
 description: |
-  Use this agent when the
-
-
-
+  Use this agent when the evolve skill needs to propose a new harness candidate.
+  Navigates the .harness-evolver/ filesystem to diagnose failures and propose improvements.
+tools: Read, Write, Edit, Bash, Glob, Grep
+permissionMode: acceptEdits
 ---

+## Bootstrap
+
+If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
+every file listed there before performing any other actions. These files are your context.
+
+## Return Protocol
+
+When done, end your response with:
+
+## PROPOSAL COMPLETE
+- **Version**: v{NNN}
+- **Parent**: v{PARENT}
+- **Change**: {one-sentence summary}
+- **Expected impact**: {score prediction}
+
 # Harness Evolver — Proposer Agent

 You are the proposer in a Meta-Harness optimization loop. Your job is to analyze all prior harness candidates — their code, execution traces, and scores — and propose a new harness that improves on them.
package/package.json
CHANGED

@@ -28,80 +28,60 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo

 Use `$TOOLS` prefix for all tool calls below.

-##
+## What To Do

-
+1. Check `.harness-evolver/` exists.

+2. Run architecture analysis tool:
 ```bash
-
-
-
-if [ -f ".harness-evolver/summary.json" ]; then
-  BEST=$(python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(s.get('best',{}).get('version',''))")
-  if [ -n "$BEST" ] && [ -d ".harness-evolver/harnesses/$BEST/traces" ]; then
-    CMD="$CMD --traces-dir .harness-evolver/harnesses/$BEST/traces"
-  fi
-  CMD="$CMD --summary .harness-evolver/summary.json"
-fi
-
-CMD="$CMD -o .harness-evolver/architecture_signals.json"
-
-eval $CMD
+python3 $TOOLS/analyze_architecture.py \
+  --harness .harness-evolver/baseline/harness.py \
+  -o .harness-evolver/architecture_signals.json
 ```

-
-
-
-
-
-
-
-> Raw signals are at `.harness-evolver/architecture_signals.json`.
-> Write `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`.
-
-The architect agent will:
-1. Read the signals JSON
-2. Read the harness code and config
-3. Classify the current topology
-4. Assess if it matches task complexity
-5. Recommend the optimal topology with migration steps
-6. Write `architecture.json` and `architecture.md`
-
-## Step 3: Report
-
-After the architect agent completes, read the outputs and print a summary:
-
+If evolution has run, add trace and score data:
+```bash
+python3 $TOOLS/analyze_architecture.py \
+  --harness .harness-evolver/harnesses/{best}/harness.py \
+  --traces-dir .harness-evolver/harnesses/{best}/traces \
+  --summary .harness-evolver/summary.json \
+  -o .harness-evolver/architecture_signals.json
 ```
-Architecture Analysis Complete
-==============================
-Current topology: {current_topology}
-Recommended topology: {recommended_topology}
-Confidence: {confidence}
-
-Reasoning: {reasoning}
-
-Migration Path:
-1. {step 1 description}
-2. {step 2 description}
-...

-
-
-
-
-
+3. Spawn the `harness-evolver-architect` agent:
+
+```xml
+<objective>
+Analyze the harness architecture and recommend the optimal multi-agent topology.
+{If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
+{If called by user: "The user requested an architecture analysis."}
+</objective>
+
+<files_to_read>
+- .harness-evolver/architecture_signals.json
+- .harness-evolver/config.json
+- .harness-evolver/baseline/harness.py
+- .harness-evolver/summary.json (if exists)
+- .harness-evolver/PROPOSER_HISTORY.md (if exists)
+</files_to_read>
+
+<output>
+Write:
+- .harness-evolver/architecture.json
+- .harness-evolver/architecture.md
+</output>
+
+<success_criteria>
+- Classifies current topology correctly
+- Recommendation includes migration path with concrete steps
+- Considers detected stack and API key availability
+- Confidence rating is honest (low/medium/high)
+</success_criteria>
 ```

-
-
-```
-Architecture Analysis Complete
-==============================
-Current topology: {topology} — looks optimal for these tasks.
-No architecture change recommended. Score: {score}
+4. Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.

-
-```
+5. Print summary: current -> recommended, confidence, migration steps.

 ## Arguments

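The removed bash in the hunk above extracted the best version from `summary.json` via an inline Python call. The same lookup, expanded into a standalone helper for readability (the `best.version` field names come from that one-liner; the missing-file fallback is an assumption):

```python
import json
from pathlib import Path

def best_version(summary_path: str) -> str:
    """Return summary['best']['version'], or '' when the file or field is absent."""
    path = Path(summary_path)
    if not path.is_file():
        return ""
    summary = json.loads(path.read_text())
    return summary.get("best", {}).get("version", "")

# A missing summary.json yields the empty string, matching the bash [ -n "$BEST" ] guard.
print(best_version("/nonexistent/summary.json"))
```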
package/skills/critic/SKILL.md
CHANGED

@@ -20,18 +20,42 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo

 ## What To Do

-1. Read `summary.json`
-   - Score jump >0.3 in a single iteration
-   - Score reached 1.0 in <3 iterations
-   - All tasks suddenly pass after failing
+1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).

 2. Spawn the `harness-evolver-critic` agent:
-
-
-
-
-
-
-
-
-
+
+```xml
+<objective>
+Analyze eval quality for this harness evolution project.
+The best version is {version} with score {score} achieved in {iterations} iteration(s).
+{Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
+</objective>
+
+<files_to_read>
+- .harness-evolver/eval/eval.py
+- .harness-evolver/summary.json
+- .harness-evolver/harnesses/{best_version}/scores.json
+- .harness-evolver/harnesses/{best_version}/harness.py
+- .harness-evolver/harnesses/{best_version}/proposal.md
+- .harness-evolver/config.json
+</files_to_read>
+
+<output>
+Write:
+- .harness-evolver/critic_report.md (human-readable analysis)
+- .harness-evolver/eval/eval_improved.py (if weaknesses found)
+</output>
+
+<success_criteria>
+- Identifies specific weaknesses in eval.py with examples
+- If gaming detected, shows exact tasks/outputs that expose the weakness
+- Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+- Re-scores the best version with improved eval to quantify the difference
+</success_criteria>
+```
+
+3. Wait for `## CRITIC REPORT COMPLETE`.
+
+4. Report findings to user. If `eval_improved.py` was written:
+   - Show score comparison (current eval vs improved eval)
+   - Ask: "Adopt the improved eval? This will affect future iterations."
package/skills/evolve/SKILL.md
CHANGED

@@ -5,7 +5,7 @@ argument-hint: "[--iterations N]"
 allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
 ---

-# /harness-evolve
+# /harness-evolver:evolve

 Run the autonomous propose-evaluate-iterate loop.

@@ -34,14 +34,81 @@ For each iteration:
 python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
 ```

-###
+### 1.5. Gather Diagnostic Context (LangSmith + Context7)
+
+**This step is MANDATORY before every propose.** The orchestrator gathers data so the proposer receives it as files.
+
+**LangSmith (if enabled):**
+
+Check if LangSmith is enabled and langsmith-cli is available:
+```bash
+cat .harness-evolver/config.json | python3 -c "import sys,json; print(json.load(sys.stdin).get('eval',{}).get('langsmith',{}).get('enabled',False))"
+which langsmith-cli 2>/dev/null
+```
+
+If BOTH are true AND at least one iteration has run, gather LangSmith data:
+```bash
+langsmith-cli --json runs list --project harness-evolver-{best_version} --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
+
+langsmith-cli --json runs stats --project harness-evolver-{best_version} > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
+```
+
+**Context7 (if available):**
+
+Check `config.json` field `stack.detected`. For each detected library, use the Context7 MCP tools to fetch relevant documentation:
+
+```
+For each library in stack.detected:
+1. resolve-library-id with the context7_id
+2. get-library-docs with a query relevant to the current failure modes
+3. Save output to .harness-evolver/context7_docs.md (append each library's docs)
+```
+
+This runs ONCE per iteration, not per library. Focus on the library most relevant to the current failures.

-
+If Context7 MCP is not available, skip silently.

-
-
+### 2. Propose
+
+Spawn the `harness-evolver-proposer` agent with a structured prompt.
+
+The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered:
+
+```xml
+<objective>
+Propose harness version {version} that improves on the current best score of {best_score}.
+</objective>
+
+<files_to_read>
+- .harness-evolver/summary.json
+- .harness-evolver/PROPOSER_HISTORY.md
+- .harness-evolver/config.json
+- .harness-evolver/baseline/harness.py
+- .harness-evolver/harnesses/{best_version}/harness.py
+- .harness-evolver/harnesses/{best_version}/scores.json
+- .harness-evolver/harnesses/{best_version}/proposal.md
+- .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
+- .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
+- .harness-evolver/context7_docs.md (if exists — current library documentation)
+- .harness-evolver/architecture.json (if exists — architect topology recommendation)
+</files_to_read>
+
+<output>
+Create directory .harness-evolver/harnesses/{version}/ containing:
+- harness.py (the improved harness)
+- config.json (parameters, copy from parent if unchanged)
+- proposal.md (reasoning, must start with "Based on v{PARENT}")
+</output>
+
+<success_criteria>
+- harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
+- proposal.md documents evidence-based reasoning
+- Changes are motivated by trace analysis (LangSmith data if available), not guesswork
+- If context7_docs.md was provided, API usage must match current documentation
+</success_criteria>
+```

-
+Wait for the agent to complete. Look for `## PROPOSAL COMPLETE` in the response.

 ### 3. Validate

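The `python3 -c` one-liner in step 1 of this skill derives the next version tag from the iteration counter in `summary.json`. Expanded for readability:

```python
def next_version(summary: dict) -> str:
    """Mirror of the inline command: iterations + 1, zero-padded to three digits."""
    return f"v{summary['iterations'] + 1:03d}"

print(next_version({"iterations": 0}))   # v001
print(next_version({"iterations": 11}))  # v012
```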
@@ -80,26 +147,73 @@ python3 $TOOLS/state.py update \

 Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`

-### 6.5.
+### 6.5. Auto-trigger Critic (on eval gaming)

-
+Read `summary.json` and check:
 - Did the score jump >0.3 from parent version?
 - Did we reach 1.0 in fewer than 3 total iterations?

-If
+If EITHER is true, **AUTO-SPAWN the critic agent** (do not just suggest — actually spawn it):
+
+```bash
+python3 $TOOLS/evaluate.py run \
+  --harness .harness-evolver/harnesses/{version}/harness.py \
+  --tasks-dir .harness-evolver/eval/tasks/ \
+  --eval .harness-evolver/eval/eval.py \
+  --traces-dir /tmp/critic-check/ \
+  --scores /tmp/critic-check-scores.json \
+  --timeout 60
+```
+
+Spawn the `harness-evolver-critic` agent:
+
+```xml
+<objective>
+EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
+Analyze the eval quality and propose a stricter eval.
+</objective>
+
+<files_to_read>
+- .harness-evolver/eval/eval.py
+- .harness-evolver/summary.json
+- .harness-evolver/harnesses/{version}/scores.json
+- .harness-evolver/harnesses/{version}/harness.py
+- .harness-evolver/harnesses/{version}/proposal.md
+- .harness-evolver/config.json
+- .harness-evolver/langsmith_stats.json (if exists)
+</files_to_read>
+
+<output>
+Write:
+- .harness-evolver/critic_report.md
+- .harness-evolver/eval/eval_improved.py (if weaknesses found)
+</output>
+
+<success_criteria>
+- Identifies specific weaknesses in eval.py with task/output examples
+- If gaming detected, shows exact tasks that expose the weakness
+- Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+- Re-scores the best version with improved eval to show the difference
+</success_criteria>
+```

-
-> The eval may be too lenient. Run `/harness-evolver:critic` to analyze eval quality.
+Wait for `## CRITIC REPORT COMPLETE`.

-If
+If critic wrote `eval_improved.py`:
+- Re-score the best harness with the improved eval
+- Show the score difference (e.g., "Current eval: 1.0. Improved eval: 0.45")
+- **AUTO-ADOPT the improved eval**: copy `eval_improved.py` to `eval/eval.py`
+- Re-run baseline with new eval and update `summary.json`
+- Print: "Eval upgraded. Resuming evolution with stricter eval."
+- **Continue the loop** with the new eval

-
-
-
+If critic did NOT write `eval_improved.py` (eval is fine):
+- Print the critic's assessment
+- Continue the loop normally

 ### 7. Auto-trigger Architect (on stagnation or regression)

-Check if the architect should be auto-spawned
+Check if the architect should be auto-spawned:
 - **Stagnation**: 3 consecutive iterations within 1% of each other
 - **Regression**: score dropped below parent score (even once)

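The two critic triggers in step 6.5 (score jump above 0.3, or a perfect score in under 3 iterations) can be sketched as a small predicate; the function and argument names are illustrative, the thresholds come from the text:

```python
def should_trigger_critic(score: float, parent_score: float, total_iterations: int) -> bool:
    """True when the score pattern matches either eval-gaming heuristic."""
    jumped = (score - parent_score) > 0.3           # jump of more than 0.3 in one iteration
    too_fast = score >= 1.0 and total_iterations < 3  # perfect score almost immediately
    return jumped or too_fast

print(should_trigger_critic(0.95, 0.5, 5))  # True (jump of 0.45)
print(should_trigger_critic(1.0, 0.9, 2))   # True (1.0 in under 3 iterations)
print(should_trigger_critic(0.7, 0.6, 5))   # False
```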
@@ -115,32 +229,54 @@ python3 $TOOLS/analyze_architecture.py \
   -o .harness-evolver/architecture_signals.json
 ```

-
-
-
->
-
-
-
-
+Spawn the `harness-evolver-architect` agent:
+
+```xml
+<objective>
+The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
+Analyze the harness architecture and recommend a topology change.
+</objective>
+
+<files_to_read>
+- .harness-evolver/architecture_signals.json
+- .harness-evolver/summary.json
+- .harness-evolver/PROPOSER_HISTORY.md
+- .harness-evolver/config.json
+- .harness-evolver/harnesses/{best_version}/harness.py
+- .harness-evolver/harnesses/{best_version}/scores.json
+- .harness-evolver/context7_docs.md (if exists)
+</files_to_read>
+
+<output>
+Write:
+- .harness-evolver/architecture.json (structured recommendation)
+- .harness-evolver/architecture.md (human-readable analysis)
+</output>
+
+<success_criteria>
+- Recommendation includes concrete migration steps
+- Each step is implementable in one proposer iteration
+- Considers detected stack and available API keys
+</success_criteria>
+```

-
-> Migration path: {N} steps. Continuing evolution with architecture guidance.
+Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.

-
+Report: `Architect recommends: {current} → {recommended} ({confidence} confidence)`

-
+Then **continue the loop** — the proposer reads `architecture.json` in the next iteration.

 ### 8. Check Stop Conditions

 - **Target**: `combined_score >= target_score` → stop
 - **N reached**: done
-- **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop
+- **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop

 ## When Loop Ends — Final Report

 - Best version and score
 - Improvement over baseline (absolute and %)
 - Total iterations run
+- Whether critic was triggered and eval was upgraded
 - Whether architect was triggered and what it recommended
 - Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."
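The stagnation trigger in step 7 ("3 consecutive iterations within 1% of each other") can be sketched as a check over the score history; interpreting "within 1%" as relative spread over the window is an assumption, and the function name is illustrative:

```python
def stagnated(scores: list[float], window: int = 3, tol: float = 0.01) -> bool:
    """True when the last `window` scores differ by less than `tol` relative to their max."""
    if len(scores) < window:
        return False
    recent = scores[-window:]
    hi, lo = max(recent), min(recent)
    return hi > 0 and (hi - lo) / hi < tol

print(stagnated([0.40, 0.60, 0.700, 0.702, 0.699]))  # True (last three within 1%)
print(stagnated([0.40, 0.60, 0.70]))                 # False (still improving)
```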