harness-evolver 1.2.0 → 1.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,12 +1,26 @@
  ---
  name: harness-evolver-architect
  description: |
- Use this agent when the harness-evolver:architect skill needs to analyze a harness
- and recommend the optimal multi-agent topology. Reads code analysis signals, traces,
- and scores to produce a migration plan from current to recommended architecture.
- model: opus
+ Use this agent to analyze harness architecture and recommend optimal multi-agent topology.
+ Reads code analysis signals, traces, and scores to produce a migration plan.
+ tools: Read, Write, Bash, Grep, Glob
  ---

+ ## Bootstrap
+
+ If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
+ every file listed there before performing any other actions.
+
+ ## Return Protocol
+
+ When done, end your response with:
+
+ ## ARCHITECTURE ANALYSIS COMPLETE
+ - **Current topology**: {topology}
+ - **Recommended**: {topology}
+ - **Confidence**: {low|medium|high}
+ - **Migration steps**: {N}
+
  # Harness Evolver — Architect Agent

  You are the architect in a Meta-Harness optimization system. Your job is to analyze a harness's current agent topology, assess whether it matches the task complexity, and recommend the optimal topology with a concrete migration plan.
@@ -1,13 +1,27 @@
  ---
  name: harness-evolver-critic
  description: |
- Use this agent when scores converge suspiciously fast (>0.3 jump in one iteration
- or 1.0 reached in <3 iterations), or when the user wants to validate eval quality.
- Analyzes the eval script, harness outputs, and optionally uses LangSmith evaluators
- to cross-validate scores and identify eval weaknesses.
- model: opus
+ Use this agent to assess eval quality, detect eval gaming, and propose stricter evaluation.
+ Triggered when scores converge suspiciously fast or on user request.
+ tools: Read, Write, Bash, Grep, Glob
  ---

+ ## Bootstrap
+
+ If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
+ every file listed there before performing any other actions.
+
+ ## Return Protocol
+
+ When done, end your response with:
+
+ ## CRITIC REPORT COMPLETE
+ - **Eval quality**: {weak|moderate|strong}
+ - **Gaming detected**: {yes|no}
+ - **Weaknesses found**: {N}
+ - **Improved eval written**: {yes|no}
+ - **Score with improved eval**: {score or N/A}
+
  # Harness Evolver — Critic Agent

  You are the critic in the Harness Evolver loop. Your job is to assess whether the eval
@@ -1,12 +1,27 @@
  ---
  name: harness-evolver-proposer
  description: |
- Use this agent when the harness-evolve skill needs to propose a new harness candidate.
- This agent navigates the .harness-evolver/ filesystem to diagnose failures in prior
- candidates and propose an improved harness. It is the core of the Meta-Harness optimization loop.
- model: opus
+ Use this agent when the evolve skill needs to propose a new harness candidate.
+ Navigates the .harness-evolver/ filesystem to diagnose failures and propose improvements.
+ tools: Read, Write, Edit, Bash, Glob, Grep
+ permissionMode: acceptEdits
  ---

+ ## Bootstrap
+
+ If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
+ every file listed there before performing any other actions. These files are your context.
+
+ ## Return Protocol
+
+ When done, end your response with:
+
+ ## PROPOSAL COMPLETE
+ - **Version**: v{NNN}
+ - **Parent**: v{PARENT}
+ - **Change**: {one-sentence summary}
+ - **Expected impact**: {score prediction}
+
  # Harness Evolver — Proposer Agent

  You are the proposer in a Meta-Harness optimization loop. Your job is to analyze all prior harness candidates — their code, execution traces, and scores — and propose a new harness that improves on them.
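The `<files_to_read>` bootstrap block that all three agent definitions above share can be handled by a small parser on the orchestrator side. This is a minimal sketch, not part of the package: the function name `extract_files_to_read` and the stripping of `(if exists ...)` annotations are assumptions based on how the lists appear in the prompts below.

```python
import re

def extract_files_to_read(prompt: str) -> list[str]:
    """Return the file paths listed in a <files_to_read> block, if any."""
    block = re.search(r"<files_to_read>(.*?)</files_to_read>", prompt, re.DOTALL)
    if block is None:
        return []
    paths = []
    for line in block.group(1).splitlines():
        item = line.strip().lstrip("-").strip()
        item = re.sub(r"\s*\(if exists[^)]*\)$", "", item)  # drop optional-file notes
        if item:
            paths.append(item)
    return paths

prompt = """<files_to_read>
- .harness-evolver/summary.json
- .harness-evolver/architecture.json (if exists)
</files_to_read>"""
print(extract_files_to_read(prompt))
```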
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "1.2.0",
+ "version": "1.4.0",
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
  "author": "Raphael Valdetaro",
  "license": "MIT",
@@ -28,80 +28,60 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo

  Use `$TOOLS` prefix for all tool calls below.

- ## Step 1: Run Architecture Analysis
+ ## What To Do

- Build the command based on what exists:
+ 1. Check `.harness-evolver/` exists.

+ 2. Run architecture analysis tool:
  ```bash
- CMD="python3 $TOOLS/analyze_architecture.py --harness .harness-evolver/baseline/harness.py"
-
- # Add traces from best version if evolution has run
- if [ -f ".harness-evolver/summary.json" ]; then
- BEST=$(python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(s.get('best',{}).get('version',''))")
- if [ -n "$BEST" ] && [ -d ".harness-evolver/harnesses/$BEST/traces" ]; then
- CMD="$CMD --traces-dir .harness-evolver/harnesses/$BEST/traces"
- fi
- CMD="$CMD --summary .harness-evolver/summary.json"
- fi
-
- CMD="$CMD -o .harness-evolver/architecture_signals.json"
-
- eval $CMD
+ python3 $TOOLS/analyze_architecture.py \
+ --harness .harness-evolver/baseline/harness.py \
+ -o .harness-evolver/architecture_signals.json
  ```

- Check exit code. If it fails, report the error and stop.
-
- ## Step 2: Spawn Architect Agent
-
- Spawn the `harness-evolver-architect` agent with:
-
- > Analyze the harness and recommend the optimal multi-agent topology.
- > Raw signals are at `.harness-evolver/architecture_signals.json`.
- > Write `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`.
-
- The architect agent will:
- 1. Read the signals JSON
- 2. Read the harness code and config
- 3. Classify the current topology
- 4. Assess if it matches task complexity
- 5. Recommend the optimal topology with migration steps
- 6. Write `architecture.json` and `architecture.md`
-
- ## Step 3: Report
-
- After the architect agent completes, read the outputs and print a summary:
-
+ If evolution has run, add trace and score data:
+ ```bash
+ python3 $TOOLS/analyze_architecture.py \
+ --harness .harness-evolver/harnesses/{best}/harness.py \
+ --traces-dir .harness-evolver/harnesses/{best}/traces \
+ --summary .harness-evolver/summary.json \
+ -o .harness-evolver/architecture_signals.json
  ```
- Architecture Analysis Complete
- ==============================
- Current topology: {current_topology}
- Recommended topology: {recommended_topology}
- Confidence: {confidence}
-
- Reasoning: {reasoning}
-
- Migration Path:
- 1. {step 1 description}
- 2. {step 2 description}
- ...

- Risks:
- - {risk 1}
- - {risk 2}
-
- Next: Run /harness-evolver:evolve the proposer will follow the migration path.
+ 3. Spawn the `harness-evolver-architect` agent:
+
+ ```xml
+ <objective>
+ Analyze the harness architecture and recommend the optimal multi-agent topology.
+ {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
+ {If called by user: "The user requested an architecture analysis."}
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/architecture_signals.json
+ - .harness-evolver/config.json
+ - .harness-evolver/baseline/harness.py
+ - .harness-evolver/summary.json (if exists)
+ - .harness-evolver/PROPOSER_HISTORY.md (if exists)
+ </files_to_read>
+
+ <output>
+ Write:
+ - .harness-evolver/architecture.json
+ - .harness-evolver/architecture.md
+ </output>
+
+ <success_criteria>
+ - Classifies current topology correctly
+ - Recommendation includes migration path with concrete steps
+ - Considers detected stack and API key availability
+ - Confidence rating is honest (low/medium/high)
+ </success_criteria>
  ```

- If the architect recommends no change (current = recommended), report:
-
- ```
- Architecture Analysis Complete
- ==============================
- Current topology: {topology} — looks optimal for these tasks.
- No architecture change recommended. Score: {score}
+ 4. Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.

- The proposer can continue evolving within the current topology.
- ```
+ 5. Print summary: current -> recommended, confidence, migration steps.

  ## Arguments

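The "Wait for `## ARCHITECTURE ANALYSIS COMPLETE`" step in the hunk above implies the orchestrator scans the agent's response for the return-protocol marker and its `- **Key**: value` lines. A minimal sketch of that check; `parse_return_protocol` and the returned dict shape are illustrative assumptions:

```python
import re

def parse_return_protocol(response: str, marker: str):
    """Return the '- **Key**: value' fields after a completion marker, or None."""
    if marker not in response:
        return None  # agent did not finish per protocol
    tail = response.split(marker, 1)[1]
    fields = {}
    for key, value in re.findall(r"-\s+\*\*(.+?)\*\*:\s*(.+)", tail):
        fields[key] = value.strip()
    return fields

resp = """...analysis prose...
## ARCHITECTURE ANALYSIS COMPLETE
- **Current topology**: single-agent
- **Recommended**: orchestrator-workers
- **Confidence**: medium
- **Migration steps**: 3
"""
fields = parse_return_protocol(resp, "## ARCHITECTURE ANALYSIS COMPLETE")
print(fields)
```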
@@ -20,18 +20,42 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo

  ## What To Do

- 1. Read `summary.json` to check for suspicious patterns:
- - Score jump >0.3 in a single iteration
- - Score reached 1.0 in <3 iterations
- - All tasks suddenly pass after failing
+ 1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).

  2. Spawn the `harness-evolver-critic` agent:
- > Analyze the eval quality for this harness evolution project.
- > Check if the eval at `.harness-evolver/eval/eval.py` is rigorous enough.
- > The best version is {version} with score {score} achieved in {iterations} iterations.
-
- 3. After the critic reports:
- - Show the eval quality assessment
- - If `eval_improved.py` was created, show the score comparison
- - Ask user: "Adopt the improved eval? This will re-baseline all scores."
- - If adopted: copy `eval_improved.py` to `eval/eval.py`, re-run baseline, update state
+
+ ```xml
+ <objective>
+ Analyze eval quality for this harness evolution project.
+ The best version is {version} with score {score} achieved in {iterations} iteration(s).
+ {Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/eval/eval.py
+ - .harness-evolver/summary.json
+ - .harness-evolver/harnesses/{best_version}/scores.json
+ - .harness-evolver/harnesses/{best_version}/harness.py
+ - .harness-evolver/harnesses/{best_version}/proposal.md
+ - .harness-evolver/config.json
+ </files_to_read>
+
+ <output>
+ Write:
+ - .harness-evolver/critic_report.md (human-readable analysis)
+ - .harness-evolver/eval/eval_improved.py (if weaknesses found)
+ </output>
+
+ <success_criteria>
+ - Identifies specific weaknesses in eval.py with examples
+ - If gaming detected, shows exact tasks/outputs that expose the weakness
+ - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+ - Re-scores the best version with improved eval to quantify the difference
+ </success_criteria>
+ ```
+
+ 3. Wait for `## CRITIC REPORT COMPLETE`.
+
+ 4. Report findings to user. If `eval_improved.py` was written:
+ - Show score comparison (current eval vs improved eval)
+ - Ask: "Adopt the improved eval? This will affect future iterations."
@@ -5,7 +5,7 @@ argument-hint: "[--iterations N]"
  allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
  ---

- # /harness-evolve
+ # /harness-evolver:evolve

  Run the autonomous propose-evaluate-iterate loop.

@@ -34,14 +34,81 @@ For each iteration:
  python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
  ```

- ### 2. Propose
+ ### 1.5. Gather Diagnostic Context (LangSmith + Context7)
+
+ **This step is MANDATORY before every propose.** The orchestrator gathers data so the proposer receives it as files.
+
+ **LangSmith (if enabled):**
+
+ Check if LangSmith is enabled and langsmith-cli is available:
+ ```bash
+ cat .harness-evolver/config.json | python3 -c "import sys,json; print(json.load(sys.stdin).get('eval',{}).get('langsmith',{}).get('enabled',False))"
+ which langsmith-cli 2>/dev/null
+ ```
+
+ If BOTH are true AND at least one iteration has run, gather LangSmith data:
+ ```bash
+ langsmith-cli --json runs list --project harness-evolver-{best_version} --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
+
+ langsmith-cli --json runs stats --project harness-evolver-{best_version} > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
+ ```
+
+ **Context7 (if available):**
+
+ Check `config.json` field `stack.detected`. For each detected library, use the Context7 MCP tools to fetch relevant documentation:
+
+ ```
+ For each library in stack.detected:
+ 1. resolve-library-id with the context7_id
+ 2. get-library-docs with a query relevant to the current failure modes
+ 3. Save output to .harness-evolver/context7_docs.md (append each library's docs)
+ ```
+
+ This runs ONCE per iteration, not per library. Focus on the library most relevant to the current failures.

- Spawn the `harness-evolver-proposer` agent:
+ If Context7 MCP is not available, skip silently.

- > You are proposing iteration {i}. Create version {version} in `.harness-evolver/harnesses/{version}/`.
- > Working directory contains `.harness-evolver/` with all prior candidates and traces.
+ ### 2. Propose
+
+ Spawn the `harness-evolver-proposer` agent with a structured prompt.
+
+ The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered:
+
+ ```xml
+ <objective>
+ Propose harness version {version} that improves on the current best score of {best_score}.
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/summary.json
+ - .harness-evolver/PROPOSER_HISTORY.md
+ - .harness-evolver/config.json
+ - .harness-evolver/baseline/harness.py
+ - .harness-evolver/harnesses/{best_version}/harness.py
+ - .harness-evolver/harnesses/{best_version}/scores.json
+ - .harness-evolver/harnesses/{best_version}/proposal.md
+ - .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
+ - .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
+ - .harness-evolver/context7_docs.md (if exists — current library documentation)
+ - .harness-evolver/architecture.json (if exists — architect topology recommendation)
+ </files_to_read>
+
+ <output>
+ Create directory .harness-evolver/harnesses/{version}/ containing:
+ - harness.py (the improved harness)
+ - config.json (parameters, copy from parent if unchanged)
+ - proposal.md (reasoning, must start with "Based on v{PARENT}")
+ </output>
+
+ <success_criteria>
+ - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
+ - proposal.md documents evidence-based reasoning
+ - Changes are motivated by trace analysis (LangSmith data if available), not guesswork
+ - If context7_docs.md was provided, API usage must match current documentation
+ </success_criteria>
+ ```

- The proposer creates: `harness.py`, `config.json`, `proposal.md`.
+ Wait for the agent to complete. Look for `## PROPOSAL COMPLETE` in the response.

  ### 3. Validate

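The inline `python3 -c` one-liner at the top of the hunk above computes the next version label from `summary.json`. Unpacked into a readable helper (the name `next_version` is illustrative, and the summary dict is built inline here rather than read from `.harness-evolver/summary.json`):

```python
import json

def next_version(summary: dict) -> str:
    """Mirror the one-liner: v{iterations + 1}, zero-padded to three digits."""
    return f"v{summary['iterations'] + 1:03d}"

# Stands in for: json.load(open('.harness-evolver/summary.json'))
summary = json.loads('{"iterations": 3, "best": {"version": "v002", "score": 0.8}}')
print(next_version(summary))  # → v004
```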
@@ -80,26 +147,73 @@ python3 $TOOLS/state.py update \

  Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`

- ### 6.5. Check for Eval Gaming
+ ### 6.5. Auto-trigger Critic (on eval gaming)

- After updating state, read the latest `summary.json` and check:
+ Read `summary.json` and check:
  - Did the score jump >0.3 from parent version?
  - Did we reach 1.0 in fewer than 3 total iterations?

- If either is true, warn:
+ If EITHER is true, **AUTO-SPAWN the critic agent** (do not just suggest — actually spawn it):
+
+ ```bash
+ python3 $TOOLS/evaluate.py run \
+ --harness .harness-evolver/harnesses/{version}/harness.py \
+ --tasks-dir .harness-evolver/eval/tasks/ \
+ --eval .harness-evolver/eval/eval.py \
+ --traces-dir /tmp/critic-check/ \
+ --scores /tmp/critic-check-scores.json \
+ --timeout 60
+ ```
+
+ Spawn the `harness-evolver-critic` agent:
+
+ ```xml
+ <objective>
+ EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
+ Analyze the eval quality and propose a stricter eval.
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/eval/eval.py
+ - .harness-evolver/summary.json
+ - .harness-evolver/harnesses/{version}/scores.json
+ - .harness-evolver/harnesses/{version}/harness.py
+ - .harness-evolver/harnesses/{version}/proposal.md
+ - .harness-evolver/config.json
+ - .harness-evolver/langsmith_stats.json (if exists)
+ </files_to_read>
+
+ <output>
+ Write:
+ - .harness-evolver/critic_report.md
+ - .harness-evolver/eval/eval_improved.py (if weaknesses found)
+ </output>
+
+ <success_criteria>
+ - Identifies specific weaknesses in eval.py with task/output examples
+ - If gaming detected, shows exact tasks that expose the weakness
+ - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+ - Re-scores the best version with improved eval to show the difference
+ </success_criteria>
+ ```

- > Suspicious convergence detected: score jumped from {parent_score} to {score} in one iteration.
- > The eval may be too lenient. Run `/harness-evolver:critic` to analyze eval quality.
+ Wait for `## CRITIC REPORT COMPLETE`.

- If score is 1.0 and iterations < 3, STOP the loop and strongly recommend the critic:
+ If critic wrote `eval_improved.py`:
+ - Re-score the best harness with the improved eval
+ - Show the score difference (e.g., "Current eval: 1.0. Improved eval: 0.45")
+ - **AUTO-ADOPT the improved eval**: copy `eval_improved.py` to `eval/eval.py`
+ - Re-run baseline with new eval and update `summary.json`
+ - Print: "Eval upgraded. Resuming evolution with stricter eval."
+ - **Continue the loop** with the new eval

- > Perfect score reached in only {iterations} iteration(s). This usually indicates
- > the eval is too easy, not that the harness is perfect. Run `/harness-evolver:critic`
- > before continuing.
+ If critic did NOT write `eval_improved.py` (eval is fine):
+ - Print the critic's assessment
+ - Continue the loop normally

  ### 7. Auto-trigger Architect (on stagnation or regression)

- Check if the architect should be auto-spawned. This happens when:
+ Check if the architect should be auto-spawned:
  - **Stagnation**: 3 consecutive iterations within 1% of each other
  - **Regression**: score dropped below parent score (even once)

@@ -115,32 +229,54 @@ python3 $TOOLS/analyze_architecture.py \
  -o .harness-evolver/architecture_signals.json
  ```

- Then spawn the `harness-evolver-architect` agent:
-
- > The evolution loop has stagnated/regressed after {iterations} iterations (best: {best_score}).
- > Analyze the harness architecture and recommend a topology change.
- > Raw signals at `.harness-evolver/architecture_signals.json`.
- > Write `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`.
-
- After the architect completes, report:
+ Spawn the `harness-evolver-architect` agent:
+
+ ```xml
+ <objective>
+ The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
+ Analyze the harness architecture and recommend a topology change.
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/architecture_signals.json
+ - .harness-evolver/summary.json
+ - .harness-evolver/PROPOSER_HISTORY.md
+ - .harness-evolver/config.json
+ - .harness-evolver/harnesses/{best_version}/harness.py
+ - .harness-evolver/harnesses/{best_version}/scores.json
+ - .harness-evolver/context7_docs.md (if exists)
+ </files_to_read>
+
+ <output>
+ Write:
+ - .harness-evolver/architecture.json (structured recommendation)
+ - .harness-evolver/architecture.md (human-readable analysis)
+ </output>
+
+ <success_criteria>
+ - Recommendation includes concrete migration steps
+ - Each step is implementable in one proposer iteration
+ - Considers detected stack and available API keys
+ </success_criteria>
+ ```

- > Architect recommends: {current} {recommended} ({confidence} confidence)
- > Migration path: {N} steps. Continuing evolution with architecture guidance.
+ Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.

- Then **continue the loop** the proposer will read `architecture.json` in the next iteration.
+ Report: `Architect recommends: {current} {recommended} ({confidence} confidence)`

- If `architecture.json` already exists (architect already ran), skip — don't re-run.
+ Then **continue the loop** — the proposer reads `architecture.json` in the next iteration.

  ### 8. Check Stop Conditions

  - **Target**: `combined_score >= target_score` → stop
  - **N reached**: done
- - **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop (architecture change didn't help)
+ - **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop

  ## When Loop Ends — Final Report

  - Best version and score
  - Improvement over baseline (absolute and %)
  - Total iterations run
+ - Whether critic was triggered and eval was upgraded
  - Whether architect was triggered and what it recommended
  - Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."
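The trigger conditions from steps 6.5 and 7 of the evolve loop can be sketched as two predicates. This is a minimal sketch under stated assumptions: the function names are illustrative, and "within 1% of each other" is read here as relative spread over the last three scores.

```python
def should_trigger_critic(score: float, parent_score: float, iterations: int) -> bool:
    """Eval-gaming heuristics from step 6.5 of the evolve loop."""
    jumped = (score - parent_score) > 0.3           # >0.3 jump from parent version
    premature = score >= 1.0 and iterations < 3     # perfect score too early
    return jumped or premature

def is_stagnating(scores: list[float]) -> bool:
    """Stagnation check from step 7: 3 consecutive iterations within 1%."""
    if len(scores) < 3:
        return False
    recent = scores[-3:]
    return max(recent) - min(recent) <= 0.01 * max(recent)

print(should_trigger_critic(1.0, 0.55, 2))   # → True
print(is_stagnating([0.70, 0.705, 0.702]))   # → True
```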