harness-evolver 1.3.0 → 1.5.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "1.3.0",
+ "version": "1.5.0",
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
  "author": "Raphael Valdetaro",
  "license": "MIT",
@@ -48,35 +48,40 @@ python3 $TOOLS/analyze_architecture.py \
  -o .harness-evolver/architecture_signals.json
  ```

- 3. Spawn the `harness-evolver-architect` agent:
-
- ```xml
- <objective>
- Analyze the harness architecture and recommend the optimal multi-agent topology.
- {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
- {If called by user: "The user requested an architecture analysis."}
- </objective>
-
- <files_to_read>
- - .harness-evolver/architecture_signals.json
- - .harness-evolver/config.json
- - .harness-evolver/baseline/harness.py
- - .harness-evolver/summary.json (if exists)
- - .harness-evolver/PROPOSER_HISTORY.md (if exists)
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/architecture.json
- - .harness-evolver/architecture.md
- </output>
-
- <success_criteria>
- - Classifies current topology correctly
- - Recommendation includes migration path with concrete steps
- - Considers detected stack and API key availability
- - Confidence rating is honest (low/medium/high)
- </success_criteria>
+ 3. Dispatch a subagent using the **Agent tool** with `subagent_type: "harness-evolver-architect"`:
+
+ ```
+ Agent(
+ subagent_type: "harness-evolver-architect",
+ description: "Architect: topology analysis",
+ prompt: |
+ <objective>
+ Analyze the harness architecture and recommend the optimal multi-agent topology.
+ {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
+ {If called by user: "The user requested an architecture analysis."}
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/architecture_signals.json
+ - .harness-evolver/config.json
+ - .harness-evolver/baseline/harness.py
+ - .harness-evolver/summary.json (if exists)
+ - .harness-evolver/PROPOSER_HISTORY.md (if exists)
+ </files_to_read>
+
+ <output>
+ Write:
+ - .harness-evolver/architecture.json
+ - .harness-evolver/architecture.md
+ </output>
+
+ <success_criteria>
+ - Classifies current topology correctly
+ - Recommendation includes migration path with concrete steps
+ - Considers detected stack and API key availability
+ - Confidence rating is honest (low/medium/high)
+ </success_criteria>
+ )
  ```

  4. Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
@@ -22,36 +22,42 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo

  1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).

- 2. Spawn the `harness-evolver-critic` agent:
-
- ```xml
- <objective>
- Analyze eval quality for this harness evolution project.
- The best version is {version} with score {score} achieved in {iterations} iteration(s).
- {Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
- </objective>
-
- <files_to_read>
- - .harness-evolver/eval/eval.py
- - .harness-evolver/summary.json
- - .harness-evolver/harnesses/{best_version}/scores.json
- - .harness-evolver/harnesses/{best_version}/harness.py
- - .harness-evolver/harnesses/{best_version}/proposal.md
- - .harness-evolver/config.json
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/critic_report.md (human-readable analysis)
- - .harness-evolver/eval/eval_improved.py (if weaknesses found)
- </output>
-
- <success_criteria>
- - Identifies specific weaknesses in eval.py with examples
- - If gaming detected, shows exact tasks/outputs that expose the weakness
- - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
- - Re-scores the best version with improved eval to quantify the difference
- </success_criteria>
+ 2. Dispatch a subagent using the **Agent tool** with `subagent_type: "harness-evolver-critic"`:
+
+ ```
+ Agent(
+ subagent_type: "harness-evolver-critic",
+ description: "Critic: analyze eval quality",
+ prompt: |
+ <objective>
+ Analyze eval quality for this harness evolution project.
+ The best version is {version} with score {score} achieved in {iterations} iteration(s).
+ {Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/eval/eval.py
+ - .harness-evolver/summary.json
+ - .harness-evolver/harnesses/{best_version}/scores.json
+ - .harness-evolver/harnesses/{best_version}/harness.py
+ - .harness-evolver/harnesses/{best_version}/proposal.md
+ - .harness-evolver/config.json
+ - .harness-evolver/langsmith_stats.json (if exists)
+ </files_to_read>
+
+ <output>
+ Write:
+ - .harness-evolver/critic_report.md (human-readable analysis)
+ - .harness-evolver/eval/eval_improved.py (if weaknesses found)
+ </output>
+
+ <success_criteria>
+ - Identifies specific weaknesses in eval.py with examples
+ - If gaming detected, shows exact tasks/outputs that expose the weakness
+ - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+ - Re-scores the best version with improved eval to quantify the difference
+ </success_criteria>
+ )
  ```

  3. Wait for `## CRITIC REPORT COMPLETE`.
@@ -5,7 +5,7 @@ argument-hint: "[--iterations N]"
  allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
  ---

- # /harness-evolve
+ # /harness-evolver:evolve

  Run the autonomous propose-evaluate-iterate loop.

@@ -34,40 +34,86 @@ For each iteration:
  python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
  ```

- ### 2. Propose
+ ### 1.5. Gather Diagnostic Context (LangSmith + Context7)
+
+ **This step is MANDATORY before every propose.** The orchestrator gathers data so the proposer receives it as files.
+
+ **LangSmith (if enabled):**
+
+ Check if LangSmith is enabled and langsmith-cli is available:
+ ```bash
+ cat .harness-evolver/config.json | python3 -c "import sys,json; print(json.load(sys.stdin).get('eval',{}).get('langsmith',{}).get('enabled',False))"
+ which langsmith-cli 2>/dev/null
+ ```
+
+ If BOTH are true AND at least one iteration has run, gather LangSmith data:
+ ```bash
+ langsmith-cli --json runs list --project harness-evolver-{best_version} --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
+
+ langsmith-cli --json runs stats --project harness-evolver-{best_version} > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
+ ```

- Spawn the `harness-evolver-proposer` agent with a structured prompt:
+ **Context7 (if available):**

- ```xml
- <objective>
- Propose harness version {version} that improves on the current best score of {best_score}.
- </objective>
+ Check `config.json` field `stack.detected`. For each detected library, use the Context7 MCP tools to fetch relevant documentation:

- <files_to_read>
- - .harness-evolver/summary.json
- - .harness-evolver/PROPOSER_HISTORY.md
- - .harness-evolver/config.json
- - .harness-evolver/baseline/harness.py
- - .harness-evolver/harnesses/{best_version}/harness.py
- - .harness-evolver/harnesses/{best_version}/scores.json
- - .harness-evolver/harnesses/{best_version}/proposal.md
- </files_to_read>
+ ```
+ For each library in stack.detected:
+ 1. resolve-library-id with the context7_id
+ 2. get-library-docs with a query relevant to the current failure modes
+ 3. Save output to .harness-evolver/context7_docs.md (append each library's docs)
+ ```

- <output>
- Create directory .harness-evolver/harnesses/{version}/ containing:
- - harness.py (the improved harness)
- - config.json (parameters, copy from parent if unchanged)
- - proposal.md (reasoning, must start with "Based on v{PARENT}")
- </output>
+ This runs ONCE per iteration, not per library. Focus on the library most relevant to the current failures.

- <success_criteria>
- - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
- - proposal.md documents evidence-based reasoning
- - Changes are motivated by trace analysis, not guesswork
- </success_criteria>
+ If Context7 MCP is not available, skip silently.
+
+ ### 2. Propose
+
+ Dispatch a subagent using the **Agent tool** with `subagent_type: "harness-evolver-proposer"`.
+
+ The prompt MUST be the full XML block below. The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered in step 1.5.
+
+ ```
+ Agent(
+ subagent_type: "harness-evolver-proposer",
+ description: "Propose harness {version}",
+ prompt: |
+ <objective>
+ Propose harness version {version} that improves on the current best score of {best_score}.
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/summary.json
+ - .harness-evolver/PROPOSER_HISTORY.md
+ - .harness-evolver/config.json
+ - .harness-evolver/baseline/harness.py
+ - .harness-evolver/harnesses/{best_version}/harness.py
+ - .harness-evolver/harnesses/{best_version}/scores.json
+ - .harness-evolver/harnesses/{best_version}/proposal.md
+ - .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
+ - .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
+ - .harness-evolver/context7_docs.md (if exists — current library documentation)
+ - .harness-evolver/architecture.json (if exists — architect topology recommendation)
+ </files_to_read>
+
+ <output>
+ Create directory .harness-evolver/harnesses/{version}/ containing:
+ - harness.py (the improved harness)
+ - config.json (parameters, copy from parent if unchanged)
+ - proposal.md (reasoning, must start with "Based on v{PARENT}")
+ </output>
+
+ <success_criteria>
+ - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
+ - proposal.md documents evidence-based reasoning
+ - Changes are motivated by trace analysis (LangSmith data if available), not guesswork
+ - If context7_docs.md was provided, API usage must match current documentation
+ </success_criteria>
+ )
  ```

- Wait for the agent to complete. Look for `## PROPOSAL COMPLETE` in the response.
+ Wait for `## PROPOSAL COMPLETE` in the response. The subagent gets a fresh context window and loads the agent definition from `~/.claude/agents/harness-evolver-proposer.md`.

  ### 3. Validate

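The nested-`.get` check in the step-1.5 pipeline above reduces to one small function. A minimal Python sketch (the `eval.langsmith.enabled` shape comes from the command shown; the function name is illustrative):

```python
import json

def langsmith_enabled(config: dict) -> bool:
    # Chained .get() calls: a missing "eval" or "langsmith" section
    # falls through to the False default, same as the shell one-liner.
    return bool(config.get("eval", {}).get("langsmith", {}).get("enabled", False))

# The orchestrator would feed it the parsed config, e.g.:
# langsmith_enabled(json.load(open(".harness-evolver/config.json")))
```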
@@ -106,26 +152,78 @@ python3 $TOOLS/state.py update \

  Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`

- ### 6.5. Check for Eval Gaming
+ ### 6.5. Auto-trigger Critic (on eval gaming)

- After updating state, read the latest `summary.json` and check:
+ Read `summary.json` and check:
  - Did the score jump >0.3 from parent version?
  - Did we reach 1.0 in fewer than 3 total iterations?

- If either is true, warn:
+ If EITHER is true, **AUTO-SPAWN the critic agent** (do not just suggest — actually spawn it):

- > Suspicious convergence detected: score jumped from {parent_score} to {score} in one iteration.
- > The eval may be too lenient. Run `/harness-evolver:critic` to analyze eval quality.
+ ```bash
+ python3 $TOOLS/evaluate.py run \
+ --harness .harness-evolver/harnesses/{version}/harness.py \
+ --tasks-dir .harness-evolver/eval/tasks/ \
+ --eval .harness-evolver/eval/eval.py \
+ --traces-dir /tmp/critic-check/ \
+ --scores /tmp/critic-check-scores.json \
+ --timeout 60
+ ```
+
+ Dispatch the critic subagent using the **Agent tool** with `subagent_type: "harness-evolver-critic"`:
+
+ ```
+ Agent(
+ subagent_type: "harness-evolver-critic",
+ description: "Critic: analyze eval quality",
+ prompt: |
+ <objective>
+ EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
+ Analyze the eval quality and propose a stricter eval.
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/eval/eval.py
+ - .harness-evolver/summary.json
+ - .harness-evolver/harnesses/{version}/scores.json
+ - .harness-evolver/harnesses/{version}/harness.py
+ - .harness-evolver/harnesses/{version}/proposal.md
+ - .harness-evolver/config.json
+ - .harness-evolver/langsmith_stats.json (if exists)
+ </files_to_read>
+
+ <output>
+ Write:
+ - .harness-evolver/critic_report.md
+ - .harness-evolver/eval/eval_improved.py (if weaknesses found)
+ </output>
+
+ <success_criteria>
+ - Identifies specific weaknesses in eval.py with task/output examples
+ - If gaming detected, shows exact tasks that expose the weakness
+ - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+ - Re-scores the best version with improved eval to show the difference
+ </success_criteria>
+ )
+ ```
+
+ Wait for `## CRITIC REPORT COMPLETE`.

- If score is 1.0 and iterations < 3, STOP the loop and strongly recommend the critic:
+ If the critic wrote `eval_improved.py`:
+ - Re-score the best harness with the improved eval
+ - Show the score difference (e.g., "Current eval: 1.0. Improved eval: 0.45")
+ - **AUTO-ADOPT the improved eval**: copy `eval_improved.py` to `eval/eval.py`
+ - Re-run the baseline with the new eval and update `summary.json`
+ - Print: "Eval upgraded. Resuming evolution with stricter eval."
+ - **Continue the loop** with the new eval

- > Perfect score reached in only {iterations} iteration(s). This usually indicates
- > the eval is too easy, not that the harness is perfect. Run `/harness-evolver:critic`
- > before continuing.
+ If the critic did NOT write `eval_improved.py` (the eval is fine):
+ - Print the critic's assessment
+ - Continue the loop normally

  ### 7. Auto-trigger Architect (on stagnation or regression)

- Check if the architect should be auto-spawned. This happens when:
+ Check if the architect should be auto-spawned:
  - **Stagnation**: 3 consecutive iterations within 1% of each other
  - **Regression**: score dropped below parent score (even once)

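The critic trigger in step 6.5 and the architect trigger in step 7 are both simple comparisons over the score history. A sketch under the thresholds stated above (function names, and approximating "parent" as the previous iteration, are illustrative assumptions):

```python
def gaming_suspected(score: float, parent_score: float, iterations: int) -> bool:
    # Step 6.5 triggers: score jumped >0.3 from the parent version,
    # or a perfect 1.0 reached in fewer than 3 total iterations.
    return (score - parent_score) > 0.3 or (score >= 1.0 and iterations < 3)

def architect_needed(scores: list) -> bool:
    # Step 7 triggers: regression (any score below its predecessor, even once),
    # or stagnation (last 3 scores within 1%, read here as a 0.01 absolute spread).
    regressed = any(b < a for a, b in zip(scores, scores[1:]))
    recent = scores[-3:]
    stagnated = len(recent) == 3 and max(recent) - min(recent) <= 0.01
    return regressed or stagnated
```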
@@ -141,57 +239,59 @@ python3 $TOOLS/analyze_architecture.py \
  -o .harness-evolver/architecture_signals.json
  ```

- Then spawn the `harness-evolver-architect` agent:
-
- ```xml
- <objective>
- The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
- Analyze the harness architecture and recommend a topology change.
- </objective>
-
- <files_to_read>
- - .harness-evolver/architecture_signals.json
- - .harness-evolver/summary.json
- - .harness-evolver/PROPOSER_HISTORY.md
- - .harness-evolver/config.json
- - .harness-evolver/harnesses/{best_version}/harness.py
- - .harness-evolver/harnesses/{best_version}/scores.json
- </files_to_read>
+ Dispatch the architect subagent using the **Agent tool** with `subagent_type: "harness-evolver-architect"`:

- <output>
- Write:
- - .harness-evolver/architecture.json (structured recommendation)
- - .harness-evolver/architecture.md (human-readable analysis)
- </output>
-
- <success_criteria>
- - Recommendation includes concrete migration steps
- - Each step is implementable in one proposer iteration
- - Considers detected stack and available API keys
- </success_criteria>
+ ```
+ Agent(
+ subagent_type: "harness-evolver-architect",
+ description: "Architect: analyze topology after {stagnation/regression}",
+ prompt: |
+ <objective>
+ The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
+ Analyze the harness architecture and recommend a topology change.
+ </objective>
+
+ <files_to_read>
+ - .harness-evolver/architecture_signals.json
+ - .harness-evolver/summary.json
+ - .harness-evolver/PROPOSER_HISTORY.md
+ - .harness-evolver/config.json
+ - .harness-evolver/harnesses/{best_version}/harness.py
+ - .harness-evolver/harnesses/{best_version}/scores.json
+ - .harness-evolver/context7_docs.md (if exists)
+ </files_to_read>
+
+ <output>
+ Write:
+ - .harness-evolver/architecture.json (structured recommendation)
+ - .harness-evolver/architecture.md (human-readable analysis)
+ </output>
+
+ <success_criteria>
+ - Recommendation includes concrete migration steps
+ - Each step is implementable in one proposer iteration
+ - Considers detected stack and available API keys
+ </success_criteria>
+ )
  ```

  Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.

- After the architect completes, report:
-
- > Architect recommends: {current} → {recommended} ({confidence} confidence)
- > Migration path: {N} steps. Continuing evolution with architecture guidance.
-
- Then **continue the loop** — the proposer will read `architecture.json` in the next iteration.
+ Report: `Architect recommends: {current} → {recommended} ({confidence} confidence)`

- If `architecture.json` already exists (architect already ran), skip — don't re-run.
+ Then **continue the loop** — the proposer reads `architecture.json` in the next iteration.

  ### 8. Check Stop Conditions

  - **Target**: `combined_score >= target_score` → stop
  - **N reached**: done
- - **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop (architecture change didn't help)
+ - **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop

  ## When Loop Ends — Final Report

  - Best version and score
  - Improvement over baseline (absolute and %)
  - Total iterations run
+ - Whether critic was triggered and eval was upgraded
  - Whether architect was triggered and what it recommended
  - Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."
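The three stop conditions in step 8 above can be sketched as a single predicate. A sketch only: `architect_ran_at` (the iteration count when the architect ran) and using `max(scores)` as the best combined score are illustrative assumptions, not the tool's actual state fields:

```python
def should_stop(scores, target_score, max_iterations, architect_ran_at=None):
    if scores and max(scores) >= target_score:
        return True  # target reached
    if len(scores) >= max_iterations:
        return True  # N reached
    # Stagnation post-architect: 3 more iterations after the architect
    # ran without improving on the pre-architect best.
    if architect_ran_at and len(scores) >= architect_ran_at + 3:
        if max(scores[architect_ran_at:]) <= max(scores[:architect_ran_at]):
            return True
    return False
```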