harness-evolver 1.4.0 → 1.6.0

@@ -4,6 +4,7 @@ description: |
  Use this agent to analyze harness architecture and recommend optimal multi-agent topology.
  Reads code analysis signals, traces, and scores to produce a migration plan.
  tools: Read, Write, Bash, Grep, Glob
+ color: blue
  ---
 
  ## Bootstrap
@@ -4,6 +4,7 @@ description: |
  Use this agent to assess eval quality, detect eval gaming, and propose stricter evaluation.
  Triggered when scores converge suspiciously fast or on user request.
  tools: Read, Write, Bash, Grep, Glob
+ color: red
  ---
 
  ## Bootstrap
@@ -4,6 +4,7 @@ description: |
  Use this agent when the evolve skill needs to propose a new harness candidate.
  Navigates the .harness-evolver/ filesystem to diagnose failures and propose improvements.
  tools: Read, Write, Edit, Bash, Glob, Grep
+ color: green
  permissionMode: acceptEdits
  ---
 
@@ -12,6 +13,24 @@ permissionMode: acceptEdits
  If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
  every file listed there before performing any other actions. These files are your context.
 
+ ## Context7 — Enrich Your Knowledge
+
+ You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
+
+ **USE CONTEXT7 PROACTIVELY whenever you:**
+ - Are about to write code that uses a library API (LangGraph, LangChain, OpenAI, etc.)
+ - Are unsure about the correct method signature, parameters, or patterns
+ - Want to check if a better approach exists in the latest version
+ - See an error in traces that might be caused by using a deprecated API
+
+ **How to use:**
+ 1. `resolve-library-id` with the library name (e.g., "langchain", "langgraph")
+ 2. `get-library-docs` with a specific query (e.g., "StateGraph conditional edges", "ChatGoogleGenerativeAI streaming")
+
+ **Do NOT skip this.** Your training data may be outdated; Context7 gives you the current docs. Even if you're confident about an API, a quick check takes seconds and prevents proposing deprecated patterns.
+
+ If Context7 is not available, proceed with model knowledge but note in `proposal.md`: "API not verified against current docs."
+
  ## Return Protocol
 
  When done, end your response with:
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "1.4.0",
+ "version": "1.6.0",
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
  "author": "Raphael Valdetaro",
  "license": "MIT",
@@ -48,35 +48,40 @@ python3 $TOOLS/analyze_architecture.py \
    -o .harness-evolver/architecture_signals.json
  ```
 
- 3. Spawn the `harness-evolver-architect` agent:
-
- ```xml
- <objective>
- Analyze the harness architecture and recommend the optimal multi-agent topology.
- {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
- {If called by user: "The user requested an architecture analysis."}
- </objective>
-
- <files_to_read>
- - .harness-evolver/architecture_signals.json
- - .harness-evolver/config.json
- - .harness-evolver/baseline/harness.py
- - .harness-evolver/summary.json (if exists)
- - .harness-evolver/PROPOSER_HISTORY.md (if exists)
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/architecture.json
- - .harness-evolver/architecture.md
- </output>
-
- <success_criteria>
- - Classifies current topology correctly
- - Recommendation includes migration path with concrete steps
- - Considers detected stack and API key availability
- - Confidence rating is honest (low/medium/high)
- </success_criteria>
+ 3. Dispatch subagent using the **Agent tool** with `subagent_type: "harness-evolver-architect"`:
+
+ ```
+ Agent(
+   subagent_type: "harness-evolver-architect",
+   description: "Architect: topology analysis",
+   prompt: |
+     <objective>
+     Analyze the harness architecture and recommend the optimal multi-agent topology.
+     {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
+     {If called by user: "The user requested an architecture analysis."}
+     </objective>
+
+     <files_to_read>
+     - .harness-evolver/architecture_signals.json
+     - .harness-evolver/config.json
+     - .harness-evolver/baseline/harness.py
+     - .harness-evolver/summary.json (if exists)
+     - .harness-evolver/PROPOSER_HISTORY.md (if exists)
+     </files_to_read>
+
+     <output>
+     Write:
+     - .harness-evolver/architecture.json
+     - .harness-evolver/architecture.md
+     </output>
+
+     <success_criteria>
+     - Classifies current topology correctly
+     - Recommendation includes migration path with concrete steps
+     - Considers detected stack and API key availability
+     - Confidence rating is honest (low/medium/high)
+     </success_criteria>
+ )
  ```
 
  4. Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
@@ -22,36 +22,42 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo
 
  1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).
 
- 2. Spawn the `harness-evolver-critic` agent:
-
- ```xml
- <objective>
- Analyze eval quality for this harness evolution project.
- The best version is {version} with score {score} achieved in {iterations} iteration(s).
- {Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
- </objective>
-
- <files_to_read>
- - .harness-evolver/eval/eval.py
- - .harness-evolver/summary.json
- - .harness-evolver/harnesses/{best_version}/scores.json
- - .harness-evolver/harnesses/{best_version}/harness.py
- - .harness-evolver/harnesses/{best_version}/proposal.md
- - .harness-evolver/config.json
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/critic_report.md (human-readable analysis)
- - .harness-evolver/eval/eval_improved.py (if weaknesses found)
- </output>
-
- <success_criteria>
- - Identifies specific weaknesses in eval.py with examples
- - If gaming detected, shows exact tasks/outputs that expose the weakness
- - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
- - Re-scores the best version with improved eval to quantify the difference
- </success_criteria>
+ 2. Dispatch subagent using the **Agent tool** with `subagent_type: "harness-evolver-critic"`:
+
+ ```
+ Agent(
+   subagent_type: "harness-evolver-critic",
+   description: "Critic: analyze eval quality",
+   prompt: |
+     <objective>
+     Analyze eval quality for this harness evolution project.
+     The best version is {version} with score {score} achieved in {iterations} iteration(s).
+     {Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
+     </objective>
+
+     <files_to_read>
+     - .harness-evolver/eval/eval.py
+     - .harness-evolver/summary.json
+     - .harness-evolver/harnesses/{best_version}/scores.json
+     - .harness-evolver/harnesses/{best_version}/harness.py
+     - .harness-evolver/harnesses/{best_version}/proposal.md
+     - .harness-evolver/config.json
+     - .harness-evolver/langsmith_stats.json (if exists)
+     </files_to_read>
+
+     <output>
+     Write:
+     - .harness-evolver/critic_report.md (human-readable analysis)
+     - .harness-evolver/eval/eval_improved.py (if weaknesses found)
+     </output>
+
+     <success_criteria>
+     - Identifies specific weaknesses in eval.py with examples
+     - If gaming detected, shows exact tasks/outputs that expose the weakness
+     - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+     - Re-scores the best version with improved eval to quantify the difference
+     </success_criteria>
+ )
  ```
 
  3. Wait for `## CRITIC REPORT COMPLETE`.
@@ -34,81 +34,66 @@ For each iteration:
  python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
  ```
 
- ### 1.5. Gather Diagnostic Context (LangSmith + Context7)
+ ### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
 
- **This step is MANDATORY before every propose.** The orchestrator gathers data so the proposer receives it as files.
+ **Run these commands unconditionally after EVERY evaluation** (including baseline). If langsmith-cli is not installed or there are no runs, the commands fail silently; that's fine. But you MUST attempt them.
 
- **LangSmith (if enabled):**
-
- Check if LangSmith is enabled and langsmith-cli is available:
- ```bash
- cat .harness-evolver/config.json | python3 -c "import sys,json; print(json.load(sys.stdin).get('eval',{}).get('langsmith',{}).get('enabled',False))"
- which langsmith-cli 2>/dev/null
- ```
-
- If BOTH are true AND at least one iteration has run, gather LangSmith data:
  ```bash
- langsmith-cli --json runs list --project harness-evolver-{best_version} --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
+ langsmith-cli --json runs list --project harness-evolver-{last_evaluated_version} --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
 
- langsmith-cli --json runs stats --project harness-evolver-{best_version} > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
+ langsmith-cli --json runs stats --project harness-evolver-{last_evaluated_version} > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
  ```
 
- **Context7 (if available):**
+ For the first iteration, use `baseline` as the version. For subsequent iterations, use the latest evaluated version.
 
- Check `config.json` field `stack.detected`. For each detected library, use the Context7 MCP tools to fetch relevant documentation:
-
- ```
- For each library in stack.detected:
- 1. resolve-library-id with the context7_id
- 2. get-library-docs with a query relevant to the current failure modes
- 3. Save output to .harness-evolver/context7_docs.md (append each library's docs)
- ```
+ These files are included in the proposer's `<files_to_read>` so it has real trace data for diagnosis.
 
- This runs ONCE per iteration, not per library. Focus on the library most relevant to the current failures.
+ ### 2. Propose
 
- If Context7 MCP is not available, skip silently.
+ Dispatch a subagent using the **Agent tool** with `subagent_type: "harness-evolver-proposer"`.
 
- ### 2. Propose
+ The prompt MUST be the full XML block below. The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered in step 1.5.
 
- Spawn the `harness-evolver-proposer` agent with a structured prompt.
-
- The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered:
-
- ```xml
- <objective>
- Propose harness version {version} that improves on the current best score of {best_score}.
- </objective>
-
- <files_to_read>
- - .harness-evolver/summary.json
- - .harness-evolver/PROPOSER_HISTORY.md
- - .harness-evolver/config.json
- - .harness-evolver/baseline/harness.py
- - .harness-evolver/harnesses/{best_version}/harness.py
- - .harness-evolver/harnesses/{best_version}/scores.json
- - .harness-evolver/harnesses/{best_version}/proposal.md
- - .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
- - .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
- - .harness-evolver/context7_docs.md (if exists — current library documentation)
- - .harness-evolver/architecture.json (if exists — architect topology recommendation)
- </files_to_read>
-
- <output>
- Create directory .harness-evolver/harnesses/{version}/ containing:
- - harness.py (the improved harness)
- - config.json (parameters, copy from parent if unchanged)
- - proposal.md (reasoning, must start with "Based on v{PARENT}")
- </output>
-
- <success_criteria>
- - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
- - proposal.md documents evidence-based reasoning
- - Changes are motivated by trace analysis (LangSmith data if available), not guesswork
- - If context7_docs.md was provided, API usage must match current documentation
- </success_criteria>
+ ```
+ Agent(
+   subagent_type: "harness-evolver-proposer",
+   description: "Propose harness {version}",
+   prompt: |
+     <objective>
+     Propose harness version {version} that improves on the current best score of {best_score}.
+     </objective>
+
+     <files_to_read>
+     - .harness-evolver/summary.json
+     - .harness-evolver/PROPOSER_HISTORY.md
+     - .harness-evolver/config.json
+     - .harness-evolver/baseline/harness.py
+     - .harness-evolver/harnesses/{best_version}/harness.py
+     - .harness-evolver/harnesses/{best_version}/scores.json
+     - .harness-evolver/harnesses/{best_version}/proposal.md
+     - .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
+     - .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
+     - .harness-evolver/context7_docs.md (if exists — current library documentation)
+     - .harness-evolver/architecture.json (if exists — architect topology recommendation)
+     </files_to_read>
+
+     <output>
+     Create directory .harness-evolver/harnesses/{version}/ containing:
+     - harness.py (the improved harness)
+     - config.json (parameters, copy from parent if unchanged)
+     - proposal.md (reasoning, must start with "Based on v{PARENT}")
+     </output>
+
+     <success_criteria>
+     - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
+     - proposal.md documents evidence-based reasoning
+     - Changes are motivated by trace analysis (LangSmith data if available), not guesswork
+     - If context7_docs.md was provided, API usage must match current documentation
+     </success_criteria>
+ )
  ```
 
- Wait for the agent to complete. Look for `## PROPOSAL COMPLETE` in the response.
+ Wait for `## PROPOSAL COMPLETE` in the response. The subagent gets a fresh context window and loads the agent definition from `~/.claude/agents/harness-evolver-proposer.md`.
 
  ### 3. Validate
 
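The version-id one-liner at the start of the hunk above can be exercised standalone. Below is a minimal sketch, assuming a summary.json whose `iterations` field counts completed iterations (the file contents here are illustrative, not the package's actual data):

```python
import json
import os
import tempfile

# Stand-in summary.json (hypothetical contents for illustration).
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "summary.json")
with open(path, "w") as f:
    json.dump({"iterations": 4, "best_score": 0.72}, f)

# Same expression as the skill's one-liner: next version id, zero-padded to 3 digits.
s = json.load(open(path))
version = f"v{s['iterations'] + 1:03d}"
print(version)  # → v005
```

The `:03d` format spec is what keeps version directories lexicographically sortable (v001, v002, ..., v010).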
@@ -165,36 +150,41 @@ python3 $TOOLS/evaluate.py run \
    --timeout 60
  ```
 
- Spawn the `harness-evolver-critic` agent:
-
- ```xml
- <objective>
- EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
- Analyze the eval quality and propose a stricter eval.
- </objective>
-
- <files_to_read>
- - .harness-evolver/eval/eval.py
- - .harness-evolver/summary.json
- - .harness-evolver/harnesses/{version}/scores.json
- - .harness-evolver/harnesses/{version}/harness.py
- - .harness-evolver/harnesses/{version}/proposal.md
- - .harness-evolver/config.json
- - .harness-evolver/langsmith_stats.json (if exists)
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/critic_report.md
- - .harness-evolver/eval/eval_improved.py (if weaknesses found)
- </output>
-
- <success_criteria>
- - Identifies specific weaknesses in eval.py with task/output examples
- - If gaming detected, shows exact tasks that expose the weakness
- - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
- - Re-scores the best version with improved eval to show the difference
- </success_criteria>
+ Dispatch critic subagent using the **Agent tool** with `subagent_type: "harness-evolver-critic"`:
+
+ ```
+ Agent(
+   subagent_type: "harness-evolver-critic",
+   description: "Critic: analyze eval quality",
+   prompt: |
+     <objective>
+     EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
+     Analyze the eval quality and propose a stricter eval.
+     </objective>
+
+     <files_to_read>
+     - .harness-evolver/eval/eval.py
+     - .harness-evolver/summary.json
+     - .harness-evolver/harnesses/{version}/scores.json
+     - .harness-evolver/harnesses/{version}/harness.py
+     - .harness-evolver/harnesses/{version}/proposal.md
+     - .harness-evolver/config.json
+     - .harness-evolver/langsmith_stats.json (if exists)
+     </files_to_read>
+
+     <output>
+     Write:
+     - .harness-evolver/critic_report.md
+     - .harness-evolver/eval/eval_improved.py (if weaknesses found)
+     </output>
+
+     <success_criteria>
+     - Identifies specific weaknesses in eval.py with task/output examples
+     - If gaming detected, shows exact tasks that expose the weakness
+     - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+     - Re-scores the best version with improved eval to show the difference
+     </success_criteria>
+ )
  ```
 
  Wait for `## CRITIC REPORT COMPLETE`.
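The langsmith-cli commands in the evolve skill rely on a fail-silent shell idiom: stderr is discarded and an empty JSON document is written on failure, so downstream readers always find parseable JSON. A minimal sketch of the idiom, using a deliberately nonexistent command to stand in for an unavailable CLI:

```shell
# "nonexistent-cli" stands in for langsmith-cli being absent or failing.
# The fallback writes an empty JSON object so later steps can still parse it.
nonexistent-cli --json runs stats > stats.json 2>/dev/null || echo "{}" > stats.json
cat stats.json  # prints {}
```

The redirection is set up before the command runs, so even the shell's "command not found" message lands in /dev/null, and the non-zero exit status triggers the `||` fallback.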
@@ -229,35 +219,40 @@ python3 $TOOLS/analyze_architecture.py \
    -o .harness-evolver/architecture_signals.json
  ```
 
- Spawn the `harness-evolver-architect` agent:
-
- ```xml
- <objective>
- The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
- Analyze the harness architecture and recommend a topology change.
- </objective>
-
- <files_to_read>
- - .harness-evolver/architecture_signals.json
- - .harness-evolver/summary.json
- - .harness-evolver/PROPOSER_HISTORY.md
- - .harness-evolver/config.json
- - .harness-evolver/harnesses/{best_version}/harness.py
- - .harness-evolver/harnesses/{best_version}/scores.json
- - .harness-evolver/context7_docs.md (if exists)
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/architecture.json (structured recommendation)
- - .harness-evolver/architecture.md (human-readable analysis)
- </output>
-
- <success_criteria>
- - Recommendation includes concrete migration steps
- - Each step is implementable in one proposer iteration
- - Considers detected stack and available API keys
- </success_criteria>
+ Dispatch architect subagent using the **Agent tool** with `subagent_type: "harness-evolver-architect"`:
+
+ ```
+ Agent(
+   subagent_type: "harness-evolver-architect",
+   description: "Architect: analyze topology after {stagnation/regression}",
+   prompt: |
+     <objective>
+     The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
+     Analyze the harness architecture and recommend a topology change.
+     </objective>
+
+     <files_to_read>
+     - .harness-evolver/architecture_signals.json
+     - .harness-evolver/summary.json
+     - .harness-evolver/PROPOSER_HISTORY.md
+     - .harness-evolver/config.json
+     - .harness-evolver/harnesses/{best_version}/harness.py
+     - .harness-evolver/harnesses/{best_version}/scores.json
+     - .harness-evolver/context7_docs.md (if exists)
+     </files_to_read>
+
+     <output>
+     Write:
+     - .harness-evolver/architecture.json (structured recommendation)
+     - .harness-evolver/architecture.md (human-readable analysis)
+     </output>
+
+     <success_criteria>
+     - Recommendation includes concrete migration steps
+     - Each step is implementable in one proposer iteration
+     - Considers detected stack and available API keys
+     </success_criteria>
+ )
  ```
 
  Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
package/tools/init.py CHANGED
@@ -298,10 +298,26 @@ def main():
      print(" Recommendation: install langsmith-cli for rich trace analysis:")
      print(" uv tool install langsmith-cli && langsmith-cli auth login")
 
-     # Detect stack
-     stack = _detect_stack(args.harness)
+     # Detect stack — try original harness first, then baseline copy, then scan entire source dir
+     stack = _detect_stack(os.path.abspath(args.harness))
+     if not stack:
+         stack = _detect_stack(os.path.join(base, "baseline", "harness.py"))
+     if not stack:
+         # Scan the original directory for any .py files with known imports
+         harness_dir = os.path.dirname(os.path.abspath(args.harness))
+         detect_stack_py = os.path.join(os.path.dirname(__file__), "detect_stack.py")
+         if os.path.exists(detect_stack_py):
+             try:
+                 r = subprocess.run(
+                     ["python3", detect_stack_py, harness_dir],
+                     capture_output=True, text=True, timeout=30,
+                 )
+                 if r.returncode == 0 and r.stdout.strip():
+                     stack = json.loads(r.stdout)
+             except Exception:
+                 pass
      config["stack"] = {
-         "detected": stack,
+         "detected": stack if stack else {},
          "documentation_hint": "use context7",
          "auto_detected": True,
      }
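The new fallback path above shells out to a `detect_stack.py` helper whose source is not shown in this diff. The following is a hypothetical sketch of what such a scanner could look like (the shipped helper's library mapping and output shape may differ): it walks a directory's `.py` files for known import roots and prints a JSON object on stdout, matching the contract `init.py` relies on.

```python
import json
import re
import sys
from pathlib import Path

# Hypothetical import-root -> library-id mapping; the shipped
# detect_stack.py may use different names and ids.
KNOWN_LIBS = {"langgraph": "langgraph", "langchain": "langchain", "openai": "openai"}

# Matches the first dotted component of "import x" / "from x import y" lines.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_]\w*)", re.MULTILINE)

def detect_stack(directory: str) -> dict:
    """Return {import_root: library_id} for known imports found under directory."""
    found = {}
    for py_file in Path(directory).rglob("*.py"):
        try:
            text = py_file.read_text(errors="ignore")
        except OSError:
            continue
        for match in IMPORT_RE.finditer(text):
            root = match.group(1)
            if root in KNOWN_LIBS:
                found[root] = KNOWN_LIBS[root]
    return found

if __name__ == "__main__" and len(sys.argv) > 1:
    # Same contract init.py relies on: directory as argv[1], JSON on stdout.
    print(json.dumps(detect_stack(sys.argv[1])))
```

Printing JSON on stdout (and only JSON) is what lets the `init.py` caller do `json.loads(r.stdout)` directly; any diagnostics would have to go to stderr.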