harness-evolver 1.3.0 → 1.4.0

package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "harness-evolver",
-  "version": "1.3.0",
+  "version": "1.4.0",
   "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
   "author": "Raphael Valdetaro",
   "license": "MIT",
@@ -5,7 +5,7 @@ argument-hint: "[--iterations N]"
 allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
 ---
 
-# /harness-evolve
+# /harness-evolver:evolve
 
 Run the autonomous propose-evaluate-iterate loop.
 
@@ -34,9 +34,45 @@ For each iteration:
 python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
 ```
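The one-liner in the hunk above derives the next version label by zero-padding the iteration count; a standalone Python sketch of the same computation (the sample `summary.json` contents are assumed for illustration):

```python
import io
import json

# Sample summary.json payload (assumed shape: an "iterations" counter).
summary = io.StringIO('{"iterations": 6}')

# Next version label: iteration count + 1, zero-padded to three digits.
s = json.load(summary)
print(f"v{s['iterations'] + 1:03d}")  # prints v007
```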
 
+### 1.5. Gather Diagnostic Context (LangSmith + Context7)
+
+**This step is MANDATORY before every propose.** The orchestrator gathers data so the proposer receives it as files.
+
+**LangSmith (if enabled):**
+
+Check if LangSmith is enabled and langsmith-cli is available:
+```bash
+cat .harness-evolver/config.json | python3 -c "import sys,json; print(json.load(sys.stdin).get('eval',{}).get('langsmith',{}).get('enabled',False))"
+which langsmith-cli 2>/dev/null
+```
+
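The chained `.get()` calls in the enablement check above fall back to an empty dict at each missing level, so a config without an `eval` or `langsmith` section yields `False` rather than raising `KeyError`; a sketch with hypothetical config payloads:

```python
import json

def langsmith_enabled(config: dict) -> bool:
    # Mirrors the one-liner: every missing level defaults to {} or False.
    return config.get("eval", {}).get("langsmith", {}).get("enabled", False)

# Hypothetical config.json contents.
on = json.loads('{"eval": {"langsmith": {"enabled": true}}}')
off = json.loads('{"stack": {"detected": []}}')

print(langsmith_enabled(on))   # True
print(langsmith_enabled(off))  # False
```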
+If BOTH are true AND at least one iteration has run, gather LangSmith data:
+```bash
+langsmith-cli --json runs list --project harness-evolver-{best_version} --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
+
+langsmith-cli --json runs stats --project harness-evolver-{best_version} > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
+```
+
+**Context7 (if available):**
+
+Check `config.json` field `stack.detected`. For each detected library, use the Context7 MCP tools to fetch relevant documentation:
+
+```
+For each library in stack.detected:
+1. resolve-library-id with the context7_id
+2. get-library-docs with a query relevant to the current failure modes
+3. Save output to .harness-evolver/context7_docs.md (append each library's docs)
+```
+
+This runs ONCE per iteration, not per library. Focus on the library most relevant to the current failures.
+
+If Context7 MCP is not available, skip silently.
+
 ### 2. Propose
 
-Spawn the `harness-evolver-proposer` agent with a structured prompt:
+Spawn the `harness-evolver-proposer` agent with a structured prompt.
+
+The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered:
 
 ```xml
 <objective>
@@ -51,6 +87,10 @@ Propose harness version {version} that improves on the current best score of {be
 - .harness-evolver/harnesses/{best_version}/harness.py
 - .harness-evolver/harnesses/{best_version}/scores.json
 - .harness-evolver/harnesses/{best_version}/proposal.md
+- .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
+- .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
+- .harness-evolver/context7_docs.md (if exists — current library documentation)
+- .harness-evolver/architecture.json (if exists — architect topology recommendation)
 </files_to_read>
 
 <output>
@@ -63,7 +103,8 @@ Create directory .harness-evolver/harnesses/{version}/ containing:
 <success_criteria>
 - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
 - proposal.md documents evidence-based reasoning
-- Changes are motivated by trace analysis, not guesswork
+- Changes are motivated by trace analysis (LangSmith data if available), not guesswork
+- If context7_docs.md was provided, API usage must match current documentation
 </success_criteria>
 ```
 
@@ -106,26 +147,73 @@ python3 $TOOLS/state.py update \
 
 Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
 
-### 6.5. Check for Eval Gaming
+### 6.5. Auto-trigger Critic (on eval gaming)
 
-After updating state, read the latest `summary.json` and check:
+Read `summary.json` and check:
 - Did the score jump >0.3 from parent version?
 - Did we reach 1.0 in fewer than 3 total iterations?
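The two checks above can be sketched as a single predicate (the function name and call sites are illustrative, not part of the package):

```python
def should_spawn_critic(score: float, parent_score: float, iterations: int) -> bool:
    # Trigger 1: score jumped more than 0.3 past the parent version.
    jumped = (score - parent_score) > 0.3
    # Trigger 2: a perfect score reached in fewer than 3 total iterations.
    too_fast = score >= 1.0 and iterations < 3
    return jumped or too_fast

print(should_spawn_critic(0.95, 0.5, 5))   # True  (jump of 0.45)
print(should_spawn_critic(1.0, 0.9, 2))    # True  (perfect score, 2 iterations)
print(should_spawn_critic(0.7, 0.65, 4))   # False
```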
 
-If either is true, warn:
+If EITHER is true, **AUTO-SPAWN the critic agent** (do not just suggest — actually spawn it):
+
+```bash
+python3 $TOOLS/evaluate.py run \
+  --harness .harness-evolver/harnesses/{version}/harness.py \
+  --tasks-dir .harness-evolver/eval/tasks/ \
+  --eval .harness-evolver/eval/eval.py \
+  --traces-dir /tmp/critic-check/ \
+  --scores /tmp/critic-check-scores.json \
+  --timeout 60
+```
+
+Spawn the `harness-evolver-critic` agent:
169
 
117
- > Suspicious convergence detected: score jumped from {parent_score} to {score} in one iteration.
118
- > The eval may be too lenient. Run `/harness-evolver:critic` to analyze eval quality.
170
+ ```xml
171
+ <objective>
172
+ EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
173
+ Analyze the eval quality and propose a stricter eval.
174
+ </objective>
119
175
 
120
- If score is 1.0 and iterations < 3, STOP the loop and strongly recommend the critic:
176
+ <files_to_read>
177
+ - .harness-evolver/eval/eval.py
178
+ - .harness-evolver/summary.json
179
+ - .harness-evolver/harnesses/{version}/scores.json
180
+ - .harness-evolver/harnesses/{version}/harness.py
181
+ - .harness-evolver/harnesses/{version}/proposal.md
182
+ - .harness-evolver/config.json
183
+ - .harness-evolver/langsmith_stats.json (if exists)
184
+ </files_to_read>
121
185
 
122
- > Perfect score reached in only {iterations} iteration(s). This usually indicates
123
- > the eval is too easy, not that the harness is perfect. Run `/harness-evolver:critic`
124
- > before continuing.
186
+ <output>
187
+ Write:
188
+ - .harness-evolver/critic_report.md
189
+ - .harness-evolver/eval/eval_improved.py (if weaknesses found)
190
+ </output>
191
+
192
+ <success_criteria>
193
+ - Identifies specific weaknesses in eval.py with task/output examples
194
+ - If gaming detected, shows exact tasks that expose the weakness
195
+ - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
196
+ - Re-scores the best version with improved eval to show the difference
197
+ </success_criteria>
198
+ ```
199
+
200
+ Wait for `## CRITIC REPORT COMPLETE`.
201
+
202
+ If critic wrote `eval_improved.py`:
203
+ - Re-score the best harness with the improved eval
204
+ - Show the score difference (e.g., "Current eval: 1.0. Improved eval: 0.45")
205
+ - **AUTO-ADOPT the improved eval**: copy `eval_improved.py` to `eval/eval.py`
206
+ - Re-run baseline with new eval and update `summary.json`
207
+ - Print: "Eval upgraded. Resuming evolution with stricter eval."
208
+ - **Continue the loop** with the new eval
209
+
210
+ If critic did NOT write `eval_improved.py` (eval is fine):
211
+ - Print the critic's assessment
212
+ - Continue the loop normally
125
213
 
 ### 7. Auto-trigger Architect (on stagnation or regression)
 
-Check if the architect should be auto-spawned. This happens when:
+Check if the architect should be auto-spawned:
 - **Stagnation**: 3 consecutive iterations within 1% of each other
 - **Regression**: score dropped below parent score (even once)
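The two triggers can be sketched over a score history; a minimal illustration in which the function names are assumptions and "within 1%" is read as an absolute 0.01 tolerance:

```python
def is_stagnant(scores: list[float], window: int = 3, tol: float = 0.01) -> bool:
    # Stagnation: the last `window` scores all fall within `tol` of each other.
    if len(scores) < window:
        return False
    recent = scores[-window:]
    return max(recent) - min(recent) <= tol

def regressed(score: float, parent_score: float) -> bool:
    # Regression: any drop below the parent version's score.
    return score < parent_score

print(is_stagnant([0.50, 0.71, 0.712, 0.705]))  # True: last three within 0.01
print(regressed(0.62, 0.68))                    # True: dropped below parent
```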
 
@@ -141,7 +229,7 @@ python3 $TOOLS/analyze_architecture.py \
   -o .harness-evolver/architecture_signals.json
 ```
 
-Then spawn the `harness-evolver-architect` agent:
+Spawn the `harness-evolver-architect` agent:
 
 ```xml
 <objective>
@@ -156,6 +244,7 @@ Analyze the harness architecture and recommend a topology change.
 - .harness-evolver/config.json
 - .harness-evolver/harnesses/{best_version}/harness.py
 - .harness-evolver/harnesses/{best_version}/scores.json
+- .harness-evolver/context7_docs.md (if exists)
 </files_to_read>
 
 <output>
@@ -173,25 +262,21 @@ Write:
 
 Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
 
-After the architect completes, report:
-
-> Architect recommends: {current} → {recommended} ({confidence} confidence)
-> Migration path: {N} steps. Continuing evolution with architecture guidance.
-
-Then **continue the loop** — the proposer will read `architecture.json` in the next iteration.
+Report: `Architect recommends: {current} → {recommended} ({confidence} confidence)`
 
-If `architecture.json` already exists (architect already ran), skip — don't re-run.
+Then **continue the loop** — the proposer reads `architecture.json` in the next iteration.
 
 ### 8. Check Stop Conditions
 
 - **Target**: `combined_score >= target_score` → stop
 - **N reached**: done
-- **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop (architecture change didn't help)
+- **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop
 
 ## When Loop Ends — Final Report
 
 - Best version and score
 - Improvement over baseline (absolute and %)
 - Total iterations run
+- Whether critic was triggered and eval was upgraded
 - Whether architect was triggered and what it recommended
 - Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."