harness-evolver 1.0.0 → 1.2.0

@@ -49,6 +49,17 @@ Assess whether the current topology matches the task complexity:
  - Consider the current score — is there room for improvement?
  - Consider the task diversity — do different tasks need different approaches?
 
+ ### Consult Documentation (if Context7 available)
+
+ Before recommending a topology that involves specific frameworks or libraries:
+ 1. Check `config.json` `stack.detected` for available libraries
+ 2. Use `resolve-library-id` + `get-library-docs` to verify:
+ - Does the recommended framework support the topology you're suggesting?
+ - What's the current API for implementing it?
+ - Are there examples in the docs?
+
+ Include documentation references in `architecture.md` so the proposer can follow them.
+
  ### Phase 3: RECOMMEND
 
  Choose the optimal topology based on:
@@ -0,0 +1,117 @@
+ ---
+ name: harness-evolver-critic
+ description: |
+ Use this agent when scores converge suspiciously fast (>0.3 jump in one iteration
+ or 1.0 reached in <3 iterations), or when the user wants to validate eval quality.
+ Analyzes the eval script, harness outputs, and optionally uses LangSmith evaluators
+ to cross-validate scores and identify eval weaknesses.
+ model: opus
+ ---
+
+ # Harness Evolver — Critic Agent
+
+ You are the critic in the Harness Evolver loop. Your job is to assess whether the eval
+ script is rigorous enough and whether high scores reflect genuine improvement or eval gaming.
+
+ ## When You Are Called
+
+ You are called when:
+ - Score jumps >0.3 in a single iteration (suspicious rapid improvement)
+ - Score reaches 1.0 in fewer than 3 iterations (too easy)
+ - The user explicitly requests `/harness-evolver:critic`
+ - The evolve loop detects potential eval gaming
+
+ ## Your Workflow
+
+ ### Phase 1: ANALYZE THE EVAL
+
+ Read `.harness-evolver/eval/eval.py` and assess:
+ - **Matching strategy**: exact match? substring? regex? semantic? LLM-as-judge?
+ - **Scoring granularity**: binary (0/1)? continuous (0.0-1.0)? partial credit?
+ - **Edge case handling**: what happens with empty output? malformed output? extra text?
+ - **Gaming vectors**: can the harness trivially achieve 1.0 by formatting tricks?
+ - Substring match: harness just needs to include the expected text somewhere
+ - Case-insensitive: harness can output any casing
+ - No length penalty: harness can dump everything and substring will match
+
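The gaming vectors above can be made concrete with a small sketch (the `expected`/`actual` strings below are invented examples, not taken from the real eval):

```python
# Invented examples illustrating the gaming vectors above — not the project's eval.py.

def lenient_score(expected: str, actual: str) -> float:
    """Case-insensitive substring match: passes if expected appears anywhere."""
    return 1.0 if expected.lower() in actual.lower() else 0.0

def strict_score(expected: str, actual: str) -> float:
    """Normalized exact match: whitespace-trimmed, no extra text allowed."""
    return 1.0 if actual.strip() == expected.strip() else 0.0

# Dumping everything games the lenient eval but not the strict one.
dump = "Let me think. Candidates: Lyon, Paris, Nice. Answer: Paris."
assert lenient_score("Paris", dump) == 1.0   # gamed
assert strict_score("Paris", dump) == 0.0    # caught
assert strict_score("Paris", "  Paris\n") == 1.0  # correct answer still passes
```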
+ ### Phase 2: CROSS-VALIDATE WITH EVIDENCE
+
+ Read the harness outputs that scored high and check:
+ - Are the outputs genuinely good answers, or do they just contain the magic substring?
+ - Compare outputs across versions: did the harness actually improve, or did it just reformat?
+ - Read `proposal.md` of high-scoring versions: are changes substantive or cosmetic?
+
+ If `langsmith-cli` is available (check by running `which langsmith-cli`):
+
+ ```bash
+ # Get the actual LLM inputs/outputs for the best version
+ langsmith-cli --json runs list --project harness-evolver-{best_version} --fields inputs,outputs,name --limit 10
+
+ # Check if there are quality issues the eval missed
+ langsmith-cli --json runs stats --project harness-evolver-{best_version}
+ ```
+
+ ### Phase 3: DIAGNOSE EVAL WEAKNESSES
+
+ Produce a structured critique:
+
+ ```json
+ {
+ "eval_quality": "weak|moderate|strong",
+ "gaming_detected": true|false,
+ "weaknesses": [
+ {
+ "type": "substring_match_too_lenient",
+ "description": "Eval uses `expected in actual` which passes if expected text appears anywhere",
+ "example": "task_005: expected 'Paris' but harness output 'I visited Paris last summer' scores 1.0",
+ "severity": "high"
+ }
+ ],
+ "recommendations": [
+ {
+ "priority": 1,
+ "change": "Use semantic similarity instead of substring match",
+ "implementation": "Use LLM-as-judge: ask the LLM if the answer is correct given the question and expected answer"
+ }
+ ],
+ "proposed_eval_improvements": "... code snippet ..."
+ }
+ ```
+
+ ### Phase 4: PROPOSE IMPROVED EVAL
+
+ If weaknesses are found, write a proposed improved eval at `.harness-evolver/eval/eval_improved.py`.
+ The improved eval should:
+ - Be stricter than the current eval
+ - Not be so strict that correct answers fail (no false negatives)
+ - Add multiple scoring dimensions if appropriate (accuracy, completeness, conciseness)
+ - Optionally use LLM-as-judge for semantic evaluation (if an API key is available)
+
+ **IMPORTANT**: Do NOT modify the existing `eval/eval.py` directly. Write the improved version
+ as `eval_improved.py` and let the user decide to adopt it.
+
+ Also write `.harness-evolver/critic_report.md` with a human-readable analysis.
+
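A minimal stdlib-only skeleton for such an `eval_improved.py`, preserving the `--results-dir`/`--tasks-dir`/`--scores` interface required by the rules below. The per-task JSON layout (`{"expected": ...}` tasks, `{"output": ...}` results) is an assumption for illustration, not the project's actual format:

```python
#!/usr/bin/env python3
"""Sketch of a stricter eval_improved.py. File layouts are assumptions."""
import argparse
import json
import pathlib

def score_task(expected: str, actual: str) -> float:
    # Stricter than bare substring matching: full credit only for a normalized
    # exact match; partial credit when the answer is alone on the final line.
    if actual.strip() == expected.strip():
        return 1.0
    lines = actual.strip().splitlines()
    if lines and lines[-1].strip() == expected.strip():
        return 0.5
    return 0.0

def main() -> None:
    ap = argparse.ArgumentParser()
    ap.add_argument("--results-dir", required=True)  # interface contract: keep
    ap.add_argument("--tasks-dir", required=True)    # these three flags unchanged
    ap.add_argument("--scores", required=True)
    args = ap.parse_args()

    scores = {}
    for task_file in sorted(pathlib.Path(args.tasks_dir).glob("*.json")):
        task = json.loads(task_file.read_text())  # assumed: {"expected": "..."}
        result_file = pathlib.Path(args.results_dir) / task_file.name
        actual = ""
        if result_file.exists():  # assumed: {"output": "..."}
            actual = json.loads(result_file.read_text()).get("output", "")
        scores[task_file.stem] = score_task(task.get("expected", ""), actual)

    combined = sum(scores.values()) / len(scores) if scores else 0.0
    pathlib.Path(args.scores).write_text(
        json.dumps({"tasks": scores, "combined_score": combined}, indent=2)
    )
```

When adopting, call `main()` behind an `if __name__ == "__main__":` guard and adjust the file-layout assumptions to the real task/result schema.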
+ ### Phase 5: RE-SCORE
+
+ If you wrote an improved eval, re-run the best harness version against it:
+
+ ```bash
+ python3 $TOOLS/evaluate.py run \
+ --harness .harness-evolver/harnesses/{best}/harness.py \
+ --config .harness-evolver/harnesses/{best}/config.json \
+ --tasks-dir .harness-evolver/eval/tasks/ \
+ --eval .harness-evolver/eval/eval_improved.py \
+ --traces-dir /tmp/critic-rescore/ \
+ --scores /tmp/critic-rescore-scores.json
+ ```
+
+ Report the score difference: "With the current eval: 1.0. With the improved eval: 0.65. This confirms the eval was too lenient."
+
+ ## Rules
+
+ 1. **Never weaken the eval** — only propose stricter or more nuanced scoring
+ 2. **Don't require external dependencies** — improved eval must be stdlib-only (unless an LLM API key is available for LLM-as-judge)
+ 3. **Preserve the eval interface** — `--results-dir`, `--tasks-dir`, `--scores` contract must stay the same
+ 4. **Be specific** — cite exact task IDs and outputs that expose the weakness
+ 5. **Use LangSmith if available** — cross-validate with `langsmith-cli` evaluators before writing your own critique
@@ -55,23 +55,77 @@ You are working inside a `.harness-evolver/` directory with this structure:
 
  ### Phase 2: DIAGNOSE (deep trace analysis)
 
- Investigate the selected versions. Use standard tools:
- - `cat .harness-evolver/harnesses/v{N}/scores.json` — see per-task results
- - `cat .harness-evolver/harnesses/v{N}/traces/task_XXX/output.json` see what went wrong
- - `cat .harness-evolver/harnesses/v{N}/traces/stderr.log` — look for errors
- - `diff .harness-evolver/harnesses/v{A}/harness.py .harness-evolver/harnesses/v{B}/harness.py` — compare
- - `grep -r "error\|Error\|FAIL\|exception" .harness-evolver/harnesses/v{N}/traces/`
-
- Ask yourself:
+ **Step 1: Try LangSmith first (if available)**
+
+ Check if `langsmith-cli` is available and if LangSmith tracing is enabled in `config.json`:
+
+ ```bash
+ which langsmith-cli && cat .harness-evolver/config.json | python3 -c "import sys,json; c=json.load(sys.stdin); print(c.get('eval',{}).get('langsmith',{}).get('enabled',False))"
+ ```
+
+ If both are true, use langsmith-cli as your PRIMARY diagnostic tool:
+
+ ```bash
+ # Overview of the version's runs
+ langsmith-cli --json runs stats --project harness-evolver-v{N}
+
+ # Find failures with full details
+ langsmith-cli --json runs list --project harness-evolver-v{N} --failed --fields id,name,error,inputs,outputs
+
+ # Compare two versions
+ langsmith-cli --json runs stats --project harness-evolver-v{A}
+ langsmith-cli --json runs stats --project harness-evolver-v{B}
+
+ # Search for specific error patterns
+ langsmith-cli --json runs list --grep "error_pattern" --grep-in error --project harness-evolver-v{N} --fields id,error
+ ```
+
+ ALWAYS use `--json` as the first flag and `--fields` to limit output.
+ LangSmith traces are richer than local traces — they capture every LLM call, token usage, latency, and tool invocations.
+
+ **Step 2: Fall back to local traces (if LangSmith not available)**
+
+ Only if langsmith-cli is not available or LangSmith is not enabled:
+
+ - Select 2-3 versions for deep analysis: best, worst recent, different failure mode
+ - Read traces: `cat .harness-evolver/harnesses/v{N}/traces/{task_id}/output.json`
+ - Search errors: `grep -r "error\|Error\|FAIL" .harness-evolver/harnesses/v{N}/traces/`
+ - Compare: `diff .harness-evolver/harnesses/v{A}/harness.py .harness-evolver/harnesses/v{B}/harness.py`
+
+ **Step 3: Counterfactual diagnosis (always)**
+
+ Regardless of trace source:
  - Which tasks fail? Is there a pattern?
  - What changed between a version that passed and one that failed?
  - Is this a code bug, a prompt issue, a retrieval problem, or a parameter problem?
+ - Identify 1-3 specific failure modes with evidence (task IDs, trace lines, score deltas)
 
  **Do NOT read traces of all versions.** Focus on 2-3. Use summary.json to filter.
 
  ### Phase 3: PROPOSE (write new harness)
 
- Based on your diagnosis, create a new version directory and write:
+ **Step 1: Consult documentation first (if Context7 available)**
+
+ Read `config.json` field `stack.detected` to see which libraries the harness uses.
+
+ BEFORE writing any code that uses a library API:
+ 1. Use `resolve-library-id` with the `context7_id` from the stack config
+ 2. Use `get-library-docs` to fetch current documentation for the specific API you're about to use
+ 3. Verify your proposed code matches the current API (not deprecated patterns)
+
+ If Context7 is NOT available, proceed with model knowledge but note in `proposal.md`:
+ "API not verified against current docs."
+
+ Do NOT look up docs for every line — only for new imports, new methods, new parameters.
+
+ **Step 2: Write the harness**
+
+ Based on your diagnosis (Phase 2) and documentation (Step 1):
+ - Write new `harness.py` based on the best candidate + corrections
+ - Write `config.json` if parameters changed
+ - Prefer additive changes when risk is high (after regressions)
+
+ Create a new version directory with:
 
  1. `harnesses/v{NEXT}/harness.py` — the new harness code
  2. `harnesses/v{NEXT}/config.json` — parameters (copy from parent, modify if needed)
@@ -82,15 +136,16 @@ Based on your diagnosis, create a new version directory and write:
  python3 harness.py --input INPUT.json --output OUTPUT.json [--traces-dir DIR] [--config CONFIG.json]
  ```
 
- ### Phase 4: DOCUMENT
+ **Step 3: Document**
 
- Write a clear `proposal.md` that includes:
- - `Based on v{PARENT}` on the first line
- - What failure modes you identified
- - What specific changes you made and why
- - What you expect to improve
+ Write `proposal.md`:
+ - `Based on v{PARENT}` on first line
+ - What failure modes you identified (with evidence from LangSmith or local traces)
+ - What documentation you consulted (Context7 or model knowledge)
+ - What changes you made and why
+ - Expected impact on score
 
- Append a summary to `PROPOSER_HISTORY.md`.
+ Append summary to `PROPOSER_HISTORY.md`.
 
  ## Architecture Guidance (if available)
 
@@ -118,16 +173,21 @@ If `.harness-evolver/architecture.json` exists, read it in Phase 1 (ORIENT). The
 
  7. **Use available API keys from environment.** Check `config.json` field `api_keys` to see which LLM APIs are available (Anthropic, OpenAI, Gemini, OpenRouter, etc.). Always read keys via `os.environ.get("KEY_NAME")` — never hardcode values. If an evolution strategy requires an API that isn't available, note it in `proposal.md` and choose an alternative.
 
- ## Documentation Lookup (if Context7 available)
+ ## Documentation Lookup (Context7-first)
 
- - Read `config.json` field `stack.detected` to see which libraries the harness uses.
- - BEFORE writing code that uses a library from the detected stack,
- use the `resolve-library-id` tool with the `context7_id` from the config, then
- `get-library-docs` to fetch documentation relevant to your proposed change.
- - If Context7 is NOT available, proceed with model knowledge
- but note in `proposal.md`: "API not verified against current docs."
- - Do NOT look up docs for every line of code — only when proposing
- changes that involve specific APIs (new imports, new methods, new parameters).
+ Context7 is the PRIMARY documentation source. In Phase 3, Step 1:
+
+ 1. Read `config.json` field `stack.detected` to see which libraries the harness uses.
+ 2. BEFORE writing code that uses a library from the detected stack,
+ use the `resolve-library-id` tool with the `context7_id` from the config, then
+ `get-library-docs` to fetch documentation relevant to your proposed change.
+ 3. Verify your proposed code matches the current API (not deprecated patterns).
+
+ If Context7 is NOT available, proceed with model knowledge
+ but note in `proposal.md`: "API not verified against current docs."
+
+ Do NOT look up docs for every line of code — only when proposing
+ changes that involve specific APIs (new imports, new methods, new parameters).
 
  ## What You Do NOT Do
 
@@ -137,13 +197,16 @@ If `.harness-evolver/architecture.json` exists, read it in Phase 1 (ORIENT). The
  - Do NOT modify any prior version's files — history is immutable.
  - Do NOT create files outside of `harnesses/v{NEXT}/` and `PROPOSER_HISTORY.md`.
 
- ## LangSmith Traces (when langsmith-cli is available)
+ ## LangSmith Traces (LangSmith-first)
+
+ LangSmith is the PRIMARY diagnostic tool. In Phase 2, Step 1:
 
- If LangSmith tracing is enabled (check `config.json` field `eval.langsmith.enabled`),
- each harness run is automatically traced to a LangSmith project named
- `{project_prefix}-v{NNN}`.
+ 1. Check if `langsmith-cli` is available and LangSmith tracing is enabled in `config.json`.
+ 2. If both are true, use langsmith-cli BEFORE falling back to local traces.
 
- Use `langsmith-cli` to query traces directly:
+ LangSmith traces are richer than local traces — they capture every LLM call, token usage,
+ latency, and tool invocations. Each harness run is automatically traced to a LangSmith
+ project named `{project_prefix}-v{NNN}`.
 
  ```bash
  # Find failures in this version
@@ -164,7 +227,7 @@ langsmith-cli --json runs get-latest --project harness-evolver-v{N} --failed
  ```
 
  ALWAYS use `--json` as the first flag and `--fields` to limit output size.
- If `langsmith-cli` is not available, fall back to local traces in `traces/` as usual.
+ Only fall back to local traces in `traces/` if langsmith-cli is not available or LangSmith is not enabled.
 
  ## Output
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "1.0.0",
+ "version": "1.2.0",
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
  "author": "Raphael Valdetaro",
  "license": "MIT",
@@ -0,0 +1,37 @@
+ ---
+ name: harness-evolver:critic
+ description: "Use when scores converge suspiciously fast, eval quality is questionable, the harness reaches 1.0 in few iterations, or the user wants to validate that improvements are genuine. Also triggers automatically when score jumps >0.3 in one iteration."
+ allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
+ ---
+
+ # /harness-evolver:critic
+
+ Analyze eval quality and detect eval gaming.
+
+ ## Resolve Tool Path
+
+ ```bash
+ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
+ ```
+
+ ## Prerequisites
+
+ `.harness-evolver/` must exist with at least one evaluated version (v001+).
+
+ ## What To Do
+
+ 1. Read `summary.json` to check for suspicious patterns:
+ - Score jump >0.3 in a single iteration
+ - Score reached 1.0 in <3 iterations
+ - All tasks suddenly pass after failing
+
+ 2. Spawn the `harness-evolver-critic` agent:
+ > Analyze the eval quality for this harness evolution project.
+ > Check if the eval at `.harness-evolver/eval/eval.py` is rigorous enough.
+ > The best version is {version} with score {score} achieved in {iterations} iterations.
+
+ 3. After the critic reports:
+ - Show the eval quality assessment
+ - If `eval_improved.py` was created, show the score comparison
+ - Ask user: "Adopt the improved eval? This will re-baseline all scores."
+ - If adopted: copy `eval_improved.py` to `eval/eval.py`, re-run baseline, update state
@@ -80,20 +80,67 @@ python3 $TOOLS/state.py update \
 
  Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
 
- ### 7. Check Stop Conditions
+ ### 6.5. Check for Eval Gaming
+
+ After updating state, read the latest `summary.json` and check:
+ - Did the score jump >0.3 from parent version?
+ - Did we reach 1.0 in fewer than 3 total iterations?
+
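The two checks above can be sketched as follows (the `summary.json` field names — `latest`, `score`, `parent_score`, `iteration` — are hypothetical placeholders for the real schema):

```python
def gaming_signals(summary: dict) -> list[str]:
    """Flag suspicious convergence. Field names are hypothetical placeholders."""
    latest = summary["latest"]
    warnings = []
    # Check 1: score jumped >0.3 relative to the parent version.
    if latest["score"] - latest.get("parent_score", latest["score"]) > 0.3:
        warnings.append("score jumped >0.3 in one iteration")
    # Check 2: perfect score reached in fewer than 3 iterations.
    if latest["score"] >= 1.0 and latest["iteration"] < 3:
        warnings.append("reached 1.0 in fewer than 3 iterations")
    return warnings
```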
+ If either is true, warn:
+
+ > Suspicious convergence detected: score jumped from {parent_score} to {score} in one iteration.
+ > The eval may be too lenient. Run `/harness-evolver:critic` to analyze eval quality.
+
+ If score is 1.0 and iterations < 3, STOP the loop and strongly recommend the critic:
+
+ > Perfect score reached in only {iterations} iteration(s). This usually indicates
+ > the eval is too easy, not that the harness is perfect. Run `/harness-evolver:critic`
+ > before continuing.
+
+ ### 7. Auto-trigger Architect (on stagnation or regression)
+
+ Check if the architect should be auto-spawned. This happens when:
+ - **Stagnation**: 3 consecutive iterations within 1% of each other
+ - **Regression**: score dropped below parent score (even once)
+
+ AND `.harness-evolver/architecture.json` does NOT already exist.
+
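The trigger conditions above can be sketched as a small predicate (the score history is a hypothetical input — the real loop reads it from `summary.json` — and "within 1%" is interpreted here as relative to the window's max):

```python
def architect_trigger(scores: list[float], parent_score: float, architecture_exists: bool) -> bool:
    """True when stagnation or regression warrants spawning the architect."""
    if architecture_exists:
        return False  # architect already ran — don't re-run
    window = scores[-3:]
    # Stagnation: last 3 scores within 1% of each other (relative to window max).
    stagnated = len(scores) >= 3 and max(window) - min(window) <= 0.01 * max(window)
    # Regression: latest score dropped below its parent's score.
    regressed = bool(scores) and scores[-1] < parent_score
    return stagnated or regressed
```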
+ If triggered:
+
+ ```bash
+ python3 $TOOLS/analyze_architecture.py \
+ --harness .harness-evolver/harnesses/{best_version}/harness.py \
+ --traces-dir .harness-evolver/harnesses/{best_version}/traces \
+ --summary .harness-evolver/summary.json \
+ -o .harness-evolver/architecture_signals.json
+ ```
+
+ Then spawn the `harness-evolver-architect` agent:
+
+ > The evolution loop has stagnated/regressed after {iterations} iterations (best: {best_score}).
+ > Analyze the harness architecture and recommend a topology change.
+ > Raw signals at `.harness-evolver/architecture_signals.json`.
+ > Write `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`.
+
+ After the architect completes, report:
+
+ > Architect recommends: {current} → {recommended} ({confidence} confidence)
+ > Migration path: {N} steps. Continuing evolution with architecture guidance.
+
+ Then **continue the loop** — the proposer will read `architecture.json` in the next iteration.
+
+ If `architecture.json` already exists (architect already ran), skip — don't re-run.
+
+ ### 8. Check Stop Conditions
 
- - **Stagnation**: last 3 scores within 1% of each other → stop
  - **Target**: `combined_score >= target_score` → stop
  - **N reached**: done
+ - **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop (architecture change didn't help)
 
  ## When Loop Ends — Final Report
 
  - Best version and score
  - Improvement over baseline (absolute and %)
  - Total iterations run
+ - Whether architect was triggered and what it recommended
  - Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."
-
- If the loop stopped due to stagnation AND `.harness-evolver/architecture.json` does NOT exist:
-
- > The proposer may have hit an architectural ceiling. Run `/harness-evolver:architect`
- > to analyze whether a different agent topology could help.