harness-evolver 1.0.0 → 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -49,6 +49,17 @@ Assess whether the current topology matches the task complexity:

- Consider the current score — is there room for improvement?
- Consider the task diversity — do different tasks need different approaches?

### Consult Documentation (if Context7 available)

Before recommending a topology that involves specific frameworks or libraries:
1. Check `config.json` `stack.detected` for available libraries
2. Use `resolve-library-id` + `get-library-docs` to verify:
   - Does the recommended framework support the topology you're suggesting?
   - What's the current API for implementing it?
   - Are there examples in the docs?

Include documentation references in `architecture.md` so the proposer can follow them.
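The check in step 1 can be sketched as follows; the exact layout of `stack.detected` (assumed here to be a list of objects with `name` and `context7_id` fields) is a guess, so adjust to the real config schema:

```python
import json

def detected_libraries(config_path=".harness-evolver/config.json"):
    """Read config.json and map each detected library to its context7_id.
    Assumes stack.detected is a list of {"name": ..., "context7_id": ...}."""
    with open(config_path) as f:
        config = json.load(f)
    return {
        lib.get("name"): lib.get("context7_id")
        for lib in config.get("stack", {}).get("detected", [])
    }
```

The returned mapping tells you which libraries to pass to `resolve-library-id` before recommending a topology that depends on them.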

### Phase 3: RECOMMEND

Choose the optimal topology based on:
@@ -0,0 +1,117 @@

---
name: harness-evolver-critic
description: |
  Use this agent when scores converge suspiciously fast (>0.3 jump in one iteration
  or 1.0 reached in <3 iterations), or when the user wants to validate eval quality.
  Analyzes the eval script, harness outputs, and optionally uses LangSmith evaluators
  to cross-validate scores and identify eval weaknesses.
model: opus
---

# Harness Evolver — Critic Agent

You are the critic in the Harness Evolver loop. Your job is to assess whether the eval script is rigorous enough and whether high scores reflect genuine improvement or eval gaming.

## When You Are Called

You are called when:
- Score jumps >0.3 in a single iteration (suspicious rapid improvement)
- Score reaches 1.0 in fewer than 3 iterations (too easy)
- The user explicitly requests `/harness-evolver:critic`
- The evolve loop detects potential eval gaming

## Your Workflow

### Phase 1: ANALYZE THE EVAL

Read `.harness-evolver/eval/eval.py` and assess:
- **Matching strategy**: exact match? substring? regex? semantic? LLM-as-judge?
- **Scoring granularity**: binary (0/1)? continuous (0.0-1.0)? partial credit?
- **Edge case handling**: what happens with empty output? malformed output? extra text?
- **Gaming vectors**: can the harness trivially achieve 1.0 by formatting tricks?
  - Substring match: the harness just needs to include the expected text somewhere
  - Case-insensitive: the harness can output any casing
  - No length penalty: the harness can dump everything and the substring will match
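The substring gaming vector above can be made concrete with a small sketch (hypothetical scoring functions for illustration, not the project's actual `eval.py`):

```python
def lenient_score(expected: str, actual: str) -> float:
    # The lenient pattern under critique: passes if the expected
    # text appears anywhere in the output, any casing.
    return 1.0 if expected.lower() in actual.lower() else 0.0

def stricter_score(expected: str, actual: str) -> float:
    # One stricter sketch: normalized exact match scores 1.0;
    # a padded answer containing the substring gets penalized credit.
    norm = lambda s: " ".join(s.lower().split())
    exp, act = norm(expected), norm(actual)
    if exp == act:
        return 1.0
    if exp and exp in act:
        # Partial credit, reduced by the amount of extra padding
        return max(0.0, 1.0 - (len(act) - len(exp)) / max(len(act), 1))
    return 0.0
```

A padded answer like "I visited Paris last summer" games the lenient eval (scores 1.0 against expected "Paris") but not the stricter one.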

### Phase 2: CROSS-VALIDATE WITH EVIDENCE

Read the harness outputs that scored high and check:
- Are the outputs genuinely good answers, or do they just contain the magic substring?
- Compare outputs across versions: did the harness actually improve, or just reformat its output?
- Read `proposal.md` of high-scoring versions: are changes substantive or cosmetic?

If `langsmith-cli` is available (check by running `which langsmith-cli`):

```bash
# Get the actual LLM inputs/outputs for the best version
langsmith-cli --json runs list --project harness-evolver-{best_version} --fields inputs,outputs,name --limit 10

# Check if there are quality issues the eval missed
langsmith-cli --json runs stats --project harness-evolver-{best_version}
```

### Phase 3: DIAGNOSE EVAL WEAKNESSES

Produce a structured critique:

```json
{
  "eval_quality": "weak|moderate|strong",
  "gaming_detected": true|false,
  "weaknesses": [
    {
      "type": "substring_match_too_lenient",
      "description": "Eval uses `expected in actual`, which passes if the expected text appears anywhere",
      "example": "task_005: expected 'Paris' but harness output 'I visited Paris last summer' scores 1.0",
      "severity": "high"
    }
  ],
  "recommendations": [
    {
      "priority": 1,
      "change": "Use semantic similarity instead of substring match",
      "implementation": "Use LLM-as-judge: ask the LLM if the answer is correct given the question and expected answer"
    }
  ],
  "proposed_eval_improvements": "... code snippet ..."
}
```

### Phase 4: PROPOSE IMPROVED EVAL

If weaknesses are found, write a proposed improved eval at `.harness-evolver/eval/eval_improved.py`. The improved eval should:
- Be stricter than the current eval
- Not be so strict that correct answers fail (no false negatives)
- Add multiple scoring dimensions if appropriate (accuracy, completeness, conciseness)
- Optionally use LLM-as-judge for semantic evaluation (if an API key is available)

**IMPORTANT**: Do NOT modify the existing `eval/eval.py` directly. Write the improved version as `eval_improved.py` and let the user decide whether to adopt it.

Also write `.harness-evolver/critic_report.md` with a human-readable analysis.
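A multi-dimensional scorer along these lines might look like the following sketch; the weights and the string-based heuristics are illustrative assumptions, not a prescribed implementation:

```python
def score_output(expected: str, actual: str) -> dict:
    """Sketch of multi-dimensional scoring: accuracy, completeness, conciseness.
    Weights (0.5/0.3/0.2) are arbitrary placeholders."""
    norm = lambda s: " ".join(s.lower().split())
    exp, act = norm(expected), norm(actual)
    # Accuracy: expected answer present after normalization
    accuracy = 1.0 if exp and exp in act else 0.0
    # Completeness: fraction of expected tokens that appear in the output
    exp_tokens = set(exp.split())
    completeness = (len(exp_tokens & set(act.split())) / len(exp_tokens)) if exp_tokens else 0.0
    # Conciseness: penalize output much longer than the expected answer
    conciseness = min(1.0, len(exp) / len(act)) if act else 0.0
    total = round(0.5 * accuracy + 0.3 * completeness + 0.2 * conciseness, 3)
    return {"accuracy": accuracy, "completeness": completeness,
            "conciseness": conciseness, "total": total}
```

An exact answer scores 1.0 on all dimensions; a padded answer keeps accuracy but loses conciseness, so the total drops below 1.0.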

### Phase 5: RE-SCORE

If you wrote an improved eval, re-run the best harness version against it:

```bash
python3 $TOOLS/evaluate.py run \
  --harness .harness-evolver/harnesses/{best}/harness.py \
  --config .harness-evolver/harnesses/{best}/config.json \
  --tasks-dir .harness-evolver/eval/tasks/ \
  --eval .harness-evolver/eval/eval_improved.py \
  --traces-dir /tmp/critic-rescore/ \
  --scores /tmp/critic-rescore-scores.json
```

Report the score difference: "With the current eval: 1.0. With the improved eval: 0.65. This confirms the eval was too lenient."

## Rules

1. **Never weaken the eval** — only propose stricter or more nuanced scoring
2. **Don't require external dependencies** — the improved eval must be stdlib-only (unless an LLM API key is available for LLM-as-judge)
3. **Preserve the eval interface** — the `--results-dir`, `--tasks-dir`, `--scores` contract must stay the same
4. **Be specific** — cite exact task IDs and outputs that expose the weakness
5. **Use LangSmith if available** — cross-validate with `langsmith-cli` evaluators before writing your own critique
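For rule 3, a skeleton that preserves the eval contract could look like this; the flag names come from the rule itself, while the directory layout and placeholder scoring are assumptions:

```python
import argparse
import json
from pathlib import Path

def main(argv=None):
    # Same CLI contract as eval.py: --results-dir, --tasks-dir, --scores
    parser = argparse.ArgumentParser(description="Sketch of eval_improved.py")
    parser.add_argument("--results-dir", required=True)
    parser.add_argument("--tasks-dir", required=True)
    parser.add_argument("--scores", required=True)
    args = parser.parse_args(argv)

    scores = {}
    for task_dir in sorted(Path(args.tasks_dir).iterdir()):
        # Stricter per-task scoring would go here; the placeholder
        # just keeps the sketch runnable end to end
        scores[task_dir.name] = 0.0
    Path(args.scores).write_text(json.dumps(scores, indent=2))
```

Keeping the same flags means `evaluate.py run` can point `--eval` at either script without changes.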
@@ -55,23 +55,77 @@ You are working inside a `.harness-evolver/` directory with this structure:

### Phase 2: DIAGNOSE (deep trace analysis)

**Step 1: Try LangSmith first (if available)**

Check if `langsmith-cli` is available and if LangSmith tracing is enabled in `config.json`:

```bash
which langsmith-cli && cat .harness-evolver/config.json | python3 -c "import sys,json; c=json.load(sys.stdin); print(c.get('eval',{}).get('langsmith',{}).get('enabled',False))"
```

If both are true, use langsmith-cli as your PRIMARY diagnostic tool:

```bash
# Overview of the version's runs
langsmith-cli --json runs stats --project harness-evolver-v{N}

# Find failures with full details
langsmith-cli --json runs list --project harness-evolver-v{N} --failed --fields id,name,error,inputs,outputs

# Compare two versions
langsmith-cli --json runs stats --project harness-evolver-v{A}
langsmith-cli --json runs stats --project harness-evolver-v{B}

# Search for specific error patterns
langsmith-cli --json runs list --grep "error_pattern" --grep-in error --project harness-evolver-v{N} --fields id,error
```

ALWAYS use `--json` as the first flag and `--fields` to limit output.
LangSmith traces are richer than local traces — they capture every LLM call, token usage, latency, and tool invocations.

**Step 2: Fall back to local traces (if LangSmith not available)**

Only if langsmith-cli is not available or LangSmith is not enabled:

- Select 2-3 versions for deep analysis: best, worst recent, different failure mode
- Read traces: `cat .harness-evolver/harnesses/v{N}/traces/{task_id}/output.json`
- Search errors: `grep -r "error\|Error\|FAIL" .harness-evolver/harnesses/v{N}/traces/`
- Compare: `diff .harness-evolver/harnesses/v{A}/harness.py .harness-evolver/harnesses/v{B}/harness.py`

**Step 3: Counterfactual diagnosis (always)**

Regardless of trace source:
- Which tasks fail? Is there a pattern?
- What changed between a version that passed and one that failed?
- Is this a code bug, a prompt issue, a retrieval problem, or a parameter problem?
- Identify 1-3 specific failure modes with evidence (task IDs, trace lines, score deltas)

**Do NOT read traces of all versions.** Focus on 2-3. Use summary.json to filter.
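Filtering with summary.json might be sketched as follows, assuming it maps version names to objects with a `score` field (the real schema may differ):

```python
import json

def pick_versions(summary_path=".harness-evolver/summary.json", k=3):
    """Pick at most k versions worth deep-tracing: the worst-scoring
    ones (likely distinct failure modes) plus the best for contrast."""
    with open(summary_path) as f:
        summary = json.load(f)
    ranked = sorted(summary.items(), key=lambda kv: kv[1]["score"])
    worst = [version for version, _ in ranked[: k - 1]]
    best = ranked[-1][0]
    return worst + [best]
```

Only the versions this returns get their `traces/` directories read in full.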

### Phase 3: PROPOSE (write new harness)

**Step 1: Consult documentation first (if Context7 available)**

Read `config.json` field `stack.detected` to see which libraries the harness uses.

BEFORE writing any code that uses a library API:
1. Use `resolve-library-id` with the `context7_id` from the stack config
2. Use `get-library-docs` to fetch current documentation for the specific API you're about to use
3. Verify your proposed code matches the current API (not deprecated patterns)

If Context7 is NOT available, proceed with model knowledge but note in `proposal.md`: "API not verified against current docs."

Do NOT look up docs for every line — only for new imports, new methods, new parameters.

**Step 2: Write the harness**

Based on your diagnosis (Phase 2) and documentation (Step 1):
- Write new `harness.py` based on the best candidate + corrections
- Write `config.json` if parameters changed
- Prefer additive changes when risk is high (after regressions)

Create a new version directory with:

1. `harnesses/v{NEXT}/harness.py` — the new harness code
2. `harnesses/v{NEXT}/config.json` — parameters (copy from parent, modify if needed)

@@ -82,15 +136,16 @@ Based on your diagnosis, create a new version directory and write:

python3 harness.py --input INPUT.json --output OUTPUT.json [--traces-dir DIR] [--config CONFIG.json]
```

**Step 3: Document**

Write `proposal.md`:
- `Based on v{PARENT}` on the first line
- What failure modes you identified (with evidence from LangSmith or local traces)
- What documentation you consulted (Context7 or model knowledge)
- What changes you made and why
- Expected impact on score

Append a summary to `PROPOSER_HISTORY.md`.

## Architecture Guidance (if available)

@@ -118,16 +173,21 @@ If `.harness-evolver/architecture.json` exists, read it in Phase 1 (ORIENT). The

7. **Use available API keys from environment.** Check `config.json` field `api_keys` to see which LLM APIs are available (Anthropic, OpenAI, Gemini, OpenRouter, etc.). Always read keys via `os.environ.get("KEY_NAME")` — never hardcode values. If an evolution strategy requires an API that isn't available, note it in `proposal.md` and choose an alternative.

## Documentation Lookup (Context7-first)

Context7 is the PRIMARY documentation source. In Phase 3, Step 1:

1. Read `config.json` field `stack.detected` to see which libraries the harness uses.
2. BEFORE writing code that uses a library from the detected stack, use the `resolve-library-id` tool with the `context7_id` from the config, then `get-library-docs` to fetch documentation relevant to your proposed change.
3. Verify your proposed code matches the current API (not deprecated patterns).

If Context7 is NOT available, proceed with model knowledge but note in `proposal.md`: "API not verified against current docs."

Do NOT look up docs for every line of code — only when proposing changes that involve specific APIs (new imports, new methods, new parameters).

## What You Do NOT Do

@@ -137,13 +197,16 @@ If `.harness-evolver/architecture.json` exists, read it in Phase 1 (ORIENT). The

- Do NOT modify any prior version's files — history is immutable.
- Do NOT create files outside of `harnesses/v{NEXT}/` and `PROPOSER_HISTORY.md`.

## LangSmith Traces (LangSmith-first)

LangSmith is the PRIMARY diagnostic tool. In Phase 2, Step 1:

1. Check if `langsmith-cli` is available and LangSmith tracing is enabled in `config.json`.
2. If both are true, use langsmith-cli BEFORE falling back to local traces.

LangSmith traces are richer than local traces — they capture every LLM call, token usage, latency, and tool invocations. Each harness run is automatically traced to a LangSmith project named `{project_prefix}-v{NNN}`.

```bash
# Find failures in this version

@@ -164,7 +227,7 @@ langsmith-cli --json runs get-latest --project harness-evolver-v{N} --failed
```

ALWAYS use `--json` as the first flag and `--fields` to limit output size.
Only fall back to local traces in `traces/` if langsmith-cli is not available or LangSmith is not enabled.

## Output
package/package.json
CHANGED

@@ -0,0 +1,37 @@

---
name: harness-evolver:critic
description: "Use when scores converge suspiciously fast, eval quality is questionable, the harness reaches 1.0 in few iterations, or the user wants to validate that improvements are genuine. Also triggers automatically when score jumps >0.3 in one iteration."
allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
---

# /harness-evolver:critic

Analyze eval quality and detect eval gaming.

## Resolve Tool Path

```bash
TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
```

## Prerequisites

`.harness-evolver/` must exist with at least one evaluated version (v001+).

## What To Do

1. Read `summary.json` to check for suspicious patterns:
   - Score jump >0.3 in a single iteration
   - Score reached 1.0 in <3 iterations
   - All tasks suddenly pass after failing

2. Spawn the `harness-evolver-critic` agent:
   > Analyze the eval quality for this harness evolution project.
   > Check if the eval at `.harness-evolver/eval/eval.py` is rigorous enough.
   > The best version is {version} with score {score}, achieved in {iterations} iterations.

3. After the critic reports:
   - Show the eval quality assessment
   - If `eval_improved.py` was created, show the score comparison
   - Ask the user: "Adopt the improved eval? This will re-baseline all scores."
   - If adopted: copy `eval_improved.py` to `eval/eval.py`, re-run the baseline, update state
package/skills/evolve/SKILL.md
CHANGED

@@ -80,6 +80,23 @@ python3 $TOOLS/state.py update \

Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`

### 6.5. Check for Eval Gaming

After updating state, read the latest `summary.json` and check:
- Did the score jump >0.3 from the parent version?
- Did we reach 1.0 in fewer than 3 total iterations?

If either is true, warn:

> Suspicious convergence detected: score jumped from {parent_score} to {score} in one iteration.
> The eval may be too lenient. Run `/harness-evolver:critic` to analyze eval quality.

If the score is 1.0 and iterations < 3, STOP the loop and strongly recommend the critic:

> Perfect score reached in only {iterations} iteration(s). This usually indicates
> the eval is too easy, not that the harness is perfect. Run `/harness-evolver:critic`
> before continuing.

### 7. Check Stop Conditions

- **Stagnation**: last 3 scores within 1% of each other → stop
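
The gaming check in step 6.5 and the stagnation stop condition in step 7 can be expressed as small helpers (a sketch; the thresholds are taken directly from the text above):

```python
def gaming_suspected(score: float, parent_score: float, iterations: int) -> bool:
    """Eval-gaming heuristics: >0.3 one-step jump, or 1.0 in under 3 iterations."""
    return (score - parent_score) > 0.3 or (score >= 1.0 and iterations < 3)

def stagnated(scores: list) -> bool:
    """Stagnation stop condition: last 3 scores within 1% of each other."""
    last = scores[-3:]
    return len(last) == 3 and max(last) - min(last) <= 0.01
```

If `gaming_suspected` fires, the loop warns (or stops) and points the user at `/harness-evolver:critic`; if `stagnated` fires, the loop stops.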
|