harness-evolver 3.0.6 → 3.1.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +8 -4
- package/agents/evolver-evaluator.md +152 -0
- package/bin/install.js +19 -21
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +70 -13
- package/skills/setup/SKILL.md +2 -2
- package/tools/__pycache__/detect_stack.cpython-314.pyc +0 -0
- package/tools/__pycache__/trace_logger.cpython-314.pyc +0 -0
- package/tools/run_eval.py +31 -24
- package/tools/setup.py +65 -34
package/README.md
CHANGED

@@ -47,7 +47,7 @@ claude
 <table>
 <tr>
 <td><b>LangSmith-Native</b></td>
-<td>No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and
+<td>No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and an agent-based LLM-as-judge for scoring via langsmith-cli. No external API keys needed. Everything is visible in the LangSmith UI.</td>
 </tr>
 <tr>
 <td><b>Real Code Evolution</b></td>

@@ -92,6 +92,7 @@ claude
 | **Architect** | Recommends multi-agent topology changes | Blue |
 | **Critic** | Validates evaluator quality, detects gaming | Red |
 | **TestGen** | Generates test inputs for LangSmith datasets | Cyan |
+| **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
 
 ---
 
@@ -104,7 +105,8 @@ claude
 - 1.5 Gather trace insights (cluster errors, tokens, latency)
 - 1.8 Analyze per-task failures (adaptive briefings)
 - 2. Spawn 5 proposers in parallel (each in a git worktree)
-- 3.
+- 3. Run target for each candidate (client.evaluate() -> code-based evaluators)
+- 3.5 Spawn evaluator agent (reads outputs via langsmith-cli, judges, writes scores)
 - 4. Compare experiments -> select winner + per-task champion
 - 5. Merge winning worktree into main branch
 - 5.5 Test suite growth (add regression examples to dataset)

@@ -119,13 +121,15 @@ claude
 ## Requirements
 
 - **LangSmith account** + `LANGSMITH_API_KEY`
-- **Python 3.10+** with `langsmith`
+- **Python 3.10+** with `langsmith` package
+- **langsmith-cli** (`uv tool install langsmith-cli`) — required for evaluator agent
 - **Git** (for worktree-based isolation)
 - **Claude Code** (or Cursor/Codex/Windsurf)
 
 ```bash
 export LANGSMITH_API_KEY="lsv2_pt_..."
-pip install langsmith
+pip install langsmith
+uv tool install langsmith-cli
 ```
 
 ---
package/agents/evolver-evaluator.md
ADDED

@@ -0,0 +1,152 @@
+---
+name: evolver-evaluator
+description: |
+  Use this agent to evaluate experiment outputs using LLM-as-judge.
+  Reads run inputs/outputs from LangSmith via langsmith-cli, judges correctness,
+  and writes scores back as feedback. No external API keys needed.
+tools: Read, Bash, Glob, Grep
+color: yellow
+---
+
+# Evolver — Evaluator Agent (v3)
+
+You are an LLM evaluation judge. Your job is to read the outputs of an experiment from LangSmith, evaluate each one for correctness, and write scores back as feedback.
+
+You ARE the LLM-as-judge. You replace the need for an external LLM API call.
+
+## Bootstrap
+
+1. Verify langsmith-cli is available:
+   ```bash
+   langsmith-cli --version
+   ```
+   If this fails, report the error and stop — langsmith-cli is required.
+
+2. Your prompt contains `<experiment>`, `<evaluators>`, and `<context>` blocks. Parse them to understand:
+   - Which experiment to evaluate
+   - What evaluation criteria to apply
+   - What the agent is supposed to do (domain context)
+
+## Tool: langsmith-cli
+
+You interact with LangSmith exclusively through `langsmith-cli`. Always use `--json` for machine-readable output.
+
+### Reading experiment outputs
+
+```bash
+langsmith-cli --json runs list \
+  --project "{experiment_name}" \
+  --fields id,inputs,outputs,error,reference_example_id \
+  --is-root \
+  --limit 200
+```
+
+This returns one JSON object per line (JSONL). Each line has:
+- `id` — the run ID (needed to write feedback)
+- `inputs` — what was sent to the agent
+- `outputs` — what the agent responded
+- `error` — error message if the run failed
+- `reference_example_id` — links back to the dataset example
+
+### Writing scores
+
+For EACH run, after judging it:
+
+```bash
+langsmith-cli --json feedback create {run_id} \
+  --key "{evaluator_key}" \
+  --score {score} \
+  --comment "{brief_reasoning}" \
+  --source model
+```
+
+Use `--source model` since this is an LLM-generated evaluation.
+
+## Your Workflow
+
+### Phase 1: Read All Outputs
+
+Fetch all runs from the experiment. Save the output to a file for reference:
+
+```bash
+langsmith-cli --json runs list \
+  --project "{experiment_name}" \
+  --fields id,inputs,outputs,error,reference_example_id \
+  --is-root --limit 200 \
+  --output experiment_runs.jsonl
+```
+
+Then read `experiment_runs.jsonl` to see all results.
+
+### Phase 2: Evaluate Each Run
+
+For each run, apply the requested evaluators. The evaluators you may be asked to judge:
+
+#### correctness
+Judge: **Is the output a correct, accurate, and complete response to the input?**
+
+Scoring:
+- `1.0` — Correct and complete. The response accurately addresses the input.
+- `0.0` — Incorrect, incomplete, or off-topic.
+
+Consider:
+- Does the response answer what was asked?
+- Is the information factually accurate?
+- Are there hallucinations or made-up facts?
+- Is the response relevant to the domain?
+
+#### conciseness
+Judge: **Is the response appropriately concise without sacrificing quality?**
+
+Scoring:
+- `1.0` — Concise and complete. No unnecessary verbosity.
+- `0.0` — Excessively verbose, repetitive, or padded.
+
+### Phase 3: Write All Scores
+
+For each run you evaluated, write feedback via `langsmith-cli feedback create`.
+
+Write scores in batches — evaluate all runs first, then write all scores. This is more efficient than alternating between reading and writing.
+
+Example for one run:
+```bash
+langsmith-cli --json feedback create "run-uuid-here" \
+  --key correctness \
+  --score 1.0 \
+  --comment "Response correctly identifies the applicable regulation and provides accurate guidance." \
+  --source model
+```
+
+### Phase 4: Summary
+
+After writing all scores, compute the aggregate:
+
+```bash
+langsmith-cli --json feedback list --run-id "{any_run_id}" --key correctness
+```
+
+## Error Handling
+
+- If a run has `error` set and empty `outputs`: score it `0.0` with comment "Run failed: {error}"
+- If a run has `outputs` but they contain an error message: score `0.0` with comment explaining the failure
+- If `outputs` is empty but no error: score `0.0` with comment "Empty output"
+
+## Rules
+
+1. **Be a fair judge** — evaluate based on the criteria, not your preferences
+2. **Brief comments** — keep feedback comments under 200 characters
+3. **Binary scoring for correctness** — use 1.0 or 0.0, not partial scores (unless instructed otherwise)
+4. **Score EVERY run** — don't skip any, even failed ones
+5. **Domain awareness** — use the `<context>` block to understand what constitutes a "correct" answer in this domain
+
+## Return Protocol
+
+When done, end your response with:
+
+## EVALUATION COMPLETE
+- **Experiment**: {experiment_name}
+- **Runs evaluated**: {N}
+- **Evaluators applied**: {list}
+- **Mean score**: {score}
+- **Pass rate**: {N}/{total} ({percent}%)
+- **Common failure patterns**: {brief list}
package/bin/install.js
CHANGED

@@ -2,7 +2,7 @@
 /**
  * Harness Evolver v3 installer.
  * Copies skills/agents/tools to runtime directories (GSD pattern).
- * Installs Python dependencies (langsmith
+ * Installs Python dependencies (langsmith) and langsmith-cli.
  *
  * Usage: npx harness-evolver@latest
  */

@@ -225,15 +225,15 @@ function installPythonDeps() {
 
   // Install/upgrade deps in the venv
   const installCommands = [
-    `uv pip install --python "${venvPython}" langsmith
-    `"${venvPip}" install --upgrade langsmith
-    `"${venvPython}" -m pip install --upgrade langsmith
+    `uv pip install --python "${venvPython}" langsmith`,
+    `"${venvPip}" install --upgrade langsmith`,
+    `"${venvPython}" -m pip install --upgrade langsmith`,
   ];
 
   for (const cmd of installCommands) {
     try {
       execSync(cmd, { stdio: "pipe", timeout: 120000 });
-      console.log(`  ${GREEN}✓${RESET} langsmith
+      console.log(`  ${GREEN}✓${RESET} langsmith installed in venv`);
       return true;
     } catch {
       continue;

@@ -241,7 +241,7 @@ function installPythonDeps() {
   }
 
   console.log(`  ${YELLOW}!${RESET} Could not install packages in venv.`);
-  console.log(`  Run manually: ${BOLD}~/.evolver/venv/bin/pip install langsmith
+  console.log(`  Run manually: ${BOLD}~/.evolver/venv/bin/pip install langsmith${RESET}`);
   return false;
 }

@@ -303,26 +303,24 @@ async function configureLangSmith(rl) {
     }
   }
 
-  // --- Step 2: langsmith-cli ---
+  // --- Step 2: langsmith-cli (required for evaluator agent) ---
   if (hasLangsmithCli) {
     console.log(`  ${GREEN}✓${RESET} langsmith-cli installed`);
   } else {
-    console.log(`\n  ${BOLD}langsmith-cli${RESET} —
-    console.log(`  ${DIM}
-
-
-
-
-    execSync("uv tool install langsmith-cli 2>/dev/null || pip install langsmith-cli 2>/dev/null || pip3 install langsmith-cli", { stdio: "pipe", timeout: 60000 });
-    console.log(`  ${GREEN}✓${RESET} langsmith-cli installed`);
+    console.log(`\n  ${BOLD}langsmith-cli${RESET} — ${YELLOW}required${RESET} for LLM-as-judge evaluation`);
+    console.log(`  ${DIM}The evaluator agent uses it to read experiment outputs and write scores.${RESET}`);
+    console.log(`\n  Installing langsmith-cli...`);
+    try {
+      execSync("uv tool install langsmith-cli 2>/dev/null || pip install langsmith-cli 2>/dev/null || pip3 install langsmith-cli", { stdio: "pipe", timeout: 60000 });
+      console.log(`  ${GREEN}✓${RESET} langsmith-cli installed`);
 
-
-
-
-    }
-  } catch {
-    console.log(`  ${YELLOW}!${RESET} Could not install. Try manually: ${DIM}uv tool install langsmith-cli${RESET}`);
+      // If we have a key, auto-authenticate
+      if (hasKey && fs.existsSync(langsmithCredsFile)) {
+        console.log(`  ${GREEN}✓${RESET} langsmith-cli auto-authenticated (credentials file exists)`);
       }
+    } catch {
+      console.log(`  ${RED}!${RESET} Could not install langsmith-cli.`);
+      console.log(`  ${BOLD}This is required.${RESET} Install manually: ${DIM}uv tool install langsmith-cli${RESET}`);
     }
   }
 }
package/package.json
CHANGED
package/skills/evolve/SKILL.md
CHANGED

@@ -75,13 +75,15 @@ python3 -c "import json; c=json.load(open('.evolver.json')); print(f'v{c[\"itera
 
 ### 1.5. Gather Trace Insights
 
-
+Read the best experiment from config. If null (no baseline was run), skip trace insights for this iteration — proposers will work blind on the first pass:
 
 ```bash
-BEST=$(python3 -c "import json;
-
-
-
+BEST=$(python3 -c "import json; b=json.load(open('.evolver.json')).get('best_experiment'); print(b if b else '')")
+if [ -n "$BEST" ]; then
+  $EVOLVER_PY $TOOLS/trace_insights.py \
+    --from-experiment "$BEST" \
+    --output trace_insights.json 2>/dev/null
+fi
 ```
 
 If a production project is configured, also gather production insights:

@@ -99,17 +101,20 @@ fi
 
 ### 1.8. Analyze Per-Task Failures
 
-
+If `$BEST` is set (not the first iteration without baseline), read results and cluster failures:
 
 ```bash
-
-
-
+if [ -n "$BEST" ]; then
+  $EVOLVER_PY $TOOLS/read_results.py \
+    --experiment "$BEST" \
+    --config .evolver.json \
+    --output best_results.json 2>/dev/null
+fi
 ```
 
-
+If `best_results.json` exists, parse it to find failing examples (score < 0.7). Group by metadata or error pattern.
 Generate adaptive briefings for Candidates D and E (same logic as v2).
+If no best_results.json (first iteration without baseline), all proposers work from code analysis only — no failure data available.
 
 ### 2. Spawn 5 Proposers in Parallel
 

@@ -172,7 +177,7 @@ If ALL_PASSING: D gets `creative`, E gets `efficiency`.
 
 Wait for all 5 to complete.
 
-### 3.
+### 3. Run Target for Each Candidate
 
 For each worktree that has changes (proposer committed something):
 

@@ -184,7 +189,59 @@ $EVOLVER_PY $TOOLS/run_eval.py \
   --timeout 120
 ```
 
-Each candidate becomes a separate LangSmith experiment.
+Each candidate becomes a separate LangSmith experiment. This step runs the agent and applies code-based evaluators (has_output, token_efficiency) only.
+
+Collect all experiment names from the output (the `"experiment"` field in each JSON output).
+
+### 3.5. LLM-as-Judge Evaluation (Evaluator Agent)
+
+Check if the config has LLM-based evaluators (correctness, conciseness):
+
+```bash
+python3 -c "import json; c=json.load(open('.evolver.json')); llm=[k for k in c['evaluators'] if k in ('correctness','conciseness')]; print(','.join(llm) if llm else '')"
+```
+
+If LLM evaluators are configured, first verify langsmith-cli is available:
+
+```bash
+command -v langsmith-cli >/dev/null 2>&1 || { echo "ERROR: langsmith-cli not found. Install with: uv tool install langsmith-cli"; exit 1; }
+```
+
+Then spawn ONE evaluator agent that scores ALL candidates in a single pass. This is more efficient than spawning one agent per candidate:
+
+```
+Agent(
+  subagent_type: "evolver-evaluator",
+  description: "Evaluate all candidates for iteration v{NNN}",
+  prompt: |
+    <experiment>
+    Evaluate the following experiments (one per candidate):
+    - {experiment_name_a}
+    - {experiment_name_b}
+    - {experiment_name_c}
+    - {experiment_name_d}
+    - {experiment_name_e}
+    </experiment>
+
+    <evaluators>
+    Apply these evaluators to each run in each experiment:
+    - {llm_evaluator_list, e.g. "correctness", "conciseness"}
+    </evaluators>
+
+    <context>
+    Agent type: {framework} agent
+    Domain: {description from .evolver.json or entry point context}
+    Entry point: {entry_point}
+
+    For each experiment:
+    1. Read all runs via: langsmith-cli --json runs list --project "{experiment_name}" --fields id,inputs,outputs,error --is-root --limit 200
+    2. Judge each run's output against the input
+    3. Write scores via: langsmith-cli --json feedback create {run_id} --key {evaluator} --score {0.0|1.0} --comment "{reason}" --source model
+    </context>
+)
+```
+
+Wait for the evaluator agent to complete before proceeding.
 
 ### 4. Compare All Candidates
 
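
Editor's note: the evaluator split that this skill checks with a bash one-liner can be sketched as a small Python helper. `split_evaluators` is illustrative only; it mirrors the `.evolver.json` convention used by the skill:

```python
LLM_KEYS = ("correctness", "conciseness")  # scored by the evaluator agent

def split_evaluators(config):
    # Partition the config's evaluator keys into agent-judged vs code-based
    keys = config["evaluators"]
    llm = [k for k in keys if k in LLM_KEYS]
    code = [k for k in keys if k not in LLM_KEYS]
    return llm, code

cfg = {"evaluators": ["correctness", "token_efficiency", "latency"]}
print(split_evaluators(cfg))
# → (['correctness'], ['token_efficiency', 'latency'])
```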
package/skills/setup/SKILL.md
CHANGED

@@ -42,7 +42,7 @@ TOOLS=$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver
 EVOLVER_PY=$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")
 ```
 
-Use `$EVOLVER_PY` instead of `python3` for ALL tool invocations. This ensures the venv with langsmith
+Use `$EVOLVER_PY` instead of `python3` for ALL tool invocations. This ensures the venv with langsmith is used.
 
 ## Phase 1: Explore Project (automatic)
 

@@ -201,4 +201,4 @@ Next: run /evolver:evolve to start optimizing.
 - If `.evolver.json` already exists, ask before overwriting.
 - If the agent needs a venv, the run command should activate it: `cd {dir} && .venv/bin/python main.py`
 - If LangSmith connection fails, check API key and network.
-- The setup
+- The setup requires `langsmith` (Python SDK) and `langsmith-cli` (for evaluator agent).
package/tools/__pycache__/detect_stack.cpython-314.pyc
Binary file

package/tools/__pycache__/trace_logger.cpython-314.pyc
Binary file
package/tools/run_eval.py
CHANGED

@@ -2,7 +2,9 @@
 """Run LangSmith evaluation for a candidate in a worktree.
 
 Wraps client.evaluate() — runs the user's agent against the dataset
-with
+with code-based evaluators only (has_output, token_efficiency).
+LLM-as-judge scoring (correctness, conciseness) is handled post-hoc
+by the evolver-evaluator agent via langsmith-cli.
 
 Usage:
     python3 run_eval.py \

@@ -11,7 +13,7 @@ Usage:
         --experiment-prefix v001a \
         [--timeout 120]
 
-Requires: pip install langsmith
+Requires: pip install langsmith
 """
 
 import argparse

@@ -124,34 +126,30 @@ def make_target(entry_point, cwd):
 
 
 def load_evaluators(evaluator_keys):
-    """Load evaluators
-    from openevals.llm import create_llm_as_judge
-    from openevals.prompts import CORRECTNESS_PROMPT, CONCISENESS_PROMPT
+    """Load code-based evaluators only.
 
+    LLM-as-judge evaluators (correctness, conciseness) are handled
+    post-hoc by the evolver-evaluator agent via langsmith-cli.
+    """
     evaluators = []
+
+    # Always include has_output — verifies the agent produced something
+    def has_output_eval(inputs, outputs, **kwargs):
+        has = bool(outputs and outputs.get("output", outputs.get("answer", "")))
+        return {"key": "has_output", "score": 1.0 if has else 0.0}
+    evaluators.append(has_output_eval)
+
     for key in evaluator_keys:
-        if key == "
-
-
-            feedback_key="correctness",
-            model="openai:gpt-4.1-mini",
-            ))
-        elif key == "conciseness":
-            evaluators.append(create_llm_as_judge(
-                prompt=CONCISENESS_PROMPT,
-                feedback_key="conciseness",
-                model="openai:gpt-4.1-mini",
-            ))
-        elif key == "latency":
-            def latency_eval(inputs, outputs, **kwargs):
-                return {"key": "has_output", "score": 1.0 if outputs else 0.0}
-            evaluators.append(latency_eval)
+        if key == "latency":
+            # Latency is captured in traces, just check output exists
+            pass  # has_output already covers this
         elif key == "token_efficiency":
             def token_eval(inputs, outputs, **kwargs):
                 output_text = str(outputs.get("output", outputs.get("answer", "")))
                 score = min(1.0, 2000 / max(len(output_text), 1))
                 return {"key": "token_efficiency", "score": score}
             evaluators.append(token_eval)
+        # correctness, conciseness — skipped, handled by evaluator agent
 
     return evaluators
 

@@ -176,10 +174,16 @@ def main():
     target = make_target(config["entry_point"], args.worktree_path)
     evaluators = load_evaluators(config["evaluators"])
 
+    # Identify which evaluators need the agent (LLM-as-judge)
+    llm_evaluators = [k for k in config["evaluators"] if k in ("correctness", "conciseness")]
+    code_evaluators = [k for k in config["evaluators"] if k not in ("correctness", "conciseness")]
+
     print(f"Running evaluation: {args.experiment_prefix}")
     print(f"  Dataset: {config['dataset']}")
     print(f"  Worktree: {args.worktree_path}")
-    print(f"
+    print(f"  Code evaluators: {['has_output'] + code_evaluators}")
+    if llm_evaluators:
+        print(f"  Pending LLM evaluators (agent): {llm_evaluators}")
 
     try:
         results = client.evaluate(

@@ -192,7 +196,7 @@ def main():
 
     experiment_name = results.experiment_name
 
-    # Calculate mean score
+    # Calculate mean score from code-based evaluators only
     scores = []
     per_example = {}
     for result in results:

@@ -218,10 +222,13 @@ def main():
         "num_examples": len(per_example),
         "num_scores": len(scores),
         "per_example": per_example,
+        "pending_llm_evaluators": llm_evaluators,
     }
 
     print(json.dumps(output))
-    print(f"\
+    print(f"\nTarget runs complete: {len(per_example)} examples")
+    if llm_evaluators:
+        print(f"Awaiting evaluator agent for: {llm_evaluators}")
 
     except Exception as e:
         print(f"Evaluation failed: {e}", file=sys.stderr)
package/tools/setup.py
CHANGED

@@ -19,7 +19,7 @@ Usage:
         [--production-project my-prod-project] \
         [--evaluators correctness,conciseness]
 
-Requires: pip install langsmith
+Requires: pip install langsmith
 """
 
 import argparse

@@ -78,19 +78,38 @@ def ensure_langsmith_api_key():
 
 
 def check_dependencies():
-    """Verify langsmith
+    """Verify langsmith is installed."""
     missing = []
     try:
        import langsmith  # noqa: F401
    except ImportError:
        missing.append("langsmith")
-    try:
-        import openevals  # noqa: F401
-    except ImportError:
-        missing.append("openevals")
     return missing
 
 
+def resolve_dataset_name(client, base_name):
+    """Find an available dataset name by auto-incrementing the version suffix.
+
+    Tries base_name-eval-v1, v2, v3... until an unused name is found.
+    Returns (resolved_name, version_number).
+    """
+    existing = set()
+    try:
+        for ds in client.list_datasets():
+            existing.add(ds.name)
+    except Exception:
+        pass
+
+    for v in range(1, 100):
+        candidate = f"{base_name}-eval-v{v}"
+        if candidate not in existing:
+            return candidate, v
+
+    # Fallback: timestamp-based
+    ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
+    return f"{base_name}-eval-{ts}", 0
+
+
 def create_dataset_from_file(client, dataset_name, file_path):
     """Create a LangSmith dataset from a JSON file of inputs."""
     with open(file_path) as f:

@@ -177,17 +196,19 @@ def create_empty_dataset(client, dataset_name):
 
 
 def get_evaluators(goals, evaluator_names=None):
-    """Build evaluator list based on optimization goals.
-    from openevals.llm import create_llm_as_judge
-    from openevals.prompts import CORRECTNESS_PROMPT, CONCISENESS_PROMPT
+    """Build evaluator list based on optimization goals.
 
+    Returns only code-based evaluators. LLM-as-judge evaluators
+    (correctness, conciseness) are handled post-hoc by the
+    evolver-evaluator agent via langsmith-cli.
+    """
     evaluators = []
     evaluator_keys = []
 
-    # Map goals to
-
-        "accuracy":
-        "conciseness":
+    # Map goals to evaluator keys (LLM-based are recorded but not instantiated)
+    goal_to_key = {
+        "accuracy": "correctness",
+        "conciseness": "conciseness",
     }
 
     if evaluator_names:

@@ -195,39 +216,33 @@ def get_evaluators(goals, evaluator_names=None):
     else:
         names = []
         for goal in goals:
-            if goal in
-                names.append(
+            if goal in goal_to_key:
+                names.append(goal_to_key[goal])
         if not names:
             names = ["correctness"]  # default
 
+    # Record all evaluator keys (for config) but only instantiate code-based ones
     for name in names:
         if name in ("correctness", "accuracy"):
-            evaluators.append(create_llm_as_judge(
-                prompt=CORRECTNESS_PROMPT,
-                feedback_key="correctness",
-                model="openai:gpt-4.1-mini",
-            ))
             evaluator_keys.append("correctness")
+            # LLM-as-judge — handled by evaluator agent, not here
         elif name in ("conciseness", "brevity"):
-            evaluators.append(create_llm_as_judge(
-                prompt=CONCISENESS_PROMPT,
-                feedback_key="conciseness",
-                model="openai:gpt-4.1-mini",
-            ))
             evaluator_keys.append("conciseness")
+            # LLM-as-judge — handled by evaluator agent, not here
+
+    # Always include has_output
+    def has_output_eval(inputs, outputs, **kwargs):
+        has = bool(outputs and outputs.get("output", outputs.get("answer", "")))
+        return {"key": "has_output", "score": 1.0 if has else 0.0}
+    evaluators.append(has_output_eval)
 
     # Code-based evaluators for latency/tokens
     if "latency" in goals:
-        def latency_eval(inputs, outputs, **kwargs):
-            # Latency is captured in traces, not scored here
-            return {"key": "has_output", "score": 1.0 if outputs else 0.0}
-        evaluators.append(latency_eval)
         evaluator_keys.append("latency")
 
     if "token_efficiency" in goals:
         def token_eval(inputs, outputs, **kwargs):
             output_text = str(outputs.get("output", outputs.get("answer", "")))
-            # Penalize very long outputs (>2000 chars)
             score = min(1.0, 2000 / max(len(output_text), 1))
             return {"key": "token_efficiency", "score": score}
         evaluators.append(token_eval)

@@ -328,6 +343,7 @@ def main():
     parser.add_argument("--dataset-from-file", default=None, help="Create dataset from JSON file")
     parser.add_argument("--dataset-from-langsmith", default=None, help="Create dataset from LangSmith project")
     parser.add_argument("--production-project", default=None, help="Production LangSmith project")
+    parser.add_argument("--dataset-name", default=None, help="Explicit dataset name (skip auto-versioning)")
     parser.add_argument("--evaluators", default=None, help="Comma-separated evaluator names")
     parser.add_argument("--skip-baseline", action="store_true", help="Skip baseline evaluation")
     parser.add_argument("--output", default=".evolver.json", help="Output config path")

@@ -359,9 +375,19 @@ def main():
         sys.exit(1)
 
     project_name = f"evolver-{args.project_name}"
-    dataset_name = f"{args.project_name}-eval-v1"
     goals = [g.strip() for g in args.goals.split(",")]
 
+    # Resolve dataset name (explicit or auto-versioned)
+    if args.dataset_name:
+        dataset_name = args.dataset_name
+        print(f"Using explicit dataset name: '{dataset_name}'")
+    else:
+        dataset_name, version = resolve_dataset_name(client, args.project_name)
+        if version > 1:
+            print(f"Dataset name auto-versioned to '{dataset_name}' (v1-v{version-1} already exist)")
+        else:
+            print(f"Dataset: '{dataset_name}'")
+
     # Create dataset
     print(f"Creating dataset '{dataset_name}'...")
     if args.dataset_from_file:

@@ -386,18 +412,23 @@ def main():
     print(f"Configuring evaluators for goals: {goals}")
     evaluators, evaluator_keys = get_evaluators(goals, args.evaluators)
     print(f"  Active evaluators: {evaluator_keys}")
+    llm_evaluators = [k for k in evaluator_keys if k in ("correctness", "conciseness")]
+    if llm_evaluators:
+        print(f"  LLM evaluators (agent-based): {llm_evaluators}")
 
-    # Run baseline
+    # Run baseline (code-based evaluators only; LLM scoring done by evaluator agent)
     baseline_experiment = None
     baseline_score = 0.0
     if not args.skip_baseline and count > 0:
-        print(f"Running baseline
+        print(f"Running baseline target ({count} examples)...")
         try:
             baseline_experiment, baseline_score = run_baseline(
                 client, dataset_name, args.entry_point, evaluators,
             )
-            print(f"  Baseline score: {baseline_score:.3f}")
+            print(f"  Baseline has_output score: {baseline_score:.3f}")
             print(f"  Experiment: {baseline_experiment}")
+            if llm_evaluators:
+                print(f"  Note: LLM scoring pending — evaluator agent will run during /evolver:evolve")
         except Exception as e:
             print(f"  Baseline evaluation failed: {e}", file=sys.stderr)
             print("  Continuing with score 0.0")