harness-evolver 3.0.6 → 3.1.1

package/README.md CHANGED
@@ -47,7 +47,7 @@ claude
 <table>
 <tr>
 <td><b>LangSmith-Native</b></td>
-<td>No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and Evaluators (openevals LLM-as-judge) for scoring. Everything is visible in the LangSmith UI.</td>
+<td>No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and an agent-based LLM-as-judge for scoring via langsmith-cli. No external API keys needed. Everything is visible in the LangSmith UI.</td>
 </tr>
 <tr>
 <td><b>Real Code Evolution</b></td>
@@ -92,6 +92,7 @@ claude
 | **Architect** | Recommends multi-agent topology changes | Blue |
 | **Critic** | Validates evaluator quality, detects gaming | Red |
 | **TestGen** | Generates test inputs for LangSmith datasets | Cyan |
+| **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
 
 ---
 
@@ -104,7 +105,8 @@ claude
 +- 1.5 Gather trace insights (cluster errors, tokens, latency)
 +- 1.8 Analyze per-task failures (adaptive briefings)
 +- 2. Spawn 5 proposers in parallel (each in a git worktree)
-+- 3. Evaluate each candidate (client.evaluate() -> LangSmith experiments)
++- 3. Run target for each candidate (client.evaluate() -> code-based evaluators)
++- 3.5 Spawn evaluator agent (reads outputs via langsmith-cli, judges, writes scores)
 +- 4. Compare experiments -> select winner + per-task champion
 +- 5. Merge winning worktree into main branch
 +- 5.5 Test suite growth (add regression examples to dataset)
@@ -119,13 +121,15 @@ claude
 ## Requirements
 
 - **LangSmith account** + `LANGSMITH_API_KEY`
-- **Python 3.10+** with `langsmith` and `openevals` packages
+- **Python 3.10+** with `langsmith` package
+- **langsmith-cli** (`uv tool install langsmith-cli`) — required for evaluator agent
 - **Git** (for worktree-based isolation)
 - **Claude Code** (or Cursor/Codex/Windsurf)
 
 ```bash
 export LANGSMITH_API_KEY="lsv2_pt_..."
-pip install langsmith openevals
+pip install langsmith
+uv tool install langsmith-cli
 ```
 
 ---
@@ -0,0 +1,152 @@
+---
+name: evolver-evaluator
+description: |
+  Use this agent to evaluate experiment outputs using LLM-as-judge.
+  Reads run inputs/outputs from LangSmith via langsmith-cli, judges correctness,
+  and writes scores back as feedback. No external API keys needed.
+tools: Read, Bash, Glob, Grep
+color: yellow
+---
+
+# Evolver — Evaluator Agent (v3)
+
+You are an LLM evaluation judge. Your job is to read the outputs of an experiment from LangSmith, evaluate each one for correctness, and write scores back as feedback.
+
+You ARE the LLM-as-judge. You replace the need for an external LLM API call.
+
+## Bootstrap
+
+1. Verify langsmith-cli is available:
+   ```bash
+   langsmith-cli --version
+   ```
+   If this fails, report the error and stop — langsmith-cli is required.
+
+2. Your prompt contains `<experiment>`, `<evaluators>`, and `<context>` blocks. Parse them to understand:
+   - Which experiment to evaluate
+   - What evaluation criteria to apply
+   - What the agent is supposed to do (domain context)
+
+## Tool: langsmith-cli
+
+You interact with LangSmith exclusively through `langsmith-cli`. Always use `--json` for machine-readable output.
+
+### Reading experiment outputs
+
+```bash
+langsmith-cli --json runs list \
+  --project "{experiment_name}" \
+  --fields id,inputs,outputs,error,reference_example_id \
+  --is-root \
+  --limit 200
+```
+
+This returns one JSON object per line (JSONL). Each line has:
+- `id` — the run ID (needed to write feedback)
+- `inputs` — what was sent to the agent
+- `outputs` — what the agent responded
+- `error` — error message if the run failed
+- `reference_example_id` — links back to the dataset example
+
+### Writing scores
+
+For EACH run, after judging it:
+
+```bash
+langsmith-cli --json feedback create {run_id} \
+  --key "{evaluator_key}" \
+  --score {score} \
+  --comment "{brief_reasoning}" \
+  --source model
+```
+
+Use `--source model` since this is an LLM-generated evaluation.
+
+## Your Workflow
+
+### Phase 1: Read All Outputs
+
+Fetch all runs from the experiment. Save the output to a file for reference:
+
+```bash
+langsmith-cli --json runs list \
+  --project "{experiment_name}" \
+  --fields id,inputs,outputs,error,reference_example_id \
+  --is-root --limit 200 \
+  --output experiment_runs.jsonl
+```
+
+Then read `experiment_runs.jsonl` to see all results.
+
+### Phase 2: Evaluate Each Run
+
+For each run, apply the requested evaluators. The evaluators you may be asked to judge:
+
+#### correctness
+Judge: **Is the output a correct, accurate, and complete response to the input?**
+
+Scoring:
+- `1.0` — Correct and complete. The response accurately addresses the input.
+- `0.0` — Incorrect, incomplete, or off-topic.
+
+Consider:
+- Does the response answer what was asked?
+- Is the information factually accurate?
+- Are there hallucinations or made-up facts?
+- Is the response relevant to the domain?
+
+#### conciseness
+Judge: **Is the response appropriately concise without sacrificing quality?**
+
+Scoring:
+- `1.0` — Concise and complete. No unnecessary verbosity.
+- `0.0` — Excessively verbose, repetitive, or padded.
+
+### Phase 3: Write All Scores
+
+For each run you evaluated, write feedback via `langsmith-cli feedback create`.
+
+Write scores in batches — evaluate all runs first, then write all scores. This is more efficient than alternating between reading and writing.
+
+Example for one run:
+```bash
+langsmith-cli --json feedback create "run-uuid-here" \
+  --key correctness \
+  --score 1.0 \
+  --comment "Response correctly identifies the applicable regulation and provides accurate guidance." \
+  --source model
+```
+
+### Phase 4: Summary
+
+After writing all scores, compute the aggregate:
+
+```bash
+langsmith-cli --json feedback list --run-id "{any_run_id}" --key correctness
+```
+
+## Error Handling
+
+- If a run has `error` set and empty `outputs`: score it `0.0` with comment "Run failed: {error}"
+- If a run has `outputs` but they contain an error message: score `0.0` with comment explaining the failure
+- If `outputs` is empty but no error: score `0.0` with comment "Empty output"
+
+## Rules
+
+1. **Be a fair judge** — evaluate based on the criteria, not your preferences
+2. **Brief comments** — keep feedback comments under 200 characters
+3. **Binary scoring for correctness** — use 1.0 or 0.0, not partial scores (unless instructed otherwise)
+4. **Score EVERY run** — don't skip any, even failed ones
+5. **Domain awareness** — use the `<context>` block to understand what constitutes a "correct" answer in this domain
+
+## Return Protocol
+
+When done, end your response with:
+
+## EVALUATION COMPLETE
+- **Experiment**: {experiment_name}
+- **Runs evaluated**: {N}
+- **Evaluators applied**: {list}
+- **Mean score**: {score}
+- **Pass rate**: {N}/{total} ({percent}%)
+- **Common failure patterns**: {brief list}
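The triage rules under "Error Handling" above are mechanical and can be sketched in plain Python before any LLM judgment happens. This is an illustrative sketch, not package code: the record fields (`id`, `inputs`, `outputs`, `error`) follow the JSONL shape described in the agent spec, and `triage` is a hypothetical helper name.

```python
import json

def triage(run: dict):
    """Apply the agent's Error Handling rules before any LLM judgment.

    Returns (score, comment) when a rule decides mechanically,
    or None when the run has real output and needs LLM-as-judge review.
    """
    outputs = run.get("outputs") or {}
    if run.get("error") and not outputs:
        return 0.0, f"Run failed: {run['error']}"
    if not outputs:
        return 0.0, "Empty output"
    return None  # real output: hand off to the judge

# Sample JSONL in the shape the agent spec describes (assumed, for illustration)
jsonl = "\n".join([
    json.dumps({"id": "r1", "inputs": {"q": "hi"}, "outputs": {"output": "hello"}, "error": None}),
    json.dumps({"id": "r2", "inputs": {"q": "hi"}, "outputs": {}, "error": "timeout"}),
    json.dumps({"id": "r3", "inputs": {"q": "hi"}, "outputs": {}, "error": None}),
])

auto_scored = []     # mechanical 0.0 scores with comments
needs_judging = []   # runs the judge must actually read
for line in jsonl.splitlines():
    run = json.loads(line)
    verdict = triage(run)
    if verdict is None:
        needs_judging.append(run["id"])
    else:
        auto_scored.append((run["id"], *verdict))

print(needs_judging)
print(auto_scored)
```

Separating this triage pass from the judgment pass is what lets the agent "score EVERY run" cheaply, reserving LLM effort for runs with real output.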
package/bin/install.js CHANGED
@@ -2,7 +2,7 @@
 /**
  * Harness Evolver v3 installer.
  * Copies skills/agents/tools to runtime directories (GSD pattern).
- * Installs Python dependencies (langsmith + openevals).
+ * Installs Python dependencies (langsmith) and langsmith-cli.
  *
  * Usage: npx harness-evolver@latest
  */
@@ -225,15 +225,15 @@ function installPythonDeps() {
 
   // Install/upgrade deps in the venv
   const installCommands = [
-    `uv pip install --python "${venvPython}" langsmith openevals`,
-    `"${venvPip}" install --upgrade langsmith openevals`,
-    `"${venvPython}" -m pip install --upgrade langsmith openevals`,
+    `uv pip install --python "${venvPython}" langsmith`,
+    `"${venvPip}" install --upgrade langsmith`,
+    `"${venvPython}" -m pip install --upgrade langsmith`,
   ];
 
   for (const cmd of installCommands) {
     try {
       execSync(cmd, { stdio: "pipe", timeout: 120000 });
-      console.log(`  ${GREEN}✓${RESET} langsmith + openevals installed in venv`);
+      console.log(`  ${GREEN}✓${RESET} langsmith installed in venv`);
       return true;
     } catch {
       continue;
@@ -241,7 +241,7 @@ function installPythonDeps() {
   }
 
   console.log(`  ${YELLOW}!${RESET} Could not install packages in venv.`);
-  console.log(`  Run manually: ${BOLD}~/.evolver/venv/bin/pip install langsmith openevals${RESET}`);
+  console.log(`  Run manually: ${BOLD}~/.evolver/venv/bin/pip install langsmith${RESET}`);
   return false;
 }
 
@@ -303,26 +303,24 @@ async function configureLangSmith(rl) {
     }
   }
 
-  // --- Step 2: langsmith-cli ---
+  // --- Step 2: langsmith-cli (required for evaluator agent) ---
   if (hasLangsmithCli) {
     console.log(`  ${GREEN}✓${RESET} langsmith-cli installed`);
   } else {
-    console.log(`\n  ${BOLD}langsmith-cli${RESET} — optional but useful for debugging traces`);
-    console.log(`  ${DIM}Quick project listing, trace inspection, run stats from terminal.${RESET}`);
-    const lsCliAnswer = await ask(rl, `\n  ${YELLOW}Install langsmith-cli? [Y/n]:${RESET} `);
-    if (lsCliAnswer.trim().toLowerCase() !== "n") {
-      console.log(`\n  Installing langsmith-cli...`);
-      try {
-        execSync("uv tool install langsmith-cli 2>/dev/null || pip install langsmith-cli 2>/dev/null || pip3 install langsmith-cli", { stdio: "pipe", timeout: 60000 });
-        console.log(`  ${GREEN}✓${RESET} langsmith-cli installed`);
+    console.log(`\n  ${BOLD}langsmith-cli${RESET} — ${YELLOW}required${RESET} for LLM-as-judge evaluation`);
+    console.log(`  ${DIM}The evaluator agent uses it to read experiment outputs and write scores.${RESET}`);
+    console.log(`\n  Installing langsmith-cli...`);
+    try {
+      execSync("uv tool install langsmith-cli 2>/dev/null || pip install langsmith-cli 2>/dev/null || pip3 install langsmith-cli", { stdio: "pipe", timeout: 60000 });
+      console.log(`  ${GREEN}✓${RESET} langsmith-cli installed`);
 
-        // If we have a key, auto-authenticate
-        if (hasKey && fs.existsSync(langsmithCredsFile)) {
-          console.log(`  ${GREEN}✓${RESET} langsmith-cli auto-authenticated (credentials file exists)`);
-        }
-      } catch {
-        console.log(`  ${YELLOW}!${RESET} Could not install. Try manually: ${DIM}uv tool install langsmith-cli${RESET}`);
+      // If we have a key, auto-authenticate
+      if (hasKey && fs.existsSync(langsmithCredsFile)) {
+        console.log(`  ${GREEN}✓${RESET} langsmith-cli auto-authenticated (credentials file exists)`);
       }
+    } catch {
+      console.log(`  ${RED}!${RESET} Could not install langsmith-cli.`);
+      console.log(`  ${BOLD}This is required.${RESET} Install manually: ${DIM}uv tool install langsmith-cli${RESET}`);
     }
   }
 }
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "harness-evolver",
-  "version": "3.0.6",
+  "version": "3.1.1",
   "description": "LangSmith-native autonomous agent optimization for Claude Code",
   "author": "Raphael Valdetaro",
   "license": "MIT",
@@ -75,13 +75,15 @@ python3 -c "import json; c=json.load(open('.evolver.json')); print(f'v{c[\"itera
 
 ### 1.5. Gather Trace Insights
 
-Run trace insights from the best experiment:
+Read the best experiment from config. If null (no baseline was run), skip trace insights for this iteration — proposers will work blind on the first pass:
 
 ```bash
-BEST=$(python3 -c "import json; print(json.load(open('.evolver.json'))['best_experiment'])")
-$EVOLVER_PY $TOOLS/trace_insights.py \
-  --from-experiment "$BEST" \
-  --output trace_insights.json 2>/dev/null
+BEST=$(python3 -c "import json; b=json.load(open('.evolver.json')).get('best_experiment'); print(b if b else '')")
+if [ -n "$BEST" ]; then
+  $EVOLVER_PY $TOOLS/trace_insights.py \
+    --from-experiment "$BEST" \
+    --output trace_insights.json 2>/dev/null
+fi
 ```
 
 If a production project is configured, also gather production insights:
@@ -99,17 +101,20 @@ fi
 
 ### 1.8. Analyze Per-Task Failures
 
-Read the best experiment results and cluster failures:
+If `$BEST` is set (not the first iteration without baseline), read results and cluster failures:
 
 ```bash
-$EVOLVER_PY $TOOLS/read_results.py \
-  --experiment "$BEST" \
-  --config .evolver.json \
-  --output best_results.json 2>/dev/null
+if [ -n "$BEST" ]; then
+  $EVOLVER_PY $TOOLS/read_results.py \
+    --experiment "$BEST" \
+    --config .evolver.json \
+    --output best_results.json 2>/dev/null
+fi
 ```
 
-Parse `best_results.json` to find failing examples (score < 0.7). Group by metadata or error pattern.
+If `best_results.json` exists, parse it to find failing examples (score < 0.7). Group by metadata or error pattern.
 Generate adaptive briefings for Candidates D and E (same logic as v2).
+If no best_results.json (first iteration without baseline), all proposers work from code analysis only — no failure data available.
 
 ### 2. Spawn 5 Proposers in Parallel
 
@@ -172,7 +177,7 @@ If ALL_PASSING: D gets `creative`, E gets `efficiency`.
 
 Wait for all 5 to complete.
 
-### 3. Evaluate Each Candidate
+### 3. Run Target for Each Candidate
 
 For each worktree that has changes (proposer committed something):
 
@@ -184,7 +189,59 @@ $EVOLVER_PY $TOOLS/run_eval.py \
   --timeout 120
 ```
 
-Each candidate becomes a separate LangSmith experiment.
+Each candidate becomes a separate LangSmith experiment. This step runs the agent and applies code-based evaluators (has_output, token_efficiency) only.
+
+Collect all experiment names from the output (the `"experiment"` field in each JSON output).
+
+### 3.5. LLM-as-Judge Evaluation (Evaluator Agent)
+
+Check if the config has LLM-based evaluators (correctness, conciseness):
+
+```bash
+python3 -c "import json; c=json.load(open('.evolver.json')); llm=[k for k in c['evaluators'] if k in ('correctness','conciseness')]; print(','.join(llm) if llm else '')"
+```
+
+If LLM evaluators are configured, first verify langsmith-cli is available:
+
+```bash
+command -v langsmith-cli >/dev/null 2>&1 || { echo "ERROR: langsmith-cli not found. Install with: uv tool install langsmith-cli"; exit 1; }
+```
+
+Then spawn ONE evaluator agent that scores ALL candidates in a single pass. This is more efficient than spawning one agent per candidate:
+
+```
+Agent(
+  subagent_type: "evolver-evaluator",
+  description: "Evaluate all candidates for iteration v{NNN}",
+  prompt: |
+    <experiment>
+    Evaluate the following experiments (one per candidate):
+    - {experiment_name_a}
+    - {experiment_name_b}
+    - {experiment_name_c}
+    - {experiment_name_d}
+    - {experiment_name_e}
+    </experiment>
+
+    <evaluators>
+    Apply these evaluators to each run in each experiment:
+    - {llm_evaluator_list, e.g. "correctness", "conciseness"}
+    </evaluators>
+
+    <context>
+    Agent type: {framework} agent
+    Domain: {description from .evolver.json or entry point context}
+    Entry point: {entry_point}
+
+    For each experiment:
+    1. Read all runs via: langsmith-cli --json runs list --project "{experiment_name}" --fields id,inputs,outputs,error --is-root --limit 200
+    2. Judge each run's output against the input
+    3. Write scores via: langsmith-cli --json feedback create {run_id} --key {evaluator} --score {0.0|1.0} --comment "{reason}" --source model
+    </context>
+)
+```
+
+Wait for the evaluator agent to complete before proceeding.
 
 ### 4. Compare All Candidates
 
@@ -42,7 +42,7 @@ TOOLS=$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver
 EVOLVER_PY=$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")
 ```
 
-Use `$EVOLVER_PY` instead of `python3` for ALL tool invocations. This ensures the venv with langsmith+openevals is used.
+Use `$EVOLVER_PY` instead of `python3` for ALL tool invocations. This ensures the venv with langsmith is used.
 
 ## Phase 1: Explore Project (automatic)
 
@@ -201,4 +201,4 @@ Next: run /evolver:evolve to start optimizing.
 - If `.evolver.json` already exists, ask before overwriting.
 - If the agent needs a venv, the run command should activate it: `cd {dir} && .venv/bin/python main.py`
 - If LangSmith connection fails, check API key and network.
-- The setup installs `langsmith` and `openevals` if missing.
+- The setup requires `langsmith` (Python SDK) and `langsmith-cli` (for evaluator agent).
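The config check in step 3.5 of the evolve workflow splits evaluator keys into LLM-judged and code-based sets. The same split can be sketched in plain Python (an illustrative sketch mirroring the diff's logic; the sample config is made up):

```python
import json

# Sample .evolver.json contents (illustrative, not real config)
config = json.loads('{"evaluators": ["correctness", "token_efficiency", "latency"]}')

LLM_KEYS = ("correctness", "conciseness")
llm_evaluators = [k for k in config["evaluators"] if k in LLM_KEYS]
code_evaluators = [k for k in config["evaluators"] if k not in LLM_KEYS]

print(llm_evaluators)   # scored post-hoc by the evaluator agent
print(code_evaluators)  # scored inline during client.evaluate()
```

Keeping this a key-name filter (rather than instantiating judge objects) is what lets the target run finish without any external LLM API key.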
package/tools/run_eval.py CHANGED
@@ -2,7 +2,9 @@
 """Run LangSmith evaluation for a candidate in a worktree.
 
 Wraps client.evaluate() — runs the user's agent against the dataset
-with configured evaluators, from within a specific directory (worktree).
+with code-based evaluators only (has_output, token_efficiency).
+LLM-as-judge scoring (correctness, conciseness) is handled post-hoc
+by the evolver-evaluator agent via langsmith-cli.
 
 Usage:
     python3 run_eval.py \
@@ -11,7 +13,7 @@ Usage:
         --experiment-prefix v001a \
         [--timeout 120]
 
-Requires: pip install langsmith openevals
+Requires: pip install langsmith
 """
 
 import argparse
@@ -124,34 +126,30 @@ def make_target(entry_point, cwd):
 
 
 def load_evaluators(evaluator_keys):
-    """Load evaluators by key name."""
-    from openevals.llm import create_llm_as_judge
-    from openevals.prompts import CORRECTNESS_PROMPT, CONCISENESS_PROMPT
+    """Load code-based evaluators only.
 
+    LLM-as-judge evaluators (correctness, conciseness) are handled
+    post-hoc by the evolver-evaluator agent via langsmith-cli.
+    """
     evaluators = []
+
+    # Always include has_output — verifies the agent produced something
+    def has_output_eval(inputs, outputs, **kwargs):
+        has = bool(outputs and outputs.get("output", outputs.get("answer", "")))
+        return {"key": "has_output", "score": 1.0 if has else 0.0}
+    evaluators.append(has_output_eval)
+
     for key in evaluator_keys:
-        if key == "correctness":
-            evaluators.append(create_llm_as_judge(
-                prompt=CORRECTNESS_PROMPT,
-                feedback_key="correctness",
-                model="openai:gpt-4.1-mini",
-            ))
-        elif key == "conciseness":
-            evaluators.append(create_llm_as_judge(
-                prompt=CONCISENESS_PROMPT,
-                feedback_key="conciseness",
-                model="openai:gpt-4.1-mini",
-            ))
-        elif key == "latency":
-            def latency_eval(inputs, outputs, **kwargs):
-                return {"key": "has_output", "score": 1.0 if outputs else 0.0}
-            evaluators.append(latency_eval)
+        if key == "latency":
+            # Latency is captured in traces, just check output exists
+            pass  # has_output already covers this
         elif key == "token_efficiency":
             def token_eval(inputs, outputs, **kwargs):
                 output_text = str(outputs.get("output", outputs.get("answer", "")))
                 score = min(1.0, 2000 / max(len(output_text), 1))
                 return {"key": "token_efficiency", "score": score}
             evaluators.append(token_eval)
+        # correctness, conciseness — skipped, handled by evaluator agent
 
     return evaluators
 
@@ -176,10 +174,16 @@ def main():
     target = make_target(config["entry_point"], args.worktree_path)
     evaluators = load_evaluators(config["evaluators"])
 
+    # Identify which evaluators need the agent (LLM-as-judge)
+    llm_evaluators = [k for k in config["evaluators"] if k in ("correctness", "conciseness")]
+    code_evaluators = [k for k in config["evaluators"] if k not in ("correctness", "conciseness")]
+
     print(f"Running evaluation: {args.experiment_prefix}")
     print(f"  Dataset: {config['dataset']}")
     print(f"  Worktree: {args.worktree_path}")
-    print(f"  Evaluators: {config['evaluators']}")
+    print(f"  Code evaluators: {['has_output'] + code_evaluators}")
+    if llm_evaluators:
+        print(f"  Pending LLM evaluators (agent): {llm_evaluators}")
 
     try:
         results = client.evaluate(
@@ -192,7 +196,7 @@ def main():
 
         experiment_name = results.experiment_name
 
-        # Calculate mean score
+        # Calculate mean score from code-based evaluators only
        scores = []
        per_example = {}
        for result in results:
@@ -218,10 +222,13 @@ def main():
            "num_examples": len(per_example),
            "num_scores": len(scores),
            "per_example": per_example,
+           "pending_llm_evaluators": llm_evaluators,
        }
 
        print(json.dumps(output))
-       print(f"\nEvaluation complete: {mean_score:.3f} ({len(per_example)} examples)")
+       print(f"\nTarget runs complete: {len(per_example)} examples")
+       if llm_evaluators:
+           print(f"Awaiting evaluator agent for: {llm_evaluators}")
 
    except Exception as e:
        print(f"Evaluation failed: {e}", file=sys.stderr)
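The code-based evaluators that run_eval.py now keeps inline are pure functions, so their scoring behavior can be checked in isolation. This sketch reproduces the two functions as they appear in the diff, with example calls:

```python
def has_output_eval(inputs, outputs, **kwargs):
    # Mirrors the diff: any non-empty "output" (or "answer") field counts
    has = bool(outputs and outputs.get("output", outputs.get("answer", "")))
    return {"key": "has_output", "score": 1.0 if has else 0.0}

def token_eval(inputs, outputs, **kwargs):
    # Mirrors the diff: full score up to 2000 chars, then decays as 2000/len
    output_text = str(outputs.get("output", outputs.get("answer", "")))
    score = min(1.0, 2000 / max(len(output_text), 1))
    return {"key": "token_efficiency", "score": score}

print(has_output_eval({}, {"output": "result"}))      # score 1.0
print(has_output_eval({}, {}))                        # score 0.0
print(token_eval({}, {"output": "x" * 4000}))         # score 0.5 (2000/4000)
```

Note the `(inputs, outputs, **kwargs)` signature: this is the callable shape `client.evaluate()` accepts for custom evaluators, which is why these drop in without any openevals dependency.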
package/tools/setup.py CHANGED
@@ -19,7 +19,7 @@ Usage:
         [--production-project my-prod-project] \
         [--evaluators correctness,conciseness]
 
-Requires: pip install langsmith openevals
+Requires: pip install langsmith
 """
 
 import argparse
@@ -78,19 +78,38 @@ def ensure_langsmith_api_key():
 
 
 def check_dependencies():
-    """Verify langsmith and openevals are installed."""
+    """Verify langsmith is installed."""
     missing = []
     try:
         import langsmith  # noqa: F401
     except ImportError:
         missing.append("langsmith")
-    try:
-        import openevals  # noqa: F401
-    except ImportError:
-        missing.append("openevals")
     return missing
 
 
+def resolve_dataset_name(client, base_name):
+    """Find an available dataset name by auto-incrementing the version suffix.
+
+    Tries base_name-eval-v1, v2, v3... until an unused name is found.
+    Returns (resolved_name, version_number).
+    """
+    existing = set()
+    try:
+        for ds in client.list_datasets():
+            existing.add(ds.name)
+    except Exception:
+        pass
+
+    for v in range(1, 100):
+        candidate = f"{base_name}-eval-v{v}"
+        if candidate not in existing:
+            return candidate, v
+
+    # Fallback: timestamp-based
+    ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
+    return f"{base_name}-eval-{ts}", 0
+
+
 def create_dataset_from_file(client, dataset_name, file_path):
     """Create a LangSmith dataset from a JSON file of inputs."""
     with open(file_path) as f:
@@ -177,17 +196,19 @@ def create_empty_dataset(client, dataset_name):
 
 
 def get_evaluators(goals, evaluator_names=None):
-    """Build evaluator list based on optimization goals."""
-    from openevals.llm import create_llm_as_judge
-    from openevals.prompts import CORRECTNESS_PROMPT, CONCISENESS_PROMPT
+    """Build evaluator list based on optimization goals.
 
+    Returns only code-based evaluators. LLM-as-judge evaluators
+    (correctness, conciseness) are handled post-hoc by the
+    evolver-evaluator agent via langsmith-cli.
+    """
     evaluators = []
     evaluator_keys = []
 
-    # Map goals to evaluators
-    goal_map = {
-        "accuracy": ("correctness", CORRECTNESS_PROMPT),
-        "conciseness": ("conciseness", CONCISENESS_PROMPT),
+    # Map goals to evaluator keys (LLM-based are recorded but not instantiated)
+    goal_to_key = {
+        "accuracy": "correctness",
+        "conciseness": "conciseness",
     }
 
     if evaluator_names:
@@ -195,39 +216,33 @@ def get_evaluators(goals, evaluator_names=None):
     else:
         names = []
         for goal in goals:
-            if goal in goal_map:
-                names.append(goal_map[goal][0])
+            if goal in goal_to_key:
+                names.append(goal_to_key[goal])
         if not names:
             names = ["correctness"]  # default
 
+    # Record all evaluator keys (for config) but only instantiate code-based ones
     for name in names:
         if name in ("correctness", "accuracy"):
-            evaluators.append(create_llm_as_judge(
-                prompt=CORRECTNESS_PROMPT,
-                feedback_key="correctness",
-                model="openai:gpt-4.1-mini",
-            ))
             evaluator_keys.append("correctness")
+            # LLM-as-judge — handled by evaluator agent, not here
         elif name in ("conciseness", "brevity"):
-            evaluators.append(create_llm_as_judge(
-                prompt=CONCISENESS_PROMPT,
-                feedback_key="conciseness",
-                model="openai:gpt-4.1-mini",
-            ))
             evaluator_keys.append("conciseness")
+            # LLM-as-judge — handled by evaluator agent, not here
+
+    # Always include has_output
+    def has_output_eval(inputs, outputs, **kwargs):
+        has = bool(outputs and outputs.get("output", outputs.get("answer", "")))
+        return {"key": "has_output", "score": 1.0 if has else 0.0}
+    evaluators.append(has_output_eval)
 
     # Code-based evaluators for latency/tokens
     if "latency" in goals:
-        def latency_eval(inputs, outputs, **kwargs):
-            # Latency is captured in traces, not scored here
-            return {"key": "has_output", "score": 1.0 if outputs else 0.0}
-        evaluators.append(latency_eval)
         evaluator_keys.append("latency")
 
     if "token_efficiency" in goals:
         def token_eval(inputs, outputs, **kwargs):
             output_text = str(outputs.get("output", outputs.get("answer", "")))
-            # Penalize very long outputs (>2000 chars)
             score = min(1.0, 2000 / max(len(output_text), 1))
             return {"key": "token_efficiency", "score": score}
         evaluators.append(token_eval)
@@ -328,6 +343,7 @@ def main():
     parser.add_argument("--dataset-from-file", default=None, help="Create dataset from JSON file")
     parser.add_argument("--dataset-from-langsmith", default=None, help="Create dataset from LangSmith project")
     parser.add_argument("--production-project", default=None, help="Production LangSmith project")
+    parser.add_argument("--dataset-name", default=None, help="Explicit dataset name (skip auto-versioning)")
     parser.add_argument("--evaluators", default=None, help="Comma-separated evaluator names")
     parser.add_argument("--skip-baseline", action="store_true", help="Skip baseline evaluation")
     parser.add_argument("--output", default=".evolver.json", help="Output config path")
@@ -359,9 +375,19 @@ def main():
         sys.exit(1)
 
     project_name = f"evolver-{args.project_name}"
-    dataset_name = f"{args.project_name}-eval-v1"
     goals = [g.strip() for g in args.goals.split(",")]
 
+    # Resolve dataset name (explicit or auto-versioned)
+    if args.dataset_name:
+        dataset_name = args.dataset_name
+        print(f"Using explicit dataset name: '{dataset_name}'")
+    else:
+        dataset_name, version = resolve_dataset_name(client, args.project_name)
+        if version > 1:
+            print(f"Dataset name auto-versioned to '{dataset_name}' (v1-v{version-1} already exist)")
+        else:
+            print(f"Dataset: '{dataset_name}'")
+
     # Create dataset
     print(f"Creating dataset '{dataset_name}'...")
     if args.dataset_from_file:
@@ -386,18 +412,23 @@ def main():
     print(f"Configuring evaluators for goals: {goals}")
     evaluators, evaluator_keys = get_evaluators(goals, args.evaluators)
     print(f"  Active evaluators: {evaluator_keys}")
+    llm_evaluators = [k for k in evaluator_keys if k in ("correctness", "conciseness")]
+    if llm_evaluators:
+        print(f"  LLM evaluators (agent-based): {llm_evaluators}")
 
-    # Run baseline
+    # Run baseline (code-based evaluators only; LLM scoring done by evaluator agent)
     baseline_experiment = None
     baseline_score = 0.0
     if not args.skip_baseline and count > 0:
-        print(f"Running baseline evaluation ({count} examples)...")
+        print(f"Running baseline target ({count} examples)...")
        try:
            baseline_experiment, baseline_score = run_baseline(
                client, dataset_name, args.entry_point, evaluators,
            )
-           print(f"  Baseline score: {baseline_score:.3f}")
+           print(f"  Baseline has_output score: {baseline_score:.3f}")
            print(f"  Experiment: {baseline_experiment}")
+           if llm_evaluators:
+               print(f"  Note: LLM scoring pending — evaluator agent will run during /evolver:evolve")
        except Exception as e:
            print(f"  Baseline evaluation failed: {e}", file=sys.stderr)
            print("  Continuing with score 0.0")
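The new `resolve_dataset_name` auto-versioning can be exercised without a LangSmith connection. This sketch pairs the diff's logic with a hypothetical `StubClient` standing in for `langsmith.Client` (only `list_datasets()` is faked); the timestamp fallback is replaced by an error since it never triggers here:

```python
class StubClient:
    """Hypothetical stand-in for langsmith.Client, exposing only list_datasets()."""
    def __init__(self, names):
        self._names = names
    def list_datasets(self):
        # Yield objects with a .name attribute, like the SDK's Dataset
        return [type("DS", (), {"name": n})() for n in self._names]

def resolve_dataset_name(client, base_name):
    # Same auto-increment logic as the diff: try -eval-v1, -eval-v2, ...
    existing = set()
    try:
        for ds in client.list_datasets():
            existing.add(ds.name)
    except Exception:
        pass
    for v in range(1, 100):
        candidate = f"{base_name}-eval-v{v}"
        if candidate not in existing:
            return candidate, v
    raise RuntimeError("exhausted")  # the diff falls back to a timestamp here

client = StubClient(["myagent-eval-v1", "myagent-eval-v2"])
print(resolve_dataset_name(client, "myagent"))   # first free slot after v2
```

Swallowing `list_datasets()` failures means an offline or unauthorized client degrades to `-eval-v1` rather than crashing setup, matching the `--dataset-name` escape hatch added to the CLI.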