harness-evolver 3.0.6 → 3.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -47,7 +47,7 @@ claude
47
47
  <table>
48
48
  <tr>
49
49
  <td><b>LangSmith-Native</b></td>
50
- <td>No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and Evaluators (openevals LLM-as-judge) for scoring. Everything is visible in the LangSmith UI.</td>
50
+ <td>No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and an agent-based LLM-as-judge for scoring via langsmith-cli. No external API keys needed. Everything is visible in the LangSmith UI.</td>
51
51
  </tr>
52
52
  <tr>
53
53
  <td><b>Real Code Evolution</b></td>
@@ -92,6 +92,7 @@ claude
92
92
  | **Architect** | Recommends multi-agent topology changes | Blue |
93
93
  | **Critic** | Validates evaluator quality, detects gaming | Red |
94
94
  | **TestGen** | Generates test inputs for LangSmith datasets | Cyan |
95
+ | **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
95
96
 
96
97
  ---
97
98
 
@@ -104,7 +105,8 @@ claude
104
105
  +- 1.5 Gather trace insights (cluster errors, tokens, latency)
105
106
  +- 1.8 Analyze per-task failures (adaptive briefings)
106
107
  +- 2. Spawn 5 proposers in parallel (each in a git worktree)
107
- +- 3. Evaluate each candidate (client.evaluate() -> LangSmith experiments)
108
+ +- 3. Run target for each candidate (client.evaluate() -> code-based evaluators)
109
+ +- 3.5 Spawn evaluator agent (reads outputs via langsmith-cli, judges, writes scores)
108
110
  +- 4. Compare experiments -> select winner + per-task champion
109
111
  +- 5. Merge winning worktree into main branch
110
112
  +- 5.5 Test suite growth (add regression examples to dataset)
@@ -119,13 +121,15 @@ claude
119
121
  ## Requirements
120
122
 
121
123
  - **LangSmith account** + `LANGSMITH_API_KEY`
122
- - **Python 3.10+** with `langsmith` and `openevals` packages
124
+ - **Python 3.10+** with `langsmith` package
125
+ - **langsmith-cli** (`uv tool install langsmith-cli`) — required for the evaluator agent
123
126
  - **Git** (for worktree-based isolation)
124
127
  - **Claude Code** (or Cursor/Codex/Windsurf)
125
128
 
126
129
  ```bash
127
130
  export LANGSMITH_API_KEY="lsv2_pt_..."
128
- pip install langsmith openevals
131
+ pip install langsmith
132
+ uv tool install langsmith-cli
129
133
  ```
130
134
 
131
135
  ---
@@ -0,0 +1,152 @@
1
+ ---
2
+ name: evolver-evaluator
3
+ description: |
4
+ Use this agent to evaluate experiment outputs using LLM-as-judge.
5
+ Reads run inputs/outputs from LangSmith via langsmith-cli, judges correctness,
6
+ and writes scores back as feedback. No external API keys needed.
7
+ tools: Read, Bash, Glob, Grep
8
+ color: yellow
9
+ ---
10
+
11
+ # Evolver — Evaluator Agent (v3)
12
+
13
+ You are an LLM evaluation judge. Your job is to read the outputs of an experiment from LangSmith, evaluate each one for correctness, and write scores back as feedback.
14
+
15
+ You ARE the LLM-as-judge. You replace the need for an external LLM API call.
16
+
17
+ ## Bootstrap
18
+
19
+ 1. Verify langsmith-cli is available:
20
+ ```bash
21
+ langsmith-cli --version
22
+ ```
23
+ If this fails, report the error and stop — langsmith-cli is required.
24
+
25
+ 2. Your prompt contains `<experiment>`, `<evaluators>`, and `<context>` blocks. Parse them to understand:
26
+ - Which experiment to evaluate
27
+ - What evaluation criteria to apply
28
+ - What the agent is supposed to do (domain context)
29
+
30
+ ## Tool: langsmith-cli
31
+
32
+ You interact with LangSmith exclusively through `langsmith-cli`. Always use `--json` for machine-readable output.
33
+
34
+ ### Reading experiment outputs
35
+
36
+ ```bash
37
+ langsmith-cli --json runs list \
38
+ --project "{experiment_name}" \
39
+ --fields id,inputs,outputs,error,reference_example_id \
40
+ --is-root \
41
+ --limit 200
42
+ ```
43
+
44
+ This returns one JSON object per line (JSONL). Each line has:
45
+ - `id` — the run ID (needed to write feedback)
46
+ - `inputs` — what was sent to the agent
47
+ - `outputs` — what the agent responded
48
+ - `error` — error message if the run failed
49
+ - `reference_example_id` — links back to the dataset example
50
+
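As a sketch of what consuming this JSONL looks like (field names per the list above; the sample line itself is invented for illustration):

```python
import json

# One invented sample line in the shape langsmith-cli emits (JSONL: one object per line).
sample = '{"id": "run-1", "inputs": {"question": "hi"}, "outputs": {"output": "hello"}, "error": null, "reference_example_id": "ex-1"}'

# Parse each non-empty line into a run dict.
runs = [json.loads(line) for line in sample.splitlines() if line.strip()]
for run in runs:
    # `id` is needed later to write feedback; `outputs` is what gets judged.
    print(run["id"], bool(run["outputs"]), run["error"])  # → run-1 True None
```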
51
+ ### Writing scores
52
+
53
+ For EACH run, after judging it:
54
+
55
+ ```bash
56
+ langsmith-cli --json feedback create {run_id} \
57
+ --key "{evaluator_key}" \
58
+ --score {score} \
59
+ --comment "{brief_reasoning}" \
60
+ --source model
61
+ ```
62
+
63
+ Use `--source model` since this is an LLM-generated evaluation.
64
+
65
+ ## Your Workflow
66
+
67
+ ### Phase 1: Read All Outputs
68
+
69
+ Fetch all runs from the experiment. Save the output to a file for reference:
70
+
71
+ ```bash
72
+ langsmith-cli --json runs list \
73
+ --project "{experiment_name}" \
74
+ --fields id,inputs,outputs,error,reference_example_id \
75
+ --is-root --limit 200 \
76
+ --output experiment_runs.jsonl
77
+ ```
78
+
79
+ Then read `experiment_runs.jsonl` to see all results.
80
+
81
+ ### Phase 2: Evaluate Each Run
82
+
83
+ For each run, apply the requested evaluators. The evaluators you may be asked to judge:
84
+
85
+ #### correctness
86
+ Judge: **Is the output a correct, accurate, and complete response to the input?**
87
+
88
+ Scoring:
89
+ - `1.0` — Correct and complete. The response accurately addresses the input.
90
+ - `0.0` — Incorrect, incomplete, or off-topic.
91
+
92
+ Consider:
93
+ - Does the response answer what was asked?
94
+ - Is the information factually accurate?
95
+ - Are there hallucinations or made-up facts?
96
+ - Is the response relevant to the domain?
97
+
98
+ #### conciseness
99
+ Judge: **Is the response appropriately concise without sacrificing quality?**
100
+
101
+ Scoring:
102
+ - `1.0` — Concise and complete. No unnecessary verbosity.
103
+ - `0.0` — Excessively verbose, repetitive, or padded.
104
+
105
+ ### Phase 3: Write All Scores
106
+
107
+ For each run you evaluated, write feedback via `langsmith-cli feedback create`.
108
+
109
+ Write scores in batches — evaluate all runs first, then write all scores. This is more efficient than alternating between reading and writing.
110
+
111
+ Example for one run:
112
+ ```bash
113
+ langsmith-cli --json feedback create "run-uuid-here" \
114
+ --key correctness \
115
+ --score 1.0 \
116
+ --comment "Response correctly identifies the applicable regulation and provides accurate guidance." \
117
+ --source model
118
+ ```
119
+
120
+ ### Phase 4: Summary
121
+
122
+ After writing all scores, compute the aggregate:
123
+
124
+ ```bash
125
+ langsmith-cli --json feedback list --run-id "{any_run_id}" --key correctness
126
+ ```
127
+
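The Return Protocol numbers can be computed directly from the scores you assigned; a minimal sketch (the scores list is illustrative):

```python
# Hypothetical scores collected while judging (1.0 = pass, 0.0 = fail).
scores = [1.0, 1.0, 0.0, 1.0]

mean_score = sum(scores) / len(scores)
passed = sum(1 for s in scores if s == 1.0)
pass_rate = 100 * passed / len(scores)
print(f"Mean score: {mean_score:.2f}")          # → Mean score: 0.75
print(f"Pass rate: {passed}/{len(scores)} ({pass_rate:.0f}%)")  # → Pass rate: 3/4 (75%)
```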
128
+ ## Error Handling
129
+
130
+ - If a run has `error` set and empty `outputs`: score it `0.0` with comment "Run failed: {error}"
131
+ - If a run has `outputs` but they contain an error message: score `0.0` with comment explaining the failure
132
+ - If `outputs` is empty but no error: score `0.0` with comment "Empty output"
133
+
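The first and third rules are mechanical and can be sketched as a triage function (run dicts shaped like the JSONL fields above; the function name is illustrative). The middle rule, an error message embedded inside `outputs`, still requires judgment and stays with the LLM judge:

```python
def triage(run):
    """Return (score, comment) for runs that fail before judging, else None."""
    outputs = run.get("outputs") or {}
    error = run.get("error")
    if error and not outputs:
        return 0.0, f"Run failed: {error}"
    if not outputs:
        return 0.0, "Empty output"
    return None  # has outputs: hand off to the LLM judge

print(triage({"outputs": {}, "error": "timeout"}))  # → (0.0, 'Run failed: timeout')
```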
134
+ ## Rules
135
+
136
+ 1. **Be a fair judge** — evaluate based on the criteria, not your preferences
137
+ 2. **Brief comments** — keep feedback comments under 200 characters
138
+ 3. **Binary scoring for correctness** — use 1.0 or 0.0, not partial scores (unless instructed otherwise)
139
+ 4. **Score EVERY run** — don't skip any, even failed ones
140
+ 5. **Domain awareness** — use the `<context>` block to understand what constitutes a "correct" answer in this domain
141
+
142
+ ## Return Protocol
143
+
144
+ When done, end your response with:
145
+
146
+ ## EVALUATION COMPLETE
147
+ - **Experiment**: {experiment_name}
148
+ - **Runs evaluated**: {N}
149
+ - **Evaluators applied**: {list}
150
+ - **Mean score**: {score}
151
+ - **Pass rate**: {N}/{total} ({percent}%)
152
+ - **Common failure patterns**: {brief list}
package/bin/install.js CHANGED
@@ -2,7 +2,7 @@
2
2
  /**
3
3
  * Harness Evolver v3 installer.
4
4
  * Copies skills/agents/tools to runtime directories (GSD pattern).
5
- * Installs Python dependencies (langsmith + openevals).
5
+ * Installs Python dependencies (langsmith) and langsmith-cli.
6
6
  *
7
7
  * Usage: npx harness-evolver@latest
8
8
  */
@@ -225,15 +225,15 @@ function installPythonDeps() {
225
225
 
226
226
  // Install/upgrade deps in the venv
227
227
  const installCommands = [
228
- `uv pip install --python "${venvPython}" langsmith openevals`,
229
- `"${venvPip}" install --upgrade langsmith openevals`,
230
- `"${venvPython}" -m pip install --upgrade langsmith openevals`,
228
+ `uv pip install --python "${venvPython}" langsmith`,
229
+ `"${venvPip}" install --upgrade langsmith`,
230
+ `"${venvPython}" -m pip install --upgrade langsmith`,
231
231
  ];
232
232
 
233
233
  for (const cmd of installCommands) {
234
234
  try {
235
235
  execSync(cmd, { stdio: "pipe", timeout: 120000 });
236
- console.log(` ${GREEN}✓${RESET} langsmith + openevals installed in venv`);
236
+ console.log(` ${GREEN}✓${RESET} langsmith installed in venv`);
237
237
  return true;
238
238
  } catch {
239
239
  continue;
@@ -241,7 +241,7 @@ function installPythonDeps() {
241
241
  }
242
242
 
243
243
  console.log(` ${YELLOW}!${RESET} Could not install packages in venv.`);
244
- console.log(` Run manually: ${BOLD}~/.evolver/venv/bin/pip install langsmith openevals${RESET}`);
244
+ console.log(` Run manually: ${BOLD}~/.evolver/venv/bin/pip install langsmith${RESET}`);
245
245
  return false;
246
246
  }
247
247
 
@@ -303,26 +303,24 @@ async function configureLangSmith(rl) {
303
303
  }
304
304
  }
305
305
 
306
- // --- Step 2: langsmith-cli ---
306
+ // --- Step 2: langsmith-cli (required for evaluator agent) ---
307
307
  if (hasLangsmithCli) {
308
308
  console.log(` ${GREEN}✓${RESET} langsmith-cli installed`);
309
309
  } else {
310
- console.log(`\n ${BOLD}langsmith-cli${RESET} — optional but useful for debugging traces`);
311
- console.log(` ${DIM}Quick project listing, trace inspection, run stats from terminal.${RESET}`);
312
- const lsCliAnswer = await ask(rl, `\n ${YELLOW}Install langsmith-cli? [Y/n]:${RESET} `);
313
- if (lsCliAnswer.trim().toLowerCase() !== "n") {
314
- console.log(`\n Installing langsmith-cli...`);
315
- try {
316
- execSync("uv tool install langsmith-cli 2>/dev/null || pip install langsmith-cli 2>/dev/null || pip3 install langsmith-cli", { stdio: "pipe", timeout: 60000 });
317
- console.log(` ${GREEN}✓${RESET} langsmith-cli installed`);
310
+ console.log(`\n ${BOLD}langsmith-cli${RESET} — ${YELLOW}required${RESET} for LLM-as-judge evaluation`);
311
+ console.log(` ${DIM}The evaluator agent uses it to read experiment outputs and write scores.${RESET}`);
312
+ console.log(`\n Installing langsmith-cli...`);
313
+ try {
314
+ execSync("uv tool install langsmith-cli 2>/dev/null || pip install langsmith-cli 2>/dev/null || pip3 install langsmith-cli", { stdio: "pipe", timeout: 60000 });
315
+ console.log(` ${GREEN}✓${RESET} langsmith-cli installed`);
318
316
 
319
- // If we have a key, auto-authenticate
320
- if (hasKey && fs.existsSync(langsmithCredsFile)) {
321
- console.log(` ${GREEN}✓${RESET} langsmith-cli auto-authenticated (credentials file exists)`);
322
- }
323
- } catch {
324
- console.log(` ${YELLOW}!${RESET} Could not install. Try manually: ${DIM}uv tool install langsmith-cli${RESET}`);
317
+ // If we have a key, auto-authenticate
318
+ if (hasKey && fs.existsSync(langsmithCredsFile)) {
319
+ console.log(` ${GREEN}✓${RESET} langsmith-cli auto-authenticated (credentials file exists)`);
325
320
  }
321
+ } catch {
322
+ console.log(` ${RED}!${RESET} Could not install langsmith-cli.`);
323
+ console.log(` ${BOLD}This is required.${RESET} Install manually: ${DIM}uv tool install langsmith-cli${RESET}`);
326
324
  }
327
325
  }
328
326
  }
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "harness-evolver",
3
- "version": "3.0.6",
3
+ "version": "3.1.0",
4
4
  "description": "LangSmith-native autonomous agent optimization for Claude Code",
5
5
  "author": "Raphael Valdetaro",
6
6
  "license": "MIT",
@@ -172,7 +172,7 @@ If ALL_PASSING: D gets `creative`, E gets `efficiency`.
172
172
 
173
173
  Wait for all 5 to complete.
174
174
 
175
- ### 3. Evaluate Each Candidate
175
+ ### 3. Run Target for Each Candidate
176
176
 
177
177
  For each worktree that has changes (proposer committed something):
178
178
 
@@ -184,7 +184,59 @@ $EVOLVER_PY $TOOLS/run_eval.py \
184
184
  --timeout 120
185
185
  ```
186
186
 
187
- Each candidate becomes a separate LangSmith experiment.
187
+ Each candidate becomes a separate LangSmith experiment. This step runs the agent and applies only the code-based evaluators (has_output, token_efficiency).
188
+
189
+ Collect all experiment names from the output (the `"experiment"` field in each JSON output).
190
+
191
+ ### 3.5. LLM-as-Judge Evaluation (Evaluator Agent)
192
+
193
+ Check if the config has LLM-based evaluators (correctness, conciseness):
194
+
195
+ ```bash
196
+ python3 -c "import json; c=json.load(open('.evolver.json')); llm=[k for k in c['evaluators'] if k in ('correctness','conciseness')]; print(','.join(llm) if llm else '')"
197
+ ```
198
+
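Written out, the one-liner above does the following (the config dict here stands in for the contents of `.evolver.json`, with illustrative values):

```python
# Stand-in for: json.load(open(".evolver.json"))["evaluators"]
config = {"evaluators": ["correctness", "token_efficiency", "conciseness"]}

# Keep only the evaluators that need LLM judgment; the rest stay code-based.
llm = [k for k in config["evaluators"] if k in ("correctness", "conciseness")]
print(",".join(llm) if llm else "")  # → correctness,conciseness
```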
199
+ If LLM evaluators are configured, first verify langsmith-cli is available:
200
+
201
+ ```bash
202
+ command -v langsmith-cli >/dev/null 2>&1 || { echo "ERROR: langsmith-cli not found. Install with: uv tool install langsmith-cli"; exit 1; }
203
+ ```
204
+
205
+ Then spawn ONE evaluator agent that scores ALL candidates in a single pass. This is more efficient than spawning one agent per candidate:
206
+
207
+ ```
208
+ Agent(
209
+ subagent_type: "evolver-evaluator",
210
+ description: "Evaluate all candidates for iteration v{NNN}",
211
+ prompt: |
212
+ <experiment>
213
+ Evaluate the following experiments (one per candidate):
214
+ - {experiment_name_a}
215
+ - {experiment_name_b}
216
+ - {experiment_name_c}
217
+ - {experiment_name_d}
218
+ - {experiment_name_e}
219
+ </experiment>
220
+
221
+ <evaluators>
222
+ Apply these evaluators to each run in each experiment:
223
+ - {llm_evaluator_list, e.g. "correctness", "conciseness"}
224
+ </evaluators>
225
+
226
+ <context>
227
+ Agent type: {framework} agent
228
+ Domain: {description from .evolver.json or entry point context}
229
+ Entry point: {entry_point}
230
+
231
+ For each experiment:
232
+ 1. Read all runs via: langsmith-cli --json runs list --project "{experiment_name}" --fields id,inputs,outputs,error --is-root --limit 200
233
+ 2. Judge each run's output against the input
234
+ 3. Write scores via: langsmith-cli --json feedback create {run_id} --key {evaluator} --score {0.0|1.0} --comment "{reason}" --source model
235
+ </context>
236
+ )
237
+ ```
238
+
239
+ Wait for the evaluator agent to complete before proceeding.
188
240
 
189
241
  ### 4. Compare All Candidates
190
242
 
@@ -42,7 +42,7 @@ TOOLS=$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver
42
42
  EVOLVER_PY=$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")
43
43
  ```
44
44
 
45
- Use `$EVOLVER_PY` instead of `python3` for ALL tool invocations. This ensures the venv with langsmith+openevals is used.
45
+ Use `$EVOLVER_PY` instead of `python3` for ALL tool invocations. This ensures the venv with langsmith is used.
46
46
 
47
47
  ## Phase 1: Explore Project (automatic)
48
48
 
@@ -201,4 +201,4 @@ Next: run /evolver:evolve to start optimizing.
201
201
  - If `.evolver.json` already exists, ask before overwriting.
202
202
  - If the agent needs a venv, the run command should activate it: `cd {dir} && .venv/bin/python main.py`
203
203
  - If LangSmith connection fails, check API key and network.
204
- - The setup installs `langsmith` and `openevals` if missing.
204
+ - The setup requires `langsmith` (Python SDK) and `langsmith-cli` (for the evaluator agent).
package/tools/run_eval.py CHANGED
@@ -2,7 +2,9 @@
2
2
  """Run LangSmith evaluation for a candidate in a worktree.
3
3
 
4
4
  Wraps client.evaluate() — runs the user's agent against the dataset
5
- with configured evaluators, from within a specific directory (worktree).
5
+ with code-based evaluators only (has_output, token_efficiency).
6
+ LLM-as-judge scoring (correctness, conciseness) is handled post-hoc
7
+ by the evolver-evaluator agent via langsmith-cli.
6
8
 
7
9
  Usage:
8
10
  python3 run_eval.py \
@@ -11,7 +13,7 @@ Usage:
11
13
  --experiment-prefix v001a \
12
14
  [--timeout 120]
13
15
 
14
- Requires: pip install langsmith openevals
16
+ Requires: pip install langsmith
15
17
  """
16
18
 
17
19
  import argparse
@@ -124,34 +126,30 @@ def make_target(entry_point, cwd):
124
126
 
125
127
 
126
128
  def load_evaluators(evaluator_keys):
127
- """Load evaluators by key name."""
128
- from openevals.llm import create_llm_as_judge
129
- from openevals.prompts import CORRECTNESS_PROMPT, CONCISENESS_PROMPT
129
+ """Load code-based evaluators only.
130
130
 
131
+ LLM-as-judge evaluators (correctness, conciseness) are handled
132
+ post-hoc by the evolver-evaluator agent via langsmith-cli.
133
+ """
131
134
  evaluators = []
135
+
136
+ # Always include has_output — verifies the agent produced something
137
+ def has_output_eval(inputs, outputs, **kwargs):
138
+ has = bool(outputs and outputs.get("output", outputs.get("answer", "")))
139
+ return {"key": "has_output", "score": 1.0 if has else 0.0}
140
+ evaluators.append(has_output_eval)
141
+
132
142
  for key in evaluator_keys:
133
- if key == "correctness":
134
- evaluators.append(create_llm_as_judge(
135
- prompt=CORRECTNESS_PROMPT,
136
- feedback_key="correctness",
137
- model="openai:gpt-4.1-mini",
138
- ))
139
- elif key == "conciseness":
140
- evaluators.append(create_llm_as_judge(
141
- prompt=CONCISENESS_PROMPT,
142
- feedback_key="conciseness",
143
- model="openai:gpt-4.1-mini",
144
- ))
145
- elif key == "latency":
146
- def latency_eval(inputs, outputs, **kwargs):
147
- return {"key": "has_output", "score": 1.0 if outputs else 0.0}
148
- evaluators.append(latency_eval)
143
+ if key == "latency":
144
+ # Latency is captured in traces, just check output exists
145
+ pass # has_output already covers this
149
146
  elif key == "token_efficiency":
150
147
  def token_eval(inputs, outputs, **kwargs):
151
148
  output_text = str(outputs.get("output", outputs.get("answer", "")))
152
149
  score = min(1.0, 2000 / max(len(output_text), 1))
153
150
  return {"key": "token_efficiency", "score": score}
154
151
  evaluators.append(token_eval)
152
+ # correctness, conciseness — skipped, handled by evaluator agent
155
153
 
156
154
  return evaluators
157
155
 
@@ -176,10 +174,16 @@ def main():
176
174
  target = make_target(config["entry_point"], args.worktree_path)
177
175
  evaluators = load_evaluators(config["evaluators"])
178
176
 
177
+ # Identify which evaluators need the agent (LLM-as-judge)
178
+ llm_evaluators = [k for k in config["evaluators"] if k in ("correctness", "conciseness")]
179
+ code_evaluators = [k for k in config["evaluators"] if k not in ("correctness", "conciseness")]
180
+
179
181
  print(f"Running evaluation: {args.experiment_prefix}")
180
182
  print(f" Dataset: {config['dataset']}")
181
183
  print(f" Worktree: {args.worktree_path}")
182
- print(f" Evaluators: {config['evaluators']}")
184
+ print(f" Code evaluators: {['has_output'] + code_evaluators}")
185
+ if llm_evaluators:
186
+ print(f" Pending LLM evaluators (agent): {llm_evaluators}")
183
187
 
184
188
  try:
185
189
  results = client.evaluate(
@@ -192,7 +196,7 @@ def main():
192
196
 
193
197
  experiment_name = results.experiment_name
194
198
 
195
- # Calculate mean score
199
+ # Calculate mean score from code-based evaluators only
196
200
  scores = []
197
201
  per_example = {}
198
202
  for result in results:
@@ -218,10 +222,13 @@ def main():
218
222
  "num_examples": len(per_example),
219
223
  "num_scores": len(scores),
220
224
  "per_example": per_example,
225
+ "pending_llm_evaluators": llm_evaluators,
221
226
  }
222
227
 
223
228
  print(json.dumps(output))
224
- print(f"\nEvaluation complete: {mean_score:.3f} ({len(per_example)} examples)")
229
+ print(f"\nTarget runs complete: {len(per_example)} examples")
230
+ if llm_evaluators:
231
+ print(f"Awaiting evaluator agent for: {llm_evaluators}")
225
232
 
226
233
  except Exception as e:
227
234
  print(f"Evaluation failed: {e}", file=sys.stderr)
package/tools/setup.py CHANGED
@@ -19,7 +19,7 @@ Usage:
19
19
  [--production-project my-prod-project] \
20
20
  [--evaluators correctness,conciseness]
21
21
 
22
- Requires: pip install langsmith openevals
22
+ Requires: pip install langsmith
23
23
  """
24
24
 
25
25
  import argparse
@@ -78,16 +78,12 @@ def ensure_langsmith_api_key():
78
78
 
79
79
 
80
80
  def check_dependencies():
81
- """Verify langsmith and openevals are installed."""
81
+ """Verify langsmith is installed."""
82
82
  missing = []
83
83
  try:
84
84
  import langsmith # noqa: F401
85
85
  except ImportError:
86
86
  missing.append("langsmith")
87
- try:
88
- import openevals # noqa: F401
89
- except ImportError:
90
- missing.append("openevals")
91
87
  return missing
92
88
 
93
89
 
@@ -177,17 +173,19 @@ def create_empty_dataset(client, dataset_name):
177
173
 
178
174
 
179
175
  def get_evaluators(goals, evaluator_names=None):
180
- """Build evaluator list based on optimization goals."""
181
- from openevals.llm import create_llm_as_judge
182
- from openevals.prompts import CORRECTNESS_PROMPT, CONCISENESS_PROMPT
176
+ """Build evaluator list based on optimization goals.
183
177
 
178
+ Returns only code-based evaluators. LLM-as-judge evaluators
179
+ (correctness, conciseness) are handled post-hoc by the
180
+ evolver-evaluator agent via langsmith-cli.
181
+ """
184
182
  evaluators = []
185
183
  evaluator_keys = []
186
184
 
187
- # Map goals to evaluators
188
- goal_map = {
189
- "accuracy": ("correctness", CORRECTNESS_PROMPT),
190
- "conciseness": ("conciseness", CONCISENESS_PROMPT),
185
+ # Map goals to evaluator keys (LLM-based are recorded but not instantiated)
186
+ goal_to_key = {
187
+ "accuracy": "correctness",
188
+ "conciseness": "conciseness",
191
189
  }
192
190
 
193
191
  if evaluator_names:
@@ -195,39 +193,33 @@ def get_evaluators(goals, evaluator_names=None):
195
193
  else:
196
194
  names = []
197
195
  for goal in goals:
198
- if goal in goal_map:
199
- names.append(goal_map[goal][0])
196
+ if goal in goal_to_key:
197
+ names.append(goal_to_key[goal])
200
198
  if not names:
201
199
  names = ["correctness"] # default
202
200
 
201
+ # Record all evaluator keys (for config) but only instantiate code-based ones
203
202
  for name in names:
204
203
  if name in ("correctness", "accuracy"):
205
- evaluators.append(create_llm_as_judge(
206
- prompt=CORRECTNESS_PROMPT,
207
- feedback_key="correctness",
208
- model="openai:gpt-4.1-mini",
209
- ))
210
204
  evaluator_keys.append("correctness")
205
+ # LLM-as-judge — handled by evaluator agent, not here
211
206
  elif name in ("conciseness", "brevity"):
212
- evaluators.append(create_llm_as_judge(
213
- prompt=CONCISENESS_PROMPT,
214
- feedback_key="conciseness",
215
- model="openai:gpt-4.1-mini",
216
- ))
217
207
  evaluator_keys.append("conciseness")
208
+ # LLM-as-judge — handled by evaluator agent, not here
209
+
210
+ # Always include has_output
211
+ def has_output_eval(inputs, outputs, **kwargs):
212
+ has = bool(outputs and outputs.get("output", outputs.get("answer", "")))
213
+ return {"key": "has_output", "score": 1.0 if has else 0.0}
214
+ evaluators.append(has_output_eval)
218
215
 
219
216
  # Code-based evaluators for latency/tokens
220
217
  if "latency" in goals:
221
- def latency_eval(inputs, outputs, **kwargs):
222
- # Latency is captured in traces, not scored here
223
- return {"key": "has_output", "score": 1.0 if outputs else 0.0}
224
- evaluators.append(latency_eval)
225
218
  evaluator_keys.append("latency")
226
219
 
227
220
  if "token_efficiency" in goals:
228
221
  def token_eval(inputs, outputs, **kwargs):
229
222
  output_text = str(outputs.get("output", outputs.get("answer", "")))
230
- # Penalize very long outputs (>2000 chars)
231
223
  score = min(1.0, 2000 / max(len(output_text), 1))
232
224
  return {"key": "token_efficiency", "score": score}
233
225
  evaluators.append(token_eval)
@@ -386,18 +378,23 @@ def main():
386
378
  print(f"Configuring evaluators for goals: {goals}")
387
379
  evaluators, evaluator_keys = get_evaluators(goals, args.evaluators)
388
380
  print(f" Active evaluators: {evaluator_keys}")
381
+ llm_evaluators = [k for k in evaluator_keys if k in ("correctness", "conciseness")]
382
+ if llm_evaluators:
383
+ print(f" LLM evaluators (agent-based): {llm_evaluators}")
389
384
 
390
- # Run baseline
385
+ # Run baseline (code-based evaluators only; LLM scoring done by evaluator agent)
391
386
  baseline_experiment = None
392
387
  baseline_score = 0.0
393
388
  if not args.skip_baseline and count > 0:
394
- print(f"Running baseline evaluation ({count} examples)...")
389
+ print(f"Running baseline target ({count} examples)...")
395
390
  try:
396
391
  baseline_experiment, baseline_score = run_baseline(
397
392
  client, dataset_name, args.entry_point, evaluators,
398
393
  )
399
- print(f" Baseline score: {baseline_score:.3f}")
394
+ print(f" Baseline has_output score: {baseline_score:.3f}")
400
395
  print(f" Experiment: {baseline_experiment}")
396
+ if llm_evaluators:
397
+ print(f" Note: LLM scoring pending — evaluator agent will run during /evolver:evolve")
401
398
  except Exception as e:
402
399
  print(f" Baseline evaluation failed: {e}", file=sys.stderr)
403
400
  print(" Continuing with score 0.0")