harness-evolver 3.0.6 → 3.1.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +8 -4
- package/agents/evolver-evaluator.md +152 -0
- package/bin/install.js +19 -21
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +70 -13
- package/skills/setup/SKILL.md +2 -2
- package/tools/__pycache__/detect_stack.cpython-314.pyc +0 -0
- package/tools/__pycache__/trace_logger.cpython-314.pyc +0 -0
- package/tools/run_eval.py +31 -24
- package/tools/setup.py +65 -34
package/README.md
CHANGED

@@ -47,7 +47,7 @@ claude
 <table>
 <tr>
 <td><b>LangSmith-Native</b></td>
-<td>No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and
+<td>No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and an agent-based LLM-as-judge for scoring via langsmith-cli. No external API keys needed. Everything is visible in the LangSmith UI.</td>
 </tr>
 <tr>
 <td><b>Real Code Evolution</b></td>

@@ -92,6 +92,7 @@ claude
 | **Architect** | Recommends multi-agent topology changes | Blue |
 | **Critic** | Validates evaluator quality, detects gaming | Red |
 | **TestGen** | Generates test inputs for LangSmith datasets | Cyan |
+| **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
 
 ---
 
@@ -104,7 +105,8 @@ claude
 - 1.5 Gather trace insights (cluster errors, tokens, latency)
 - 1.8 Analyze per-task failures (adaptive briefings)
 - 2. Spawn 5 proposers in parallel (each in a git worktree)
-- 3.
+- 3. Run target for each candidate (client.evaluate() -> code-based evaluators)
+- 3.5 Spawn evaluator agent (reads outputs via langsmith-cli, judges, writes scores)
 - 4. Compare experiments -> select winner + per-task champion
 - 5. Merge winning worktree into main branch
 - 5.5 Test suite growth (add regression examples to dataset)

@@ -119,13 +121,15 @@ claude
 ## Requirements
 
 - **LangSmith account** + `LANGSMITH_API_KEY`
-- **Python 3.10+** with `langsmith`
+- **Python 3.10+** with `langsmith` package
+- **langsmith-cli** (`uv tool install langsmith-cli`) — required for evaluator agent
 - **Git** (for worktree-based isolation)
 - **Claude Code** (or Cursor/Codex/Windsurf)
 
 ```bash
 export LANGSMITH_API_KEY="lsv2_pt_..."
-pip install langsmith
+pip install langsmith
+uv tool install langsmith-cli
 ```
 
 ---
package/agents/evolver-evaluator.md
ADDED

@@ -0,0 +1,152 @@
+---
+name: evolver-evaluator
+description: |
+  Use this agent to evaluate experiment outputs using LLM-as-judge.
+  Reads run inputs/outputs from LangSmith via langsmith-cli, judges correctness,
+  and writes scores back as feedback. No external API keys needed.
+tools: Read, Bash, Glob, Grep
+color: yellow
+---
+
+# Evolver — Evaluator Agent (v3)
+
+You are an LLM evaluation judge. Your job is to read the outputs of an experiment from LangSmith, evaluate each one for correctness, and write scores back as feedback.
+
+You ARE the LLM-as-judge. You replace the need for an external LLM API call.
+
+## Bootstrap
+
+1. Verify langsmith-cli is available:
+   ```bash
+   langsmith-cli --version
+   ```
+   If this fails, report the error and stop — langsmith-cli is required.
+
+2. Your prompt contains `<experiment>`, `<evaluators>`, and `<context>` blocks. Parse them to understand:
+   - Which experiment to evaluate
+   - What evaluation criteria to apply
+   - What the agent is supposed to do (domain context)
+
+## Tool: langsmith-cli
+
+You interact with LangSmith exclusively through `langsmith-cli`. Always use `--json` for machine-readable output.
+
+### Reading experiment outputs
+
+```bash
+langsmith-cli --json runs list \
+  --project "{experiment_name}" \
+  --fields id,inputs,outputs,error,reference_example_id \
+  --is-root \
+  --limit 200
+```
+
+This returns one JSON object per line (JSONL). Each line has:
+- `id` — the run ID (needed to write feedback)
+- `inputs` — what was sent to the agent
+- `outputs` — what the agent responded
+- `error` — error message if the run failed
+- `reference_example_id` — links back to the dataset example
+
+### Writing scores
+
+For EACH run, after judging it:
+
+```bash
+langsmith-cli --json feedback create {run_id} \
+  --key "{evaluator_key}" \
+  --score {score} \
+  --comment "{brief_reasoning}" \
+  --source model
+```
+
+Use `--source model` since this is an LLM-generated evaluation.
+
+## Your Workflow
+
+### Phase 1: Read All Outputs
+
+Fetch all runs from the experiment. Save the output to a file for reference:
+
+```bash
+langsmith-cli --json runs list \
+  --project "{experiment_name}" \
+  --fields id,inputs,outputs,error,reference_example_id \
+  --is-root --limit 200 \
+  --output experiment_runs.jsonl
+```
+
+Then read `experiment_runs.jsonl` to see all results.
+
+### Phase 2: Evaluate Each Run
+
+For each run, apply the requested evaluators. The evaluators you may be asked to judge:
+
+#### correctness
+Judge: **Is the output a correct, accurate, and complete response to the input?**
+
+Scoring:
+- `1.0` — Correct and complete. The response accurately addresses the input.
+- `0.0` — Incorrect, incomplete, or off-topic.
+
+Consider:
+- Does the response answer what was asked?
+- Is the information factually accurate?
+- Are there hallucinations or made-up facts?
+- Is the response relevant to the domain?
+
+#### conciseness
+Judge: **Is the response appropriately concise without sacrificing quality?**
+
+Scoring:
+- `1.0` — Concise and complete. No unnecessary verbosity.
+- `0.0` — Excessively verbose, repetitive, or padded.
+
+### Phase 3: Write All Scores
+
+For each run you evaluated, write feedback via `langsmith-cli feedback create`.
+
+Write scores in batches — evaluate all runs first, then write all scores. This is more efficient than alternating between reading and writing.
+
+Example for one run:
+```bash
+langsmith-cli --json feedback create "run-uuid-here" \
+  --key correctness \
+  --score 1.0 \
+  --comment "Response correctly identifies the applicable regulation and provides accurate guidance." \
+  --source model
+```
+
+### Phase 4: Summary
+
+After writing all scores, compute the aggregate:
+
+```bash
+langsmith-cli --json feedback list --run-id "{any_run_id}" --key correctness
+```
+
+## Error Handling
+
+- If a run has `error` set and empty `outputs`: score it `0.0` with comment "Run failed: {error}"
+- If a run has `outputs` but they contain an error message: score `0.0` with comment explaining the failure
+- If `outputs` is empty but no error: score `0.0` with comment "Empty output"
+
+## Rules
+
+1. **Be a fair judge** — evaluate based on the criteria, not your preferences
+2. **Brief comments** — keep feedback comments under 200 characters
+3. **Binary scoring for correctness** — use 1.0 or 0.0, not partial scores (unless instructed otherwise)
+4. **Score EVERY run** — don't skip any, even failed ones
+5. **Domain awareness** — use the `<context>` block to understand what constitutes a "correct" answer in this domain
+
+## Return Protocol
+
+When done, end your response with:
+
+## EVALUATION COMPLETE
+- **Experiment**: {experiment_name}
+- **Runs evaluated**: {N}
+- **Evaluators applied**: {list}
+- **Mean score**: {score}
+- **Pass rate**: {N}/{total} ({percent}%)
+- **Common failure patterns**: {brief list}
package/bin/install.js
CHANGED

@@ -2,7 +2,7 @@
 /**
  * Harness Evolver v3 installer.
  * Copies skills/agents/tools to runtime directories (GSD pattern).
- * Installs Python dependencies (langsmith
+ * Installs Python dependencies (langsmith) and langsmith-cli.
  *
  * Usage: npx harness-evolver@latest
  */

@@ -225,15 +225,15 @@ function installPythonDeps() {
 
   // Install/upgrade deps in the venv
   const installCommands = [
-    `uv pip install --python "${venvPython}" langsmith
-    `"${venvPip}" install --upgrade langsmith
-    `"${venvPython}" -m pip install --upgrade langsmith
+    `uv pip install --python "${venvPython}" langsmith`,
+    `"${venvPip}" install --upgrade langsmith`,
+    `"${venvPython}" -m pip install --upgrade langsmith`,
   ];
 
   for (const cmd of installCommands) {
     try {
       execSync(cmd, { stdio: "pipe", timeout: 120000 });
-      console.log(`  ${GREEN}✓${RESET} langsmith
+      console.log(`  ${GREEN}✓${RESET} langsmith installed in venv`);
       return true;
     } catch {
       continue;

@@ -241,7 +241,7 @@ function installPythonDeps() {
   }
 
   console.log(`  ${YELLOW}!${RESET} Could not install packages in venv.`);
-  console.log(`  Run manually: ${BOLD}~/.evolver/venv/bin/pip install langsmith
+  console.log(`  Run manually: ${BOLD}~/.evolver/venv/bin/pip install langsmith${RESET}`);
   return false;
 }

@@ -303,26 +303,24 @@ async function configureLangSmith(rl) {
     }
   }
 
-  // --- Step 2: langsmith-cli ---
+  // --- Step 2: langsmith-cli (required for evaluator agent) ---
   if (hasLangsmithCli) {
     console.log(`  ${GREEN}✓${RESET} langsmith-cli installed`);
   } else {
-    console.log(`\n  ${BOLD}langsmith-cli${RESET} —
-    console.log(`  ${DIM}
-
-
-
-
-    execSync("uv tool install langsmith-cli 2>/dev/null || pip install langsmith-cli 2>/dev/null || pip3 install langsmith-cli", { stdio: "pipe", timeout: 60000 });
-    console.log(`  ${GREEN}✓${RESET} langsmith-cli installed`);
+    console.log(`\n  ${BOLD}langsmith-cli${RESET} — ${YELLOW}required${RESET} for LLM-as-judge evaluation`);
+    console.log(`  ${DIM}The evaluator agent uses it to read experiment outputs and write scores.${RESET}`);
+    console.log(`\n  Installing langsmith-cli...`);
+    try {
+      execSync("uv tool install langsmith-cli 2>/dev/null || pip install langsmith-cli 2>/dev/null || pip3 install langsmith-cli", { stdio: "pipe", timeout: 60000 });
+      console.log(`  ${GREEN}✓${RESET} langsmith-cli installed`);
 
-
-
-
-    }
-  } catch {
-    console.log(`  ${YELLOW}!${RESET} Could not install. Try manually: ${DIM}uv tool install langsmith-cli${RESET}`);
+      // If we have a key, auto-authenticate
+      if (hasKey && fs.existsSync(langsmithCredsFile)) {
+        console.log(`  ${GREEN}✓${RESET} langsmith-cli auto-authenticated (credentials file exists)`);
       }
+    } catch {
+      console.log(`  ${RED}!${RESET} Could not install langsmith-cli.`);
+      console.log(`  ${BOLD}This is required.${RESET} Install manually: ${DIM}uv tool install langsmith-cli${RESET}`);
     }
   }
 }
package/package.json
CHANGED
package/skills/evolve/SKILL.md
CHANGED

@@ -75,13 +75,15 @@ python3 -c "import json; c=json.load(open('.evolver.json')); print(f'v{c[\"itera
 
 ### 1.5. Gather Trace Insights
 
-
+Read the best experiment from config. If null (no baseline was run), skip trace insights for this iteration — proposers will work blind on the first pass:
 
 ```bash
-BEST=$(python3 -c "import json;
-
-
-
+BEST=$(python3 -c "import json; b=json.load(open('.evolver.json')).get('best_experiment'); print(b if b else '')")
+if [ -n "$BEST" ]; then
+  $EVOLVER_PY $TOOLS/trace_insights.py \
+    --from-experiment "$BEST" \
+    --output trace_insights.json 2>/dev/null
+fi
 ```
 
 If a production project is configured, also gather production insights:

@@ -99,17 +101,20 @@ fi
 
 ### 1.8. Analyze Per-Task Failures
 
-
+If `$BEST` is set (not the first iteration without baseline), read results and cluster failures:
 
 ```bash
-
-
-
+if [ -n "$BEST" ]; then
+  $EVOLVER_PY $TOOLS/read_results.py \
+    --experiment "$BEST" \
+    --config .evolver.json \
+    --output best_results.json 2>/dev/null
+fi
 ```
 
-
+If `best_results.json` exists, parse it to find failing examples (score < 0.7). Group by metadata or error pattern.
 Generate adaptive briefings for Candidates D and E (same logic as v2).
+If no best_results.json (first iteration without baseline), all proposers work from code analysis only — no failure data available.
 
 ### 2. Spawn 5 Proposers in Parallel
 

@@ -172,7 +177,7 @@ If ALL_PASSING: D gets `creative`, E gets `efficiency`.
 
 Wait for all 5 to complete.
 
-### 3.
+### 3. Run Target for Each Candidate
 
 For each worktree that has changes (proposer committed something):
 

@@ -184,7 +189,59 @@ $EVOLVER_PY $TOOLS/run_eval.py \
   --timeout 120
 ```
 
-Each candidate becomes a separate LangSmith experiment.
+Each candidate becomes a separate LangSmith experiment. This step runs the agent and applies code-based evaluators (has_output, token_efficiency) only.
+
+Collect all experiment names from the output (the `"experiment"` field in each JSON output).
+
+### 3.5. LLM-as-Judge Evaluation (Evaluator Agent)
+
+Check if the config has LLM-based evaluators (correctness, conciseness):
+
+```bash
+python3 -c "import json; c=json.load(open('.evolver.json')); llm=[k for k in c['evaluators'] if k in ('correctness','conciseness')]; print(','.join(llm) if llm else '')"
+```
+
+If LLM evaluators are configured, first verify langsmith-cli is available:
+
+```bash
+command -v langsmith-cli >/dev/null 2>&1 || { echo "ERROR: langsmith-cli not found. Install with: uv tool install langsmith-cli"; exit 1; }
+```
+
+Then spawn ONE evaluator agent that scores ALL candidates in a single pass. This is more efficient than spawning one agent per candidate:
+
+```
+Agent(
+  subagent_type: "evolver-evaluator",
+  description: "Evaluate all candidates for iteration v{NNN}",
+  prompt: |
+    <experiment>
+    Evaluate the following experiments (one per candidate):
+    - {experiment_name_a}
+    - {experiment_name_b}
+    - {experiment_name_c}
+    - {experiment_name_d}
+    - {experiment_name_e}
+    </experiment>
+
+    <evaluators>
+    Apply these evaluators to each run in each experiment:
+    - {llm_evaluator_list, e.g. "correctness", "conciseness"}
+    </evaluators>
+
+    <context>
+    Agent type: {framework} agent
+    Domain: {description from .evolver.json or entry point context}
+    Entry point: {entry_point}
+
+    For each experiment:
+    1. Read all runs via: langsmith-cli --json runs list --project "{experiment_name}" --fields id,inputs,outputs,error --is-root --limit 200
+    2. Judge each run's output against the input
+    3. Write scores via: langsmith-cli --json feedback create {run_id} --key {evaluator} --score {0.0|1.0} --comment "{reason}" --source model
+    </context>
+)
+```
+
+Wait for the evaluator agent to complete before proceeding.
 
 ### 4. Compare All Candidates
 
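
Editor's note: the evaluator split that this skill checks with a bash one-liner can be sketched as a small Python helper. `split_evaluators` is illustrative only; it mirrors the `.evolver.json` convention used by the skill:

```python
LLM_KEYS = ("correctness", "conciseness")  # scored by the evaluator agent

def split_evaluators(config):
    # Partition the config's evaluator keys into agent-judged vs code-based
    keys = config["evaluators"]
    llm = [k for k in keys if k in LLM_KEYS]
    code = [k for k in keys if k not in LLM_KEYS]
    return llm, code

cfg = {"evaluators": ["correctness", "token_efficiency", "latency"]}
print(split_evaluators(cfg))
# → (['correctness'], ['token_efficiency', 'latency'])
```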
package/skills/setup/SKILL.md
CHANGED

@@ -42,7 +42,7 @@ TOOLS=$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver
 EVOLVER_PY=$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")
 ```
 
-Use `$EVOLVER_PY` instead of `python3` for ALL tool invocations. This ensures the venv with langsmith
+Use `$EVOLVER_PY` instead of `python3` for ALL tool invocations. This ensures the venv with langsmith is used.
 
 ## Phase 1: Explore Project (automatic)
 

@@ -201,4 +201,4 @@ Next: run /evolver:evolve to start optimizing.
 - If `.evolver.json` already exists, ask before overwriting.
 - If the agent needs a venv, the run command should activate it: `cd {dir} && .venv/bin/python main.py`
 - If LangSmith connection fails, check API key and network.
-- The setup
+- The setup requires `langsmith` (Python SDK) and `langsmith-cli` (for evaluator agent).
package/tools/__pycache__/detect_stack.cpython-314.pyc
Binary file

package/tools/__pycache__/trace_logger.cpython-314.pyc
Binary file
package/tools/run_eval.py
CHANGED

@@ -2,7 +2,9 @@
 """Run LangSmith evaluation for a candidate in a worktree.
 
 Wraps client.evaluate() — runs the user's agent against the dataset
-with
+with code-based evaluators only (has_output, token_efficiency).
+LLM-as-judge scoring (correctness, conciseness) is handled post-hoc
+by the evolver-evaluator agent via langsmith-cli.
 
 Usage:
     python3 run_eval.py \

@@ -11,7 +13,7 @@ Usage:
         --experiment-prefix v001a \
         [--timeout 120]
 
-Requires: pip install langsmith
+Requires: pip install langsmith
 """
 
 import argparse

@@ -124,34 +126,30 @@ def make_target(entry_point, cwd):
 
 
 def load_evaluators(evaluator_keys):
-    """Load evaluators
-    from openevals.llm import create_llm_as_judge
-    from openevals.prompts import CORRECTNESS_PROMPT, CONCISENESS_PROMPT
+    """Load code-based evaluators only.
 
+    LLM-as-judge evaluators (correctness, conciseness) are handled
+    post-hoc by the evolver-evaluator agent via langsmith-cli.
+    """
     evaluators = []
+
+    # Always include has_output — verifies the agent produced something
+    def has_output_eval(inputs, outputs, **kwargs):
+        has = bool(outputs and outputs.get("output", outputs.get("answer", "")))
+        return {"key": "has_output", "score": 1.0 if has else 0.0}
+    evaluators.append(has_output_eval)
+
     for key in evaluator_keys:
-        if key == "
-
-
-            feedback_key="correctness",
-            model="openai:gpt-4.1-mini",
-            ))
-        elif key == "conciseness":
-            evaluators.append(create_llm_as_judge(
-                prompt=CONCISENESS_PROMPT,
-                feedback_key="conciseness",
-                model="openai:gpt-4.1-mini",
-            ))
-        elif key == "latency":
-            def latency_eval(inputs, outputs, **kwargs):
-                return {"key": "has_output", "score": 1.0 if outputs else 0.0}
-            evaluators.append(latency_eval)
+        if key == "latency":
+            # Latency is captured in traces, just check output exists
+            pass  # has_output already covers this
         elif key == "token_efficiency":
             def token_eval(inputs, outputs, **kwargs):
                 output_text = str(outputs.get("output", outputs.get("answer", "")))
                 score = min(1.0, 2000 / max(len(output_text), 1))
                 return {"key": "token_efficiency", "score": score}
             evaluators.append(token_eval)
+        # correctness, conciseness — skipped, handled by evaluator agent
 
     return evaluators
 

@@ -176,10 +174,16 @@ def main():
     target = make_target(config["entry_point"], args.worktree_path)
     evaluators = load_evaluators(config["evaluators"])
 
+    # Identify which evaluators need the agent (LLM-as-judge)
+    llm_evaluators = [k for k in config["evaluators"] if k in ("correctness", "conciseness")]
+    code_evaluators = [k for k in config["evaluators"] if k not in ("correctness", "conciseness")]
+
     print(f"Running evaluation: {args.experiment_prefix}")
     print(f"  Dataset: {config['dataset']}")
     print(f"  Worktree: {args.worktree_path}")
-    print(f"
+    print(f"  Code evaluators: {['has_output'] + code_evaluators}")
+    if llm_evaluators:
+        print(f"  Pending LLM evaluators (agent): {llm_evaluators}")
 
     try:
         results = client.evaluate(

@@ -192,7 +196,7 @@ def main():
 
     experiment_name = results.experiment_name
 
-    # Calculate mean score
+    # Calculate mean score from code-based evaluators only
     scores = []
     per_example = {}
     for result in results:

@@ -218,10 +222,13 @@ def main():
         "num_examples": len(per_example),
         "num_scores": len(scores),
         "per_example": per_example,
+        "pending_llm_evaluators": llm_evaluators,
     }
 
     print(json.dumps(output))
-    print(f"\
+    print(f"\nTarget runs complete: {len(per_example)} examples")
+    if llm_evaluators:
+        print(f"Awaiting evaluator agent for: {llm_evaluators}")
 
     except Exception as e:
         print(f"Evaluation failed: {e}", file=sys.stderr)
package/tools/setup.py
CHANGED

@@ -19,7 +19,7 @@ Usage:
         [--production-project my-prod-project] \
         [--evaluators correctness,conciseness]
 
-Requires: pip install langsmith
+Requires: pip install langsmith
 """
 
 import argparse

@@ -78,19 +78,38 @@ def ensure_langsmith_api_key():
 
 
 def check_dependencies():
-    """Verify langsmith
+    """Verify langsmith is installed."""
     missing = []
     try:
        import langsmith  # noqa: F401
    except ImportError:
        missing.append("langsmith")
-    try:
-        import openevals  # noqa: F401
-    except ImportError:
-        missing.append("openevals")
     return missing
 
 
+def resolve_dataset_name(client, base_name):
+    """Find an available dataset name by auto-incrementing the version suffix.
+
+    Tries base_name-eval-v1, v2, v3... until an unused name is found.
+    Returns (resolved_name, version_number).
+    """
+    existing = set()
+    try:
+        for ds in client.list_datasets():
+            existing.add(ds.name)
+    except Exception:
+        pass
+
+    for v in range(1, 100):
+        candidate = f"{base_name}-eval-v{v}"
+        if candidate not in existing:
+            return candidate, v
+
+    # Fallback: timestamp-based
+    ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
+    return f"{base_name}-eval-{ts}", 0
+
+
 def create_dataset_from_file(client, dataset_name, file_path):
     """Create a LangSmith dataset from a JSON file of inputs."""
     with open(file_path) as f:

@@ -177,17 +196,19 @@ def create_empty_dataset(client, dataset_name):
 
 
 def get_evaluators(goals, evaluator_names=None):
-    """Build evaluator list based on optimization goals.
-    from openevals.llm import create_llm_as_judge
-    from openevals.prompts import CORRECTNESS_PROMPT, CONCISENESS_PROMPT
+    """Build evaluator list based on optimization goals.
 
+    Returns only code-based evaluators. LLM-as-judge evaluators
+    (correctness, conciseness) are handled post-hoc by the
+    evolver-evaluator agent via langsmith-cli.
+    """
     evaluators = []
     evaluator_keys = []
 
-    # Map goals to
-
-        "accuracy":
-        "conciseness":
+    # Map goals to evaluator keys (LLM-based are recorded but not instantiated)
+    goal_to_key = {
+        "accuracy": "correctness",
+        "conciseness": "conciseness",
     }
 
     if evaluator_names:

@@ -195,39 +216,33 @@ def get_evaluators(goals, evaluator_names=None):
     else:
         names = []
         for goal in goals:
-            if goal in
-                names.append(
+            if goal in goal_to_key:
+                names.append(goal_to_key[goal])
         if not names:
             names = ["correctness"]  # default
 
+    # Record all evaluator keys (for config) but only instantiate code-based ones
     for name in names:
         if name in ("correctness", "accuracy"):
-            evaluators.append(create_llm_as_judge(
-                prompt=CORRECTNESS_PROMPT,
-                feedback_key="correctness",
-                model="openai:gpt-4.1-mini",
-            ))
             evaluator_keys.append("correctness")
+            # LLM-as-judge — handled by evaluator agent, not here
         elif name in ("conciseness", "brevity"):
-            evaluators.append(create_llm_as_judge(
-                prompt=CONCISENESS_PROMPT,
-                feedback_key="conciseness",
-                model="openai:gpt-4.1-mini",
-            ))
             evaluator_keys.append("conciseness")
+            # LLM-as-judge — handled by evaluator agent, not here
+
+    # Always include has_output
+    def has_output_eval(inputs, outputs, **kwargs):
+        has = bool(outputs and outputs.get("output", outputs.get("answer", "")))
+        return {"key": "has_output", "score": 1.0 if has else 0.0}
+    evaluators.append(has_output_eval)
 
     # Code-based evaluators for latency/tokens
     if "latency" in goals:
-        def latency_eval(inputs, outputs, **kwargs):
-            # Latency is captured in traces, not scored here
-            return {"key": "has_output", "score": 1.0 if outputs else 0.0}
-        evaluators.append(latency_eval)
         evaluator_keys.append("latency")
 
     if "token_efficiency" in goals:
         def token_eval(inputs, outputs, **kwargs):
             output_text = str(outputs.get("output", outputs.get("answer", "")))
-            # Penalize very long outputs (>2000 chars)
             score = min(1.0, 2000 / max(len(output_text), 1))
             return {"key": "token_efficiency", "score": score}
         evaluators.append(token_eval)

@@ -328,6 +343,7 @@ def main():
     parser.add_argument("--dataset-from-file", default=None, help="Create dataset from JSON file")
     parser.add_argument("--dataset-from-langsmith", default=None, help="Create dataset from LangSmith project")
     parser.add_argument("--production-project", default=None, help="Production LangSmith project")
+    parser.add_argument("--dataset-name", default=None, help="Explicit dataset name (skip auto-versioning)")
     parser.add_argument("--evaluators", default=None, help="Comma-separated evaluator names")
     parser.add_argument("--skip-baseline", action="store_true", help="Skip baseline evaluation")
     parser.add_argument("--output", default=".evolver.json", help="Output config path")

@@ -359,9 +375,19 @@ def main():
         sys.exit(1)
 
     project_name = f"evolver-{args.project_name}"
-    dataset_name = f"{args.project_name}-eval-v1"
     goals = [g.strip() for g in args.goals.split(",")]
 
+    # Resolve dataset name (explicit or auto-versioned)
+    if args.dataset_name:
+        dataset_name = args.dataset_name
+        print(f"Using explicit dataset name: '{dataset_name}'")
+    else:
+        dataset_name, version = resolve_dataset_name(client, args.project_name)
+        if version > 1:
+            print(f"Dataset name auto-versioned to '{dataset_name}' (v1-v{version-1} already exist)")
+        else:
+            print(f"Dataset: '{dataset_name}'")
+
     # Create dataset
     print(f"Creating dataset '{dataset_name}'...")
     if args.dataset_from_file:

@@ -386,18 +412,23 @@ def main():
     print(f"Configuring evaluators for goals: {goals}")
     evaluators, evaluator_keys = get_evaluators(goals, args.evaluators)
     print(f"  Active evaluators: {evaluator_keys}")
+    llm_evaluators = [k for k in evaluator_keys if k in ("correctness", "conciseness")]
+    if llm_evaluators:
+        print(f"  LLM evaluators (agent-based): {llm_evaluators}")
 
-    # Run baseline
+    # Run baseline (code-based evaluators only; LLM scoring done by evaluator agent)
     baseline_experiment = None
     baseline_score = 0.0
     if not args.skip_baseline and count > 0:
-        print(f"Running baseline
+        print(f"Running baseline target ({count} examples)...")
         try:
             baseline_experiment, baseline_score = run_baseline(
                 client, dataset_name, args.entry_point, evaluators,
             )
-            print(f"  Baseline score: {baseline_score:.3f}")
+            print(f"  Baseline has_output score: {baseline_score:.3f}")
             print(f"  Experiment: {baseline_experiment}")
+            if llm_evaluators:
+                print(f"  Note: LLM scoring pending — evaluator agent will run during /evolver:evolve")
         except Exception as e:
             print(f"  Baseline evaluation failed: {e}", file=sys.stderr)
             print("  Continuing with score 0.0")