harness-evolver 4.5.0 → 4.5.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,7 +1,7 @@
  {
    "name": "harness-evolver",
    "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
-   "version": "4.5.0",
+   "version": "4.5.2",
    "author": {
      "name": "Raphael Valdetaro"
    },
package/README.md CHANGED
@@ -48,8 +48,9 @@ export LANGSMITH_API_KEY="lsv2_pt_..."
  claude

  /evolver:setup # explores project, configures LangSmith
+ /evolver:health # check dataset quality (auto-corrects issues)
  /evolver:evolve # runs the optimization loop
- /evolver:status # check progress
+ /evolver:status # check progress (rich ASCII chart)
  /evolver:deploy # tag, push, finalize
  ```

@@ -64,19 +65,43 @@ claude
  </tr>
  <tr>
  <td><b>Real Code Evolution</b></td>
- <td>Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically.</td>
+ <td>Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically. Config files (.evolver.json, .env) are auto-propagated to worktrees.</td>
  </tr>
  <tr>
  <td><b>Self-Organizing Proposers</b></td>
  <td>Each iteration generates dynamic investigation lenses from failure data, architecture analysis, production traces, and evolution memory. Proposers self-organize their approach — no fixed strategies. They can self-abstain when their contribution would be redundant. Inspired by <a href="https://arxiv.org/abs/2603.28990">Dochkina (2026)</a>.</td>
  </tr>
  <tr>
+ <td><b>Rubric-Based Evaluation</b></td>
+ <td>Dataset examples support <code>expected_behavior</code> rubrics — specific criteria the judge evaluates against ("should mention null safety and Android development"), not just generic correctness. Partial scoring (0.5) for partially-met rubrics. Inspired by <a href="https://github.com/NousResearch/hermes-agent-self-evolution">Hermes Agent Self-Evolution</a>.</td>
+ </tr>
+ <tr>
+ <td><b>Constraint Gates</b></td>
+ <td>Proposals must pass hard constraints before merge: code growth ≤30%, entry point syntax valid, test suite passes. Candidates that fail are rejected and the next-best is tried. Prevents code bloat and broken merges.</td>
+ </tr>
+ <tr>
+ <td><b>Weighted Evaluators + Pareto</b></td>
+ <td>Configure <code>evaluator_weights</code> to prioritize what matters (e.g., correctness 50%, latency 30%). When candidates offer genuinely different tradeoffs, the Pareto front is reported instead of forcing a single winner.</td>
+ </tr>
+ <tr>
  <td><b>Agent-Based Evaluation</b></td>
- <td>The evaluator agent reads experiment outputs via langsmith-cli, judges correctness using the same Claude model powering the other agents, and writes scores back. No OpenAI API key or openevals dependency needed.</td>
+ <td>The evaluator agent reads experiment outputs via langsmith-cli, judges correctness using rubrics when available, and writes scores back. Judge feedback (textual comments explaining WHY scores were given) is surfaced to proposers for targeted mutations.</td>
+ </tr>
+ <tr>
+ <td><b>Canary Preflight</b></td>
+ <td>Before running the full evaluation, 1 example is tested as a canary. If the agent produces no output, evaluation stops immediately — no API quota wasted on broken agents.</td>
+ </tr>
+ <tr>
+ <td><b>Secret Detection</b></td>
+ <td>Detects 15+ secret patterns (API keys, tokens, PEM keys) in production traces and dataset examples. Secrets are filtered from <code>seed_from_traces</code> and flagged as critical issues in dataset health checks.</td>
+ </tr>
+ <tr>
+ <td><b>Evolution Chart</b></td>
+ <td>Rich ASCII visualization with ANSI colors: sparkline trend, score progression table (per-evaluator breakdown), what-changed narrative, horizontal bar chart, and code growth tracking with warnings.</td>
  </tr>
  <tr>
  <td><b>Production Traces</b></td>
- <td>Auto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization.</td>
+ <td>Auto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization. Can also mine Claude Code session history for eval data.</td>
  </tr>
  <tr>
  <td><b>Active Critic</b></td>
@@ -92,11 +117,11 @@ claude
  </tr>
  <tr>
  <td><b>Dataset Health</b></td>
- <td>Pre-flight dataset quality check: size adequacy, difficulty distribution, dead example detection, production coverage analysis, train/held-out splits. Auto-corrects issues before evolution starts.</td>
+ <td>Pre-flight dataset quality check: size adequacy, difficulty distribution, dead example detection, production coverage analysis, train/held-out splits, and secret scanning. Auto-corrects issues before evolution starts.</td>
  </tr>
  <tr>
  <td><b>Smart Gating</b></td>
- <td>Claude assesses gate conditions directly — score plateau, target reached, diminishing returns. No hardcoded thresholds. State validation ensures config hasn't diverged from LangSmith.</td>
+ <td>Claude assesses gate conditions directly — score plateau, target reached, diminishing returns. Holdout enforcement ensures final comparison uses unseen data. Baseline is re-scored with LLM-judge before the loop to prevent inflated starting scores.</td>
  </tr>
  <tr>
  <td><b>Background Mode</b></td>
@@ -111,9 +136,9 @@ claude
  | Command | What it does |
  |---|---|
  | `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |
- | `/evolver:health` | Check dataset quality (size, difficulty, coverage, splits), auto-correct issues |
+ | `/evolver:health` | Check dataset quality (size, difficulty, coverage, splits, secrets), auto-correct |
  | `/evolver:evolve` | Run the optimization loop (dynamic self-organizing proposers in worktrees) |
- | `/evolver:status` | Show progress, scores, history |
+ | `/evolver:status` | Show progress with rich ASCII evolution chart |
  | `/evolver:deploy` | Tag, push, clean up temporary files |

  ---
@@ -123,11 +148,11 @@ claude
  | Agent | Role | Color |
  |---|---|---|
  | **Proposer** | Self-organizing — investigates a data-driven lens, decides own approach, may abstain | Green |
- | **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
+ | **Evaluator** | LLM-as-judge — rubric-aware scoring via langsmith-cli, textual feedback | Yellow |
  | **Architect** | ULTRAPLAN mode — deep topology analysis with Opus model | Blue |
  | **Critic** | Active — detects gaming AND implements stricter evaluators | Red |
  | **Consolidator** | Cross-iteration memory consolidation (autoDream-inspired) | Cyan |
- | **TestGen** | Generates test inputs + adversarial injection mode | Cyan |
+ | **TestGen** | Generates test inputs with rubrics + adversarial injection mode | Cyan |

  ---

@@ -136,20 +161,22 @@ claude
  ```
  /evolver:evolve
   |
-  +- 0.5 Validate state (skeptical memory — check .evolver.json vs LangSmith)
-  +- 0.6 /evolver:health — dataset quality check + auto-correct
+  +- 0.5 Validate state (check .evolver.json vs LangSmith)
+  +- 0.6 /evolver:health — dataset quality + secret scan + auto-correct
+  +- 0.7 Baseline LLM-judge — re-score baseline with correctness if only has_output exists
   +- 1. Read state (.evolver.json + LangSmith experiments)
-  +- 1.5 Gather trace insights (cluster errors, tokens, latency)
-  +- 1.8 Analyze per-task failures (train split only — proposers don't see held-out)
+  +- 1.5 Gather trace insights + judge feedback (cluster errors, tokens, latency)
+  +- 1.8 Analyze per-task failures with judge comments (train split only)
   +- 1.8a Claude generates strategy.md + lenses.json from analysis data
   +- 1.9 Prepare shared proposer context (KV cache-optimized prefix)
   +- 2. Spawn N self-organizing proposers in parallel (each in a git worktree)
-  +- 3. Run target for each candidate (code-based evaluators)
-  +- 3.5 Spawn evaluator agent (LLM-as-judge via langsmith-cli)
-  +- 4. Compare experiments -> select winner + per-task champion
+  +- 3. Copy .evolver.json + .env to worktrees, run canary, evaluate candidates
+  +- 3.5 Spawn evaluator agent (rubric-aware LLM-as-judge via langsmith-cli)
+  +- 4. Compare experiments on held-out split -> winner + Pareto front
+  +- 4.5 Constraint gate — reject candidates that break size/tests/entry-point
   +- 5. Merge winning worktree into main branch
   +- 5.5 Regression tracking (auto-add guard examples to dataset)
-  +- 6. Report results
+  +- 6. Report results + evolution chart
   +- 6.2 Consolidator agent updates evolution memory (runs in background)
   +- 6.5 Auto-trigger Active Critic (detect + fix evaluator gaming)
   +- 7. Auto-trigger ULTRAPLAN Architect (opus model, deep analysis)
@@ -166,27 +193,31 @@ Plugin hook (SessionStart)

  Skills (markdown)
  ├── /evolver:setup → explores project, smart defaults, runs setup.py
- ├── /evolver:health → dataset quality check + auto-correct
+ ├── /evolver:health → dataset quality + secret scan + auto-correct
  ├── /evolver:evolve → orchestrates the evolution loop
- ├── /evolver:status → reads .evolver.json + LangSmith
+ ├── /evolver:status → rich ASCII evolution chart + stagnation detection
  └── /evolver:deploy → tags and pushes

  Agents (markdown)
  ├── Proposer (xN) → self-organizing, lens-driven, isolated git worktrees
- ├── Evaluator → LLM-as-judge via langsmith-cli
+ ├── Evaluator → rubric-aware LLM-as-judge via langsmith-cli
  ├── Critic → detects gaming + implements stricter evaluators
  ├── Architect → ULTRAPLAN deep analysis (opus model)
  ├── Consolidator → cross-iteration memory (autoDream-inspired)
- └── TestGen → generates test inputs + adversarial injection
+ └── TestGen → generates test inputs with rubrics + adversarial injection

- Tools (Python + langsmith SDK)
- ├── setup.py → creates datasets, configures evaluators
- ├── run_eval.py → runs target against dataset
- ├── read_results.py → compares experiments
+ Tools (Python)
+ ├── setup.py → creates datasets, configures evaluators + weights
+ ├── run_eval.py → runs target against dataset (canary preflight, {input_text})
+ ├── read_results.py → weighted scoring, Pareto front, judge feedback
  ├── trace_insights.py → clusters errors from traces
- ├── seed_from_traces.py → imports production traces
+ ├── seed_from_traces.py → imports production traces (secret-filtered)
+ ├── evolution_chart.py → rich ASCII chart (stdlib-only)
+ ├── constraint_check.py → validates proposals (growth, syntax, tests) (stdlib-only)
+ ├── secret_filter.py → detects 15+ secret patterns (stdlib-only)
+ ├── mine_sessions.py → extracts eval data from Claude Code history (stdlib-only)
+ ├── dataset_health.py → dataset quality diagnostic + secret scanning
  ├── validate_state.py → validates config vs LangSmith state
- ├── dataset_health.py → dataset quality diagnostic (size, difficulty, coverage, splits)
  ├── regression_tracker.py → tracks regressions, adds guard examples
  ├── add_evaluator.py → programmatically adds evaluators
  └── adversarial_inject.py → detects memorization, injects adversarial tests
@@ -194,6 +225,27 @@ Tools (Python + langsmith SDK)

  ---

+ ## Entry Point Placeholders
+
+ When configuring your agent's entry point during setup, use the placeholder that matches how your agent takes input:
+
+ | Placeholder | Behavior | Use when |
+ |---|---|---|
+ | `{input_text}` | Extracts plain text, shell-escapes it | Agent takes `--query "text"` or positional args |
+ | `{input}` | Passes path to a JSON file | Agent reads structured JSON from file |
+ | `{input_json}` | Passes raw JSON string inline | Agent parses JSON from command line |
+
+ **Example:**
+ ```bash
+ # Agent that takes a query as text:
+ python agent.py --query {input_text}
+
+ # Agent that reads a JSON file:
+ python agent.py {input}
+ ```
+
+ ---
+
  ## Requirements

  - **LangSmith account** + `LANGSMITH_API_KEY`
@@ -223,6 +275,7 @@ LangSmith traces **any** AI framework. The evolver works with all of them:

  - [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
  - [Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures](https://arxiv.org/abs/2603.28990) — Dochkina, 2026
+ - [Hermes Agent Self-Evolution](https://github.com/NousResearch/hermes-agent-self-evolution) — NousResearch (rubric-based eval, constraint gates)
  - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI
  - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
  - [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation) — LangChain
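
The Weighted Evaluators + Pareto feature above is described only in prose, and read_results.py itself is not shown in this diff. A minimal stdlib sketch of what weighted combination and Pareto-front reporting can look like (function names and the score layout are illustrative assumptions, not the package's actual API):

```python
def combined_score(scores, weights):
    """Weighted mean of per-evaluator scores; missing weights count as zero,
    and an empty weights dict falls back to an unweighted mean."""
    if not weights:
        weights = {k: 1.0 for k in scores}
    total = sum(weights.get(k, 0.0) for k in scores)
    return sum(scores[k] * weights.get(k, 0.0) for k in scores) / total if total else 0.0


def pareto_front(candidates):
    """candidates maps name -> {evaluator: score}. A candidate is on the front
    unless some other candidate is >= on every evaluator and > on at least one."""
    front = []
    for name, s in candidates.items():
        dominated = any(
            all(o[k] >= s[k] for k in s) and any(o[k] > s[k] for k in s)
            for other, o in candidates.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front
```

Forcing a single winner would hide that two candidates trade correctness against latency; reporting the front keeps both visible, which matches the behavior the table describes.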
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "harness-evolver",
-   "version": "4.5.0",
+   "version": "4.5.2",
    "description": "LangSmith-native autonomous agent optimization for Claude Code",
    "author": "Raphael Valdetaro",
    "license": "MIT",
@@ -133,6 +133,61 @@ If critical issues found, ask user whether to continue or fix first via AskUserQ

  Invoke `/evolver:health` to check and auto-correct dataset issues. If health_report.json shows critical issues that couldn't be auto-corrected, ask user whether to proceed via AskUserQuestion.

+ ### 0.7. Ensure Baseline Has LLM-Judge Scores
+
+ The baseline experiment (from setup) only runs code-based evaluators (has_output, token_efficiency). Without LLM-judge scores, the baseline score is inflated — any agent that produces text gets 1.0, making gate checks stop evolution prematurely.
+
+ Check if LLM evaluators are configured and the baseline needs scoring:
+
+ ```bash
+ LLM_EVALS=$(python3 -c "import json; c=json.load(open('.evolver.json')); llm=[k for k in c['evaluators'] if k in ('correctness','conciseness')]; print(','.join(llm) if llm else '')")
+ BASELINE=$(python3 -c "import json; print(json.load(open('.evolver.json')).get('baseline_experiment', ''))")
+ ```
+
+ If `LLM_EVALS` is non-empty and `BASELINE` exists, check if LLM scores already exist:
+
+ ```bash
+ HAS_LLM_SCORES=$($EVOLVER_PY $TOOLS/read_results.py --experiment "$BASELINE" --config .evolver.json 2>/dev/null | python3 -c "
+ import sys, json
+ try:
+     r = json.load(sys.stdin)
+     scored_keys = set()
+     for ex in r.get('per_example', {}).values():
+         scored_keys.update(ex.get('scores', {}).keys())
+     llm_keys = set('correctness,conciseness'.split(','))
+     configured = set(k for k in llm_keys if k in '$LLM_EVALS'.split(','))
+     print('yes' if configured.issubset(scored_keys) else 'no')
+ except: print('no')
+ ")
+ ```
+
+ If `HAS_LLM_SCORES` is "no", trigger the evaluator agent on the baseline:
+
+ ```
+ Agent(
+   subagent_type: "evolver-evaluator",
+   description: "Score baseline with LLM-judge",
+   prompt: "Experiments to evaluate: {baseline_experiment}. Evaluators: {llm_evaluator_list}. Framework: {framework}. Entry point: {entry_point}. Dataset: {dataset_name}. NOTE: This is the baseline — score it fairly so evolution has a meaningful starting point. Some examples have expected_behavior rubrics in their metadata — fetch example metadata and use rubrics for scoring when available."
+ )
+ ```
+
+ After the evaluator completes, re-read the baseline score and update `.evolver.json`:
+
+ ```bash
+ $EVOLVER_PY $TOOLS/read_results.py --experiment "$BASELINE" --config .evolver.json --output best_results.json 2>/dev/null
+ python3 -c "
+ import json
+ br = json.load(open('best_results.json'))
+ c = json.load(open('.evolver.json'))
+ new_score = br.get('combined_score', c['best_score'])
+ c['best_score'] = new_score
+ if c.get('history'):
+     c['history'][0]['score'] = new_score
+ json.dump(c, open('.evolver.json', 'w'), indent=2)
+ print(f'Baseline re-scored with LLM-judge: {new_score:.3f}')
+ "
+ ```
+
  ### 0.8. Resolve Project Directory

  If the project is in a subdirectory of the git repo (e.g., `playground/react-agent/`), worktrees replicate the full repo structure. Read `project_dir` from `.evolver.json` to resolve paths correctly:
@@ -340,10 +395,22 @@ Only run evaluation (Step 3) for proposers that committed changes (not abstained

  ### 3. Run Target for Each Candidate (Parallel)

- Run evaluations for ALL candidates simultaneously they're independent:
+ First, copy config files into each worktree (untracked files aren't replicated by git — this was the #1 bug in all real-world runs):
+
+ ```bash
+ for WORKTREE in {worktree_paths_with_commits}; do
+   WORKTREE_PROJECT="$WORKTREE"
+   [ -n "$PROJECT_DIR" ] && WORKTREE_PROJECT="$WORKTREE/$PROJECT_DIR"
+
+   # Copy untracked config files needed by run_eval.py and the agent
+   cp .evolver.json "$WORKTREE_PROJECT/.evolver.json" 2>/dev/null
+   [ -f .env ] && cp .env "$WORKTREE_PROJECT/.env" 2>/dev/null
+ done
+ ```
+
+ Then run evaluations for ALL candidates simultaneously:

  ```bash
- # Launch all evaluations in parallel
  for WORKTREE in {worktree_paths_with_commits}; do
    WORKTREE_PROJECT="$WORKTREE"
    [ -n "$PROJECT_DIR" ] && WORKTREE_PROJECT="$WORKTREE/$PROJECT_DIR"
@@ -39,26 +39,39 @@ CODE_EVALUATOR_TEMPLATES = {


  def add_evaluator(config_path, evaluator_name, eval_type, pattern=None):
-     """Add evaluator to config."""
+     """Add evaluator to config using partial update to avoid race conditions.
+
+     Re-reads the config immediately before writing to minimize the window
+     where concurrent updates (e.g., main loop updating best_score) could
+     be lost. Only modifies 'evaluators' and 'code_evaluators' fields.
+     """
+     # First read to check if evaluator already exists
      with open(config_path) as f:
          config = json.load(f)

-     evaluators = config.get("evaluators", [])
-
-     if evaluator_name in evaluators:
+     if evaluator_name in config.get("evaluators", []):
          print(f"Evaluator '{evaluator_name}' already exists", file=sys.stderr)
          return False

-     evaluators.append(evaluator_name)
-     config["evaluators"] = evaluators
-
+     # Prepare what we need to add
+     new_code_eval = None
      if eval_type == "code" and pattern:
-         code_evals = config.get("code_evaluators", {})
-         code_evals[evaluator_name] = {"pattern": pattern, "type": "regex"}
-         config["code_evaluators"] = code_evals
+         new_code_eval = {"pattern": pattern, "type": "regex"}
      elif eval_type == "code" and evaluator_name in CODE_EVALUATOR_TEMPLATES:
+         new_code_eval = CODE_EVALUATOR_TEMPLATES[evaluator_name]
+
+     # Re-read config right before write to pick up concurrent changes
+     with open(config_path) as f:
+         config = json.load(f)
+
+     evaluators = config.get("evaluators", [])
+     if evaluator_name not in evaluators:
+         evaluators.append(evaluator_name)
+         config["evaluators"] = evaluators
+
+     if new_code_eval:
          code_evals = config.get("code_evaluators", {})
-         code_evals[evaluator_name] = CODE_EVALUATOR_TEMPLATES[evaluator_name]
+         code_evals[evaluator_name] = new_code_eval
          config["code_evaluators"] = code_evals

      with open(config_path, "w") as f:
@@ -77,12 +90,16 @@ def main():
      args = parser.parse_args()

      if args.remove:
+         # Re-read right before write to avoid race conditions
          with open(args.config) as f:
              config = json.load(f)
          evaluators = config.get("evaluators", [])
          if args.evaluator in evaluators:
              evaluators.remove(args.evaluator)
              config["evaluators"] = evaluators
+             code_evals = config.get("code_evaluators", {})
+             code_evals.pop(args.evaluator, None)
+             config["code_evaluators"] = code_evals
          with open(args.config, "w") as f:
              json.dump(config, f, indent=2)
          print(f"Removed evaluator: {args.evaluator}")
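
The re-read-before-write pattern in the diff above narrows the lost-update window but cannot close it: two writers can still interleave between the final read and the write. A common further hardening, shown here only as a sketch and not something this package does, is to write to a temp file in the same directory and atomically swap it in (`update_config` and its `mutate` callback are illustrative names):

```python
import json
import os
import tempfile


def update_config(config_path, mutate):
    """Re-read config, apply mutate(config) in place, then write atomically.

    os.replace is atomic on POSIX when source and destination are on the
    same filesystem, so readers never observe a half-written JSON file.
    """
    with open(config_path) as f:
        config = json.load(f)
    mutate(config)

    # Temp file must live in the same directory for os.replace to be atomic.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(config_path)))
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(config, f, indent=2)
        os.replace(tmp, config_path)
    except BaseException:
        os.unlink(tmp)
        raise
```

This removes torn writes entirely; fully serializing concurrent writers would additionally need file locking, which is beyond this sketch.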
@@ -80,7 +80,30 @@ def check_entry_point(worktree_path, entry_point):
      return {"pass": True, "reason": "entry point exists and has valid syntax"}


- def check_tests(worktree_path):
+ def find_project_python(worktree_path, config=None):
+     """Find the project's Python interpreter (venv > entry_point > system).
+
+     Checks for venv in the worktree, then extracts from entry_point config,
+     then falls back to system python3.
+     """
+     # Check for venv in worktree
+     for venv_dir in [".venv", "venv"]:
+         venv_python = os.path.join(worktree_path, venv_dir, "bin", "python")
+         if os.path.isfile(venv_python):
+             return venv_python
+
+     # Extract from entry_point in config
+     if config:
+         entry = config.get("entry_point", "")
+         for part in entry.split():
+             if part.endswith("/python") or part.endswith("/python3"):
+                 if os.path.isfile(part):
+                     return part
+
+     return "python3"
+
+
+ def check_tests(worktree_path, config=None):
      """Run test suite if it exists. Returns pass if no tests found."""
      test_dirs = ["tests", "test"]
      has_tests = False
@@ -95,9 +118,11 @@
      if not has_tests:
          return {"pass": True, "reason": "no test suite found (skipped)", "skipped": True}

+     python = find_project_python(worktree_path, config)
+
      try:
          result = subprocess.run(
-             ["python3", "-m", "pytest", "-q", "--tb=no"],
+             [python, "-m", "pytest", "-q", "--tb=no"],
              capture_output=True, text=True,
              cwd=worktree_path, timeout=120,
          )
@@ -135,7 +160,7 @@ def main():
              args.max_growth,
          ),
          "entry_point": check_entry_point(args.worktree_path, ep_for_check),
-         "tests": check_tests(args.worktree_path),
+         "tests": check_tests(args.worktree_path, config),
      }

      all_pass = all(r["pass"] for r in results.values())
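
The growth gate that `args.max_growth` feeds is not part of this hunk. A minimal sketch of what a ≤30% code-growth constraint can look like, with the function name, result shape, and the bytes-of-`.py`-files metric all being illustrative assumptions rather than constraint_check.py's actual implementation:

```python
import os


def check_growth(base_bytes, worktree_path, max_growth=0.30, exts=(".py",)):
    """Reject candidates whose source tree grew more than max_growth.

    base_bytes is the pre-mutation source size; the candidate's size is
    re-measured by walking the worktree and summing matching files.
    """
    new_bytes = 0
    for root, _dirs, files in os.walk(worktree_path):
        for name in files:
            if name.endswith(exts):
                new_bytes += os.path.getsize(os.path.join(root, name))
    growth = (new_bytes - base_bytes) / base_bytes if base_bytes else 0.0
    ok = growth <= max_growth
    return {"pass": ok, "reason": f"code growth {growth:+.0%} (limit {max_growth:.0%})"}
```

A byte-count metric is crude but cheap and hard to game accidentally; the point of the gate is simply to stop proposers from "improving" scores by piling on code.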
@@ -73,7 +73,16 @@ def render_header(config, history, scores, c):
      project = config.get('project', 'unknown')
      dataset = config.get('dataset', 'unknown')
      evals = config.get('evaluators', [])
-     total = history[0].get('total', config.get('num_examples', '?'))
+     # Find example count from multiple sources
+     total = history[0].get('total')
+     if not total:
+         # Check any history entry that has it
+         for h in history:
+             if h.get('total'):
+                 total = h['total']
+                 break
+     if not total:
+         total = config.get('num_examples', '?')
      base_score = scores[0]
      best_score = max(scores)
      iters = len(history) - 1
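
The sparkline trend advertised by the Evolution Chart feature is not shown in this hunk. A stdlib-only sketch of one way to render it (the function name is illustrative; evolution_chart.py's actual implementation may differ):

```python
def sparkline(scores):
    """Render a list of scores as a unicode sparkline, e.g. for score history."""
    blocks = "▁▂▃▄▅▆▇█"
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid divide-by-zero on a flat trend
    return "".join(blocks[int((s - lo) / span * (len(blocks) - 1))] for s in scores)
```

Normalizing against the min/max of the visible window makes small score movements legible even when absolute scores cluster near 1.0.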
package/tools/run_eval.py CHANGED
@@ -72,7 +72,20 @@ def make_target(entry_point, cwd):

      try:
          cmd = entry_point
-         if "{input}" in cmd:
+
+         # {input_text}: extract plain text from inputs dict (for agents expecting --query "text")
+         if "{input_text}" in cmd:
+             import shlex
+             text = ""
+             for key in ("input", "question", "query", "prompt", "text", "user_input"):
+                 if key in inputs and isinstance(inputs[key], str):
+                     text = inputs[key]
+                     break
+             if not text and inputs:
+                 first_val = next(iter(inputs.values()), "")
+                 text = str(first_val) if not isinstance(first_val, str) else first_val
+             cmd = cmd.replace("{input_text}", shlex.quote(text))
+         elif "{input}" in cmd:
              # Placeholder: replace with path to JSON file
              cmd = cmd.replace("{input}", input_path)
          elif "{input_json}" in cmd:
@@ -167,6 +180,7 @@ def main():
      parser.add_argument("--experiment-prefix", required=True, help="Experiment name prefix (e.g. v001a)")
      parser.add_argument("--timeout", type=int, default=120, help="Per-task timeout in seconds")
      parser.add_argument("--concurrency", type=int, default=None, help="Max concurrent evaluations (default: from config or 1)")
+     parser.add_argument("--no-canary", action="store_true", help="Skip canary preflight check")
      args = parser.parse_args()

      with open(args.config) as f:
@@ -187,6 +201,32 @@
      llm_evaluators = [k for k in config["evaluators"] if k in ("correctness", "conciseness")]
      code_evaluators = [k for k in config["evaluators"] if k not in ("correctness", "conciseness")]

+     # Canary run: verify agent works before burning through full dataset
+     if not args.no_canary:
+         print(" Canary: running 1 example preflight...", file=sys.stderr)
+         try:
+             canary_examples = list(client.list_examples(dataset_name=config["dataset"], limit=1))
+             if canary_examples:
+                 canary_result = target(canary_examples[0].inputs)
+                 canary_output = canary_result.get("output", "")
+                 canary_error = canary_result.get("error", "")
+                 if not canary_output and canary_error:
+                     print(f" CANARY FAILED: Agent produced no output.", file=sys.stderr)
+                     print(f" Error: {canary_error}", file=sys.stderr)
+                     print(f" Fix the agent before running full evaluation.", file=sys.stderr)
+                     output = {
+                         "experiment": None,
+                         "prefix": args.experiment_prefix,
+                         "combined_score": 0.0,
+                         "error": f"Canary failed: {canary_error[:200]}",
+                     }
+                     print(json.dumps(output))
+                     sys.exit(2)
+                 else:
+                     print(f" Canary passed: got output ({len(str(canary_output))} chars)", file=sys.stderr)
+         except Exception as e:
+             print(f" Canary check failed: {e} (proceeding anyway)", file=sys.stderr)
+
      print(f"Running evaluation: {args.experiment_prefix}")
      print(f" Dataset: {config['dataset']}")
      print(f" Worktree: {args.worktree_path}")
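
As a standalone illustration of the `{input_text}` path added above, the same extract-then-quote logic can be exercised outside run_eval.py. The helper name is illustrative; in the package this logic lives inline in `make_target`:

```python
import shlex


def fill_input_text(cmd, inputs):
    """Mirror the {input_text} substitution: pick the first known text field
    from the inputs dict and shell-escape it before splicing into the command."""
    text = ""
    for key in ("input", "question", "query", "prompt", "text", "user_input"):
        if key in inputs and isinstance(inputs[key], str):
            text = inputs[key]
            break
    if not text and inputs:
        first_val = next(iter(inputs.values()), "")
        text = first_val if isinstance(first_val, str) else str(first_val)
    return cmd.replace("{input_text}", shlex.quote(text))
```

`shlex.quote` is what makes this safe to pass through a shell: inputs containing spaces, quotes, or `$` arrive at the agent as a single literal argument instead of being re-parsed.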