harness-evolver 4.5.0 → 4.5.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,7 +1,7 @@
  {
    "name": "harness-evolver",
    "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
-   "version": "4.5.0",
+   "version": "4.5.2",
    "author": {
      "name": "Raphael Valdetaro"
    },
package/README.md CHANGED
@@ -48,8 +48,9 @@ export LANGSMITH_API_KEY="lsv2_pt_..."
  claude

  /evolver:setup # explores project, configures LangSmith
+ /evolver:health # check dataset quality (auto-corrects issues)
  /evolver:evolve # runs the optimization loop
- /evolver:status # check progress
+ /evolver:status # check progress (rich ASCII chart)
  /evolver:deploy # tag, push, finalize
  ```

@@ -64,19 +65,43 @@ claude
  </tr>
  <tr>
  <td><b>Real Code Evolution</b></td>
- <td>Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically.</td>
+ <td>Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically. Config files (.evolver.json, .env) are auto-propagated to worktrees.</td>
  </tr>
  <tr>
  <td><b>Self-Organizing Proposers</b></td>
  <td>Each iteration generates dynamic investigation lenses from failure data, architecture analysis, production traces, and evolution memory. Proposers self-organize their approach — no fixed strategies. They can self-abstain when their contribution would be redundant. Inspired by <a href="https://arxiv.org/abs/2603.28990">Dochkina (2026)</a>.</td>
  </tr>
  <tr>
+ <td><b>Rubric-Based Evaluation</b></td>
+ <td>Dataset examples support <code>expected_behavior</code> rubrics — specific criteria the judge evaluates against ("should mention null safety and Android development"), not just generic correctness. Partial scoring (0.5) for partially-met rubrics. Inspired by <a href="https://github.com/NousResearch/hermes-agent-self-evolution">Hermes Agent Self-Evolution</a>.</td>
+ </tr>
+ <tr>
+ <td><b>Constraint Gates</b></td>
+ <td>Proposals must pass hard constraints before merge: code growth ≤30%, entry point syntax valid, test suite passes. Candidates that fail are rejected and the next-best is tried. Prevents code bloat and broken merges.</td>
+ </tr>
+ <tr>
+ <td><b>Weighted Evaluators + Pareto</b></td>
+ <td>Configure <code>evaluator_weights</code> to prioritize what matters (e.g., correctness 50%, latency 30%). When candidates offer genuinely different tradeoffs, the Pareto front is reported instead of forcing a single winner.</td>
+ </tr>
+ <tr>
  <td><b>Agent-Based Evaluation</b></td>
- <td>The evaluator agent reads experiment outputs via langsmith-cli, judges correctness using the same Claude model powering the other agents, and writes scores back. No OpenAI API key or openevals dependency needed.</td>
+ <td>The evaluator agent reads experiment outputs via langsmith-cli, judges correctness using rubrics when available, and writes scores back. Judge feedback (textual comments explaining WHY scores were given) is surfaced to proposers for targeted mutations.</td>
+ </tr>
+ <tr>
+ <td><b>Canary Preflight</b></td>
+ <td>Before running the full evaluation, 1 example is tested as a canary. If the agent produces no output, evaluation stops immediately — no API quota wasted on broken agents.</td>
+ </tr>
+ <tr>
+ <td><b>Secret Detection</b></td>
+ <td>Detects 15+ secret patterns (API keys, tokens, PEM keys) in production traces and dataset examples. Secrets are filtered from <code>seed_from_traces</code> and flagged as critical issues in dataset health checks.</td>
+ </tr>
+ <tr>
+ <td><b>Evolution Chart</b></td>
+ <td>Rich ASCII visualization with ANSI colors: sparkline trend, score progression table (per-evaluator breakdown), what-changed narrative, horizontal bar chart, and code growth tracking with warnings.</td>
  </tr>
  <tr>
  <td><b>Production Traces</b></td>
- <td>Auto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization.</td>
+ <td>Auto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization. Can also mine Claude Code session history for eval data.</td>
  </tr>
  <tr>
  <td><b>Active Critic</b></td>
@@ -92,11 +117,11 @@ claude
  </tr>
  <tr>
  <td><b>Dataset Health</b></td>
- <td>Pre-flight dataset quality check: size adequacy, difficulty distribution, dead example detection, production coverage analysis, train/held-out splits. Auto-corrects issues before evolution starts.</td>
+ <td>Pre-flight dataset quality check: size adequacy, difficulty distribution, dead example detection, production coverage analysis, train/held-out splits, and secret scanning. Auto-corrects issues before evolution starts.</td>
  </tr>
  <tr>
  <td><b>Smart Gating</b></td>
- <td>Claude assesses gate conditions directly — score plateau, target reached, diminishing returns. No hardcoded thresholds. State validation ensures config hasn't diverged from LangSmith.</td>
+ <td>Claude assesses gate conditions directly — score plateau, target reached, diminishing returns. Holdout enforcement ensures final comparison uses unseen data. Baseline is re-scored with LLM-judge before the loop to prevent inflated starting scores.</td>
  </tr>
  <tr>
  <td><b>Background Mode</b></td>
@@ -111,9 +136,9 @@ claude
  | Command | What it does |
  |---|---|
  | `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |
- | `/evolver:health` | Check dataset quality (size, difficulty, coverage, splits), auto-correct issues |
+ | `/evolver:health` | Check dataset quality (size, difficulty, coverage, splits, secrets), auto-correct |
  | `/evolver:evolve` | Run the optimization loop (dynamic self-organizing proposers in worktrees) |
- | `/evolver:status` | Show progress, scores, history |
+ | `/evolver:status` | Show progress with rich ASCII evolution chart |
  | `/evolver:deploy` | Tag, push, clean up temporary files |

  ---
@@ -123,11 +148,11 @@ claude
  | Agent | Role | Color |
  |---|---|---|
  | **Proposer** | Self-organizing — investigates a data-driven lens, decides own approach, may abstain | Green |
- | **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
+ | **Evaluator** | LLM-as-judge — rubric-aware scoring via langsmith-cli, textual feedback | Yellow |
  | **Architect** | ULTRAPLAN mode — deep topology analysis with Opus model | Blue |
  | **Critic** | Active — detects gaming AND implements stricter evaluators | Red |
  | **Consolidator** | Cross-iteration memory consolidation (autoDream-inspired) | Cyan |
- | **TestGen** | Generates test inputs + adversarial injection mode | Cyan |
+ | **TestGen** | Generates test inputs with rubrics + adversarial injection mode | Cyan |

  ---

@@ -136,20 +161,22 @@ claude
  ```
  /evolver:evolve
   |
-  +- 0.5 Validate state (skeptical memory — check .evolver.json vs LangSmith)
-  +- 0.6 /evolver:health — dataset quality check + auto-correct
+  +- 0.5 Validate state (check .evolver.json vs LangSmith)
+  +- 0.6 /evolver:health — dataset quality + secret scan + auto-correct
+  +- 0.7 Baseline LLM-judge — re-score baseline with correctness if only has_output exists
   +- 1. Read state (.evolver.json + LangSmith experiments)
-  +- 1.5 Gather trace insights (cluster errors, tokens, latency)
-  +- 1.8 Analyze per-task failures (train split only — proposers don't see held-out)
+  +- 1.5 Gather trace insights + judge feedback (cluster errors, tokens, latency)
+  +- 1.8 Analyze per-task failures with judge comments (train split only)
   +- 1.8a Claude generates strategy.md + lenses.json from analysis data
   +- 1.9 Prepare shared proposer context (KV cache-optimized prefix)
   +- 2. Spawn N self-organizing proposers in parallel (each in a git worktree)
-  +- 3. Run target for each candidate (code-based evaluators)
-  +- 3.5 Spawn evaluator agent (LLM-as-judge via langsmith-cli)
-  +- 4. Compare experiments -> select winner + per-task champion
+  +- 3. Copy .evolver.json + .env to worktrees, run canary, evaluate candidates
+  +- 3.5 Spawn evaluator agent (rubric-aware LLM-as-judge via langsmith-cli)
+  +- 4. Compare experiments on held-out split -> winner + Pareto front
+  +- 4.5 Constraint gate — reject candidates that break size/tests/entry-point
   +- 5. Merge winning worktree into main branch
   +- 5.5 Regression tracking (auto-add guard examples to dataset)
-  +- 6. Report results
+  +- 6. Report results + evolution chart
   +- 6.2 Consolidator agent updates evolution memory (runs in background)
   +- 6.5 Auto-trigger Active Critic (detect + fix evaluator gaming)
   +- 7. Auto-trigger ULTRAPLAN Architect (opus model, deep analysis)
@@ -166,27 +193,31 @@ Plugin hook (SessionStart)

  Skills (markdown)
  ├── /evolver:setup → explores project, smart defaults, runs setup.py
- ├── /evolver:health → dataset quality check + auto-correct
+ ├── /evolver:health → dataset quality + secret scan + auto-correct
  ├── /evolver:evolve → orchestrates the evolution loop
- ├── /evolver:status → reads .evolver.json + LangSmith
+ ├── /evolver:status → rich ASCII evolution chart + stagnation detection
  └── /evolver:deploy → tags and pushes

  Agents (markdown)
  ├── Proposer (xN) → self-organizing, lens-driven, isolated git worktrees
- ├── Evaluator → LLM-as-judge via langsmith-cli
+ ├── Evaluator → rubric-aware LLM-as-judge via langsmith-cli
  ├── Critic → detects gaming + implements stricter evaluators
  ├── Architect → ULTRAPLAN deep analysis (opus model)
  ├── Consolidator → cross-iteration memory (autoDream-inspired)
- └── TestGen → generates test inputs + adversarial injection
+ └── TestGen → generates test inputs with rubrics + adversarial injection

- Tools (Python + langsmith SDK)
- ├── setup.py → creates datasets, configures evaluators
- ├── run_eval.py → runs target against dataset
- ├── read_results.py → compares experiments
+ Tools (Python)
+ ├── setup.py → creates datasets, configures evaluators + weights
+ ├── run_eval.py → runs target against dataset (canary preflight, {input_text})
+ ├── read_results.py → weighted scoring, Pareto front, judge feedback
  ├── trace_insights.py → clusters errors from traces
- ├── seed_from_traces.py → imports production traces
+ ├── seed_from_traces.py → imports production traces (secret-filtered)
+ ├── evolution_chart.py → rich ASCII chart (stdlib-only)
+ ├── constraint_check.py → validates proposals (growth, syntax, tests) (stdlib-only)
+ ├── secret_filter.py → detects 15+ secret patterns (stdlib-only)
+ ├── mine_sessions.py → extracts eval data from Claude Code history (stdlib-only)
+ ├── dataset_health.py → dataset quality diagnostic + secret scanning
  ├── validate_state.py → validates config vs LangSmith state
- ├── dataset_health.py → dataset quality diagnostic (size, difficulty, coverage, splits)
  ├── regression_tracker.py → tracks regressions, adds guard examples
  ├── add_evaluator.py → programmatically adds evaluators
  └── adversarial_inject.py → detects memorization, injects adversarial tests
@@ -194,6 +225,27 @@ Tools (Python + langsmith SDK)

  ---

+ ## Entry Point Placeholders
+
+ When configuring your agent's entry point during setup, use the placeholder that matches how your agent takes input:
+
+ | Placeholder | Behavior | Use when |
+ |---|---|---|
+ | `{input_text}` | Extracts plain text, shell-escapes it | Agent takes `--query "text"` or positional args |
+ | `{input}` | Passes path to a JSON file | Agent reads structured JSON from file |
+ | `{input_json}` | Passes raw JSON string inline | Agent parses JSON from command line |
+
+ **Example:**
+ ```bash
+ # Agent that takes a query as text:
+ python agent.py --query {input_text}
+
+ # Agent that reads a JSON file:
+ python agent.py {input}
+ ```
+
+ ---
+
  ## Requirements

  - **LangSmith account** + `LANGSMITH_API_KEY`
@@ -223,6 +275,7 @@ LangSmith traces **any** AI framework. The evolver works with all of them:

  - [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
  - [Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures](https://arxiv.org/abs/2603.28990) — Dochkina, 2026
+ - [Hermes Agent Self-Evolution](https://github.com/NousResearch/hermes-agent-self-evolution) — NousResearch (rubric-based eval, constraint gates)
  - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI
  - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
  - [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation) — LangChain
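
The Weighted Evaluators + Pareto feature above is described only in prose, and read_results.py itself is not shown in this diff. A minimal stdlib sketch of what weighted combination and Pareto-front reporting can look like (function names and the score layout are illustrative assumptions, not the package's actual API):

```python
def combined_score(scores, weights):
    """Weighted mean of per-evaluator scores; missing weights count as zero,
    and an empty weights dict falls back to an unweighted mean."""
    if not weights:
        weights = {k: 1.0 for k in scores}
    total = sum(weights.get(k, 0.0) for k in scores)
    return sum(scores[k] * weights.get(k, 0.0) for k in scores) / total if total else 0.0


def pareto_front(candidates):
    """candidates maps name -> {evaluator: score}. A candidate is on the front
    unless some other candidate is >= on every evaluator and > on at least one."""
    front = []
    for name, s in candidates.items():
        dominated = any(
            all(o[k] >= s[k] for k in s) and any(o[k] > s[k] for k in s)
            for other, o in candidates.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front
```

Forcing a single winner would hide that two candidates trade correctness against latency; reporting the front keeps both visible, which matches the behavior the table describes.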
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "harness-evolver",
-   "version": "4.5.0",
+   "version": "4.5.2",
    "description": "LangSmith-native autonomous agent optimization for Claude Code",
    "author": "Raphael Valdetaro",
    "license": "MIT",
@@ -133,6 +133,61 @@ If critical issues found, ask user whether to continue or fix first via AskUserQ

  Invoke `/evolver:health` to check and auto-correct dataset issues. If health_report.json shows critical issues that couldn't be auto-corrected, ask user whether to proceed via AskUserQuestion.

+ ### 0.7. Ensure Baseline Has LLM-Judge Scores
+
+ The baseline experiment (from setup) only runs code-based evaluators (has_output, token_efficiency). Without LLM-judge scores, the baseline score is inflated — any agent that produces text gets 1.0, making gate checks stop evolution prematurely.
+
+ Check if LLM evaluators are configured and the baseline needs scoring:
+
+ ```bash
+ LLM_EVALS=$(python3 -c "import json; c=json.load(open('.evolver.json')); llm=[k for k in c['evaluators'] if k in ('correctness','conciseness')]; print(','.join(llm) if llm else '')")
+ BASELINE=$(python3 -c "import json; print(json.load(open('.evolver.json')).get('baseline_experiment', ''))")
+ ```
+
+ If `LLM_EVALS` is non-empty and `BASELINE` exists, check if LLM scores already exist:
+
+ ```bash
+ HAS_LLM_SCORES=$($EVOLVER_PY $TOOLS/read_results.py --experiment "$BASELINE" --config .evolver.json 2>/dev/null | python3 -c "
+ import sys, json
+ try:
+     r = json.load(sys.stdin)
+     scored_keys = set()
+     for ex in r.get('per_example', {}).values():
+         scored_keys.update(ex.get('scores', {}).keys())
+     llm_keys = set('correctness,conciseness'.split(','))
+     configured = set(k for k in llm_keys if k in '$LLM_EVALS'.split(','))
+     print('yes' if configured.issubset(scored_keys) else 'no')
+ except: print('no')
+ ")
+ ```
+
+ If `HAS_LLM_SCORES` is "no", trigger the evaluator agent on the baseline:
+
+ ```
+ Agent(
+   subagent_type: "evolver-evaluator",
+   description: "Score baseline with LLM-judge",
+   prompt: "Experiments to evaluate: {baseline_experiment}. Evaluators: {llm_evaluator_list}. Framework: {framework}. Entry point: {entry_point}. Dataset: {dataset_name}. NOTE: This is the baseline — score it fairly so evolution has a meaningful starting point. Some examples have expected_behavior rubrics in their metadata — fetch example metadata and use rubrics for scoring when available."
+ )
+ ```
+
+ After the evaluator completes, re-read the baseline score and update `.evolver.json`:
+
+ ```bash
+ $EVOLVER_PY $TOOLS/read_results.py --experiment "$BASELINE" --config .evolver.json --output best_results.json 2>/dev/null
+ python3 -c "
+ import json
+ br = json.load(open('best_results.json'))
+ c = json.load(open('.evolver.json'))
+ new_score = br.get('combined_score', c['best_score'])
+ c['best_score'] = new_score
+ if c.get('history'):
+     c['history'][0]['score'] = new_score
+ json.dump(c, open('.evolver.json', 'w'), indent=2)
+ print(f'Baseline re-scored with LLM-judge: {new_score:.3f}')
+ "
+ ```
+
  ### 0.8. Resolve Project Directory

  If the project is in a subdirectory of the git repo (e.g., `playground/react-agent/`), worktrees replicate the full repo structure. Read `project_dir` from `.evolver.json` to resolve paths correctly:
@@ -340,10 +395,22 @@ Only run evaluation (Step 3) for proposers that committed changes (not abstained

  ### 3. Run Target for Each Candidate (Parallel)

- Run evaluations for ALL candidates simultaneously they're independent:
+ First, copy config files into each worktree (untracked files aren't replicated by git — this was the #1 bug in all real-world runs):
+
+ ```bash
+ for WORKTREE in {worktree_paths_with_commits}; do
+   WORKTREE_PROJECT="$WORKTREE"
+   [ -n "$PROJECT_DIR" ] && WORKTREE_PROJECT="$WORKTREE/$PROJECT_DIR"
+
+   # Copy untracked config files needed by run_eval.py and the agent
+   cp .evolver.json "$WORKTREE_PROJECT/.evolver.json" 2>/dev/null
+   [ -f .env ] && cp .env "$WORKTREE_PROJECT/.env" 2>/dev/null
+ done
+ ```
+
+ Then run evaluations for ALL candidates simultaneously:

  ```bash
- # Launch all evaluations in parallel
  for WORKTREE in {worktree_paths_with_commits}; do
    WORKTREE_PROJECT="$WORKTREE"
    [ -n "$PROJECT_DIR" ] && WORKTREE_PROJECT="$WORKTREE/$PROJECT_DIR"
@@ -39,26 +39,39 @@ CODE_EVALUATOR_TEMPLATES = {


  def add_evaluator(config_path, evaluator_name, eval_type, pattern=None):
-     """Add evaluator to config."""
+     """Add evaluator to config using partial update to avoid race conditions.
+
+     Re-reads the config immediately before writing to minimize the window
+     where concurrent updates (e.g., main loop updating best_score) could
+     be lost. Only modifies 'evaluators' and 'code_evaluators' fields.
+     """
+     # First read to check if evaluator already exists
      with open(config_path) as f:
          config = json.load(f)

-     evaluators = config.get("evaluators", [])
-
-     if evaluator_name in evaluators:
+     if evaluator_name in config.get("evaluators", []):
          print(f"Evaluator '{evaluator_name}' already exists", file=sys.stderr)
          return False

-     evaluators.append(evaluator_name)
-     config["evaluators"] = evaluators
-
+     # Prepare what we need to add
+     new_code_eval = None
      if eval_type == "code" and pattern:
-         code_evals = config.get("code_evaluators", {})
-         code_evals[evaluator_name] = {"pattern": pattern, "type": "regex"}
-         config["code_evaluators"] = code_evals
+         new_code_eval = {"pattern": pattern, "type": "regex"}
      elif eval_type == "code" and evaluator_name in CODE_EVALUATOR_TEMPLATES:
+         new_code_eval = CODE_EVALUATOR_TEMPLATES[evaluator_name]
+
+     # Re-read config right before write to pick up concurrent changes
+     with open(config_path) as f:
+         config = json.load(f)
+
+     evaluators = config.get("evaluators", [])
+     if evaluator_name not in evaluators:
+         evaluators.append(evaluator_name)
+         config["evaluators"] = evaluators
+
+     if new_code_eval:
          code_evals = config.get("code_evaluators", {})
-         code_evals[evaluator_name] = CODE_EVALUATOR_TEMPLATES[evaluator_name]
+         code_evals[evaluator_name] = new_code_eval
          config["code_evaluators"] = code_evals

      with open(config_path, "w") as f:
@@ -77,12 +90,16 @@ def main():
      args = parser.parse_args()

      if args.remove:
+         # Re-read right before write to avoid race conditions
          with open(args.config) as f:
              config = json.load(f)
          evaluators = config.get("evaluators", [])
          if args.evaluator in evaluators:
              evaluators.remove(args.evaluator)
              config["evaluators"] = evaluators
+             code_evals = config.get("code_evaluators", {})
+             code_evals.pop(args.evaluator, None)
+             config["code_evaluators"] = code_evals
          with open(args.config, "w") as f:
              json.dump(config, f, indent=2)
          print(f"Removed evaluator: {args.evaluator}")
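
The re-read-before-write pattern in the diff above narrows the lost-update window but cannot close it: two writers can still interleave between the final read and the write. A common further hardening, shown here only as a sketch and not something this package does, is to write to a temp file in the same directory and atomically swap it in (`update_config` and its `mutate` callback are illustrative names):

```python
import json
import os
import tempfile


def update_config(config_path, mutate):
    """Re-read config, apply mutate(config) in place, then write atomically.

    os.replace is atomic on POSIX when source and destination are on the
    same filesystem, so readers never observe a half-written JSON file.
    """
    with open(config_path) as f:
        config = json.load(f)
    mutate(config)

    # Temp file must live in the same directory for os.replace to be atomic.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(config_path)))
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(config, f, indent=2)
        os.replace(tmp, config_path)
    except BaseException:
        os.unlink(tmp)
        raise
```

This removes torn writes entirely; fully serializing concurrent writers would additionally need file locking, which is beyond this sketch.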
@@ -80,7 +80,30 @@ def check_entry_point(worktree_path, entry_point):
      return {"pass": True, "reason": "entry point exists and has valid syntax"}


- def check_tests(worktree_path):
+ def find_project_python(worktree_path, config=None):
+     """Find the project's Python interpreter (venv > entry_point > system).
+
+     Checks for venv in the worktree, then extracts from entry_point config,
+     then falls back to system python3.
+     """
+     # Check for venv in worktree
+     for venv_dir in [".venv", "venv"]:
+         venv_python = os.path.join(worktree_path, venv_dir, "bin", "python")
+         if os.path.isfile(venv_python):
+             return venv_python
+
+     # Extract from entry_point in config
+     if config:
+         entry = config.get("entry_point", "")
+         for part in entry.split():
+             if part.endswith("/python") or part.endswith("/python3"):
+                 if os.path.isfile(part):
+                     return part
+
+     return "python3"
+
+
+ def check_tests(worktree_path, config=None):
      """Run test suite if it exists. Returns pass if no tests found."""
      test_dirs = ["tests", "test"]
      has_tests = False
@@ -95,9 +118,11 @@
      if not has_tests:
          return {"pass": True, "reason": "no test suite found (skipped)", "skipped": True}

+     python = find_project_python(worktree_path, config)
+
      try:
          result = subprocess.run(
-             ["python3", "-m", "pytest", "-q", "--tb=no"],
+             [python, "-m", "pytest", "-q", "--tb=no"],
              capture_output=True, text=True,
              cwd=worktree_path, timeout=120,
          )
@@ -135,7 +160,7 @@ def main():
              args.max_growth,
          ),
          "entry_point": check_entry_point(args.worktree_path, ep_for_check),
-         "tests": check_tests(args.worktree_path),
+         "tests": check_tests(args.worktree_path, config),
      }

      all_pass = all(r["pass"] for r in results.values())
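
The growth gate that `args.max_growth` feeds is not part of this hunk. A minimal sketch of what a ≤30% code-growth constraint can look like, with the function name, result shape, and the bytes-of-`.py`-files metric all being illustrative assumptions rather than constraint_check.py's actual implementation:

```python
import os


def check_growth(base_bytes, worktree_path, max_growth=0.30, exts=(".py",)):
    """Reject candidates whose source tree grew more than max_growth.

    base_bytes is the pre-mutation source size; the candidate's size is
    re-measured by walking the worktree and summing matching files.
    """
    new_bytes = 0
    for root, _dirs, files in os.walk(worktree_path):
        for name in files:
            if name.endswith(exts):
                new_bytes += os.path.getsize(os.path.join(root, name))
    growth = (new_bytes - base_bytes) / base_bytes if base_bytes else 0.0
    ok = growth <= max_growth
    return {"pass": ok, "reason": f"code growth {growth:+.0%} (limit {max_growth:.0%})"}
```

A byte-count metric is crude but cheap and hard to game accidentally; the point of the gate is simply to stop proposers from "improving" scores by piling on code.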
@@ -73,7 +73,16 @@ def render_header(config, history, scores, c):
      project = config.get('project', 'unknown')
      dataset = config.get('dataset', 'unknown')
      evals = config.get('evaluators', [])
-     total = history[0].get('total', config.get('num_examples', '?'))
+     # Find example count from multiple sources
+     total = history[0].get('total')
+     if not total:
+         # Check any history entry that has it
+         for h in history:
+             if h.get('total'):
+                 total = h['total']
+                 break
+     if not total:
+         total = config.get('num_examples', '?')
      base_score = scores[0]
      best_score = max(scores)
      iters = len(history) - 1
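
The sparkline trend advertised by the Evolution Chart feature is not shown in this hunk. A stdlib-only sketch of one way to render it (the function name is illustrative; evolution_chart.py's actual implementation may differ):

```python
def sparkline(scores):
    """Render a list of scores as a unicode sparkline, e.g. for score history."""
    blocks = "▁▂▃▄▅▆▇█"
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid divide-by-zero on a flat trend
    return "".join(blocks[int((s - lo) / span * (len(blocks) - 1))] for s in scores)
```

Normalizing against the min/max of the visible window makes small score movements legible even when absolute scores cluster near 1.0.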
package/tools/run_eval.py CHANGED
@@ -72,7 +72,20 @@ def make_target(entry_point, cwd):

      try:
          cmd = entry_point
-         if "{input}" in cmd:
+
+         # {input_text}: extract plain text from inputs dict (for agents expecting --query "text")
+         if "{input_text}" in cmd:
+             import shlex
+             text = ""
+             for key in ("input", "question", "query", "prompt", "text", "user_input"):
+                 if key in inputs and isinstance(inputs[key], str):
+                     text = inputs[key]
+                     break
+             if not text and inputs:
+                 first_val = next(iter(inputs.values()), "")
+                 text = str(first_val) if not isinstance(first_val, str) else first_val
+             cmd = cmd.replace("{input_text}", shlex.quote(text))
+         elif "{input}" in cmd:
              # Placeholder: replace with path to JSON file
              cmd = cmd.replace("{input}", input_path)
          elif "{input_json}" in cmd:
@@ -167,6 +180,7 @@ def main():
      parser.add_argument("--experiment-prefix", required=True, help="Experiment name prefix (e.g. v001a)")
      parser.add_argument("--timeout", type=int, default=120, help="Per-task timeout in seconds")
      parser.add_argument("--concurrency", type=int, default=None, help="Max concurrent evaluations (default: from config or 1)")
+     parser.add_argument("--no-canary", action="store_true", help="Skip canary preflight check")
      args = parser.parse_args()

      with open(args.config) as f:
@@ -187,6 +201,32 @@
      llm_evaluators = [k for k in config["evaluators"] if k in ("correctness", "conciseness")]
      code_evaluators = [k for k in config["evaluators"] if k not in ("correctness", "conciseness")]

+     # Canary run: verify agent works before burning through full dataset
+     if not args.no_canary:
+         print(" Canary: running 1 example preflight...", file=sys.stderr)
+         try:
+             canary_examples = list(client.list_examples(dataset_name=config["dataset"], limit=1))
+             if canary_examples:
+                 canary_result = target(canary_examples[0].inputs)
+                 canary_output = canary_result.get("output", "")
+                 canary_error = canary_result.get("error", "")
+                 if not canary_output and canary_error:
+                     print(f" CANARY FAILED: Agent produced no output.", file=sys.stderr)
+                     print(f" Error: {canary_error}", file=sys.stderr)
+                     print(f" Fix the agent before running full evaluation.", file=sys.stderr)
+                     output = {
+                         "experiment": None,
+                         "prefix": args.experiment_prefix,
+                         "combined_score": 0.0,
+                         "error": f"Canary failed: {canary_error[:200]}",
+                     }
+                     print(json.dumps(output))
+                     sys.exit(2)
+                 else:
+                     print(f" Canary passed: got output ({len(str(canary_output))} chars)", file=sys.stderr)
+         except Exception as e:
+             print(f" Canary check failed: {e} (proceeding anyway)", file=sys.stderr)
+
      print(f"Running evaluation: {args.experiment_prefix}")
      print(f" Dataset: {config['dataset']}")
      print(f" Worktree: {args.worktree_path}")
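
As a standalone illustration of the `{input_text}` path added above, the same extract-then-quote logic can be exercised outside run_eval.py. The helper name is illustrative; in the package this logic lives inline in `make_target`:

```python
import shlex


def fill_input_text(cmd, inputs):
    """Mirror the {input_text} substitution: pick the first known text field
    from the inputs dict and shell-escape it before splicing into the command."""
    text = ""
    for key in ("input", "question", "query", "prompt", "text", "user_input"):
        if key in inputs and isinstance(inputs[key], str):
            text = inputs[key]
            break
    if not text and inputs:
        first_val = next(iter(inputs.values()), "")
        text = first_val if isinstance(first_val, str) else str(first_val)
    return cmd.replace("{input_text}", shlex.quote(text))
```

`shlex.quote` is what makes this safe to pass through a shell: inputs containing spaces, quotes, or `$` arrive at the agent as a single literal argument instead of being re-parsed.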