harness-evolver 4.4.0 → 4.5.1
- package/.claude-plugin/plugin.json +1 -1
- package/agents/evolver-evaluator.md +18 -1
- package/agents/evolver-testgen.md +5 -3
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +117 -6
- package/tools/constraint_check.py +154 -0
- package/tools/dataset_health.py +31 -0
- package/tools/evolution_chart.py +16 -2
- package/tools/mine_sessions.py +150 -0
- package/tools/read_results.py +110 -4
- package/tools/run_eval.py +41 -1
- package/tools/secret_filter.py +97 -0
- package/tools/seed_from_traces.py +15 -0
- package/tools/setup.py +17 -1
package/.claude-plugin/plugin.json CHANGED

````diff
@@ -1,7 +1,7 @@
 {
   "name": "harness-evolver",
   "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
-  "version": "4.4.0",
+  "version": "4.5.1",
   "author": {
     "name": "Raphael Valdetaro"
   },
````
package/agents/evolver-evaluator.md CHANGED

````diff
@@ -85,7 +85,24 @@ For each run, apply the requested evaluators. The evaluators you may be asked to
 #### correctness
 Judge: **Is the output a correct, accurate, and complete response to the input?**
 
-
+**Rubric-aware scoring:** Some dataset examples have an `expected_behavior` rubric in their metadata. Before scoring, fetch example metadata:
+
+```bash
+langsmith-cli --json examples list \
+  --dataset "{dataset_name}" \
+  --fields id,metadata \
+  --limit 200 \
+  --output example_metadata.jsonl
+```
+
+Build a map of `reference_example_id → expected_behavior`. When scoring a run whose example has a rubric, evaluate against the rubric criteria specifically.
+
+**With rubric:**
+- `1.0` — Response satisfies all criteria in the rubric
+- `0.5` — Response partially satisfies the rubric (some criteria met, others missing)
+- `0.0` — Response fails to meet the rubric criteria
+
+**Without rubric** (generic scoring):
 - `1.0` — Correct and complete. The response accurately addresses the input.
 - `0.0` — Incorrect, incomplete, or off-topic.
````
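The metadata fetch above feeds a lookup table for the judge. A minimal sketch of building that map, assuming each JSONL row carries `id` and `metadata` fields as the diff describes (the demo rows are hypothetical):

```python
import json

def build_rubric_map(jsonl_lines):
    # Map example id -> expected_behavior rubric; skip examples without one
    rubrics = {}
    for line in jsonl_lines:
        ex = json.loads(line)
        rubric = (ex.get("metadata") or {}).get("expected_behavior")
        if rubric:
            rubrics[ex["id"]] = rubric
    return rubrics

lines = [
    '{"id": "e1", "metadata": {"expected_behavior": "Should mention null safety"}}',
    '{"id": "e2", "metadata": {}}',
]
print(build_rubric_map(lines))  # {'e1': 'Should mention null safety'}
```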
package/agents/evolver-testgen.md CHANGED

````diff
@@ -37,16 +37,18 @@ Do NOT copy production inputs verbatim — generate VARIATIONS.
 
 ### Phase 3: Generate Inputs
 
-Generate 30 test inputs as a JSON file:
+Generate 30 test inputs as a JSON file. Each example MUST include an `expected_behavior` rubric — a description of what a correct response should cover (NOT exact expected text):
 
 ```json
 [
-  {"input": "
-  {"input": "
+  {"input": "What is Kotlin?", "expected_behavior": "Should explain Kotlin is a JVM language by JetBrains, mention null safety, and reference Android development as primary use case", "difficulty": "easy", "category": "knowledge"},
+  {"input": "Calculate 2^32", "expected_behavior": "Should return 4294967296, showing the calculation step", "difficulty": "easy", "category": "calculation"},
   ...
 ]
 ```
 
+The `expected_behavior` is a **rubric**, not exact text. The LLM judge uses it to score responses. Write 1-3 specific, verifiable criteria per example.
+
 Distribution:
 - **40% Standard** (12): typical, well-formed inputs
 - **20% Edge Cases** (6): boundary conditions, minimal inputs
````
package/package.json CHANGED

package/skills/evolve/SKILL.md CHANGED
````diff
@@ -133,6 +133,61 @@ If critical issues found, ask user whether to continue or fix first via AskUserQ
 
 Invoke `/evolver:health` to check and auto-correct dataset issues. If health_report.json shows critical issues that couldn't be auto-corrected, ask user whether to proceed via AskUserQuestion.
 
+### 0.7. Ensure Baseline Has LLM-Judge Scores
+
+The baseline experiment (from setup) only runs code-based evaluators (has_output, token_efficiency). Without LLM-judge scores, the baseline score is inflated — any agent that produces text gets 1.0, making gate checks stop evolution prematurely.
+
+Check if LLM evaluators are configured and the baseline needs scoring:
+
+```bash
+LLM_EVALS=$(python3 -c "import json; c=json.load(open('.evolver.json')); llm=[k for k in c['evaluators'] if k in ('correctness','conciseness')]; print(','.join(llm) if llm else '')")
+BASELINE=$(python3 -c "import json; print(json.load(open('.evolver.json')).get('baseline_experiment', ''))")
+```
+
+If `LLM_EVALS` is non-empty and `BASELINE` exists, check if LLM scores already exist:
+
+```bash
+HAS_LLM_SCORES=$($EVOLVER_PY $TOOLS/read_results.py --experiment "$BASELINE" --config .evolver.json 2>/dev/null | python3 -c "
+import sys, json
+try:
+    r = json.load(sys.stdin)
+    scored_keys = set()
+    for ex in r.get('per_example', {}).values():
+        scored_keys.update(ex.get('scores', {}).keys())
+    llm_keys = set('correctness,conciseness'.split(','))
+    configured = set(k for k in llm_keys if k in '$LLM_EVALS'.split(','))
+    print('yes' if configured.issubset(scored_keys) else 'no')
+except: print('no')
+")
+```
+
+If `HAS_LLM_SCORES` is "no", trigger the evaluator agent on the baseline:
+
+```
+Agent(
+  subagent_type: "evolver-evaluator",
+  description: "Score baseline with LLM-judge",
+  prompt: "Experiments to evaluate: {baseline_experiment}. Evaluators: {llm_evaluator_list}. Framework: {framework}. Entry point: {entry_point}. Dataset: {dataset_name}. NOTE: This is the baseline — score it fairly so evolution has a meaningful starting point. Some examples have expected_behavior rubrics in their metadata — fetch example metadata and use rubrics for scoring when available."
+)
+```
+
+After the evaluator completes, re-read the baseline score and update `.evolver.json`:
+
+```bash
+$EVOLVER_PY $TOOLS/read_results.py --experiment "$BASELINE" --config .evolver.json --output best_results.json 2>/dev/null
+python3 -c "
+import json
+br = json.load(open('best_results.json'))
+c = json.load(open('.evolver.json'))
+new_score = br.get('combined_score', c['best_score'])
+c['best_score'] = new_score
+if c.get('history'):
+    c['history'][0]['score'] = new_score
+json.dump(c, open('.evolver.json', 'w'), indent=2)
+print(f'Baseline re-scored with LLM-judge: {new_score:.3f}')
+"
+```
+
 ### 0.8. Resolve Project Directory
 
 If the project is in a subdirectory of the git repo (e.g., `playground/react-agent/`), worktrees replicate the full repo structure. Read `project_dir` from `.evolver.json` to resolve paths correctly:
````
````diff
@@ -189,6 +244,14 @@ wait # Wait for all data gathering to complete
 ```
 
 If `best_results.json` exists, parse it to find failing examples (score < 0.7). Group by metadata or error pattern.
+**For each failing example, include the judge's feedback comment** (from the `feedback` field) in the strategy. This gives proposers specific, actionable information about WHY examples fail:
+
+```
+## Failing Examples (with judge feedback)
+- "What is Kotlin?" (score: 0.3) — Judge: "Response was factually correct but missed null safety and Android development use cases"
+- "Calculate 2^32" (score: 0.0) — Judge: "Run failed with timeout error"
+```
+
 This failure data feeds into the strategy and lens generation step (1.8a).
 If no best_results.json (first iteration without baseline), all proposers work from code analysis only — no failure data available.
 
````
````diff
@@ -332,10 +395,22 @@ Only run evaluation (Step 3) for proposers that committed changes (not abstained
 
 ### 3. Run Target for Each Candidate (Parallel)
 
-
+First, copy config files into each worktree (untracked files aren't replicated by git — this was the #1 bug in all real-world runs):
+
+```bash
+for WORKTREE in {worktree_paths_with_commits}; do
+  WORKTREE_PROJECT="$WORKTREE"
+  [ -n "$PROJECT_DIR" ] && WORKTREE_PROJECT="$WORKTREE/$PROJECT_DIR"
+
+  # Copy untracked config files needed by run_eval.py and the agent
+  cp .evolver.json "$WORKTREE_PROJECT/.evolver.json" 2>/dev/null
+  [ -f .env ] && cp .env "$WORKTREE_PROJECT/.env" 2>/dev/null
+done
+```
+
+Then run evaluations for ALL candidates simultaneously:
 
 ```bash
-# Launch all evaluations in parallel
 for WORKTREE in {worktree_paths_with_commits}; do
   WORKTREE_PROJECT="$WORKTREE"
   [ -n "$PROJECT_DIR" ] && WORKTREE_PROJECT="$WORKTREE/$PROJECT_DIR"
````
````diff
@@ -373,7 +448,7 @@ Then spawn ONE evaluator agent that scores ALL candidates in a single pass. This
 Agent(
   subagent_type: "evolver-evaluator",
   description: "Evaluate all candidates for iteration v{NNN}",
-  prompt: "Experiments to evaluate: {comma-separated experiment names from non-abstained proposers}. Evaluators: {llm_evaluator_list}. Framework: {framework}. Entry point: {entry_point}."
+  prompt: "Experiments to evaluate: {comma-separated experiment names from non-abstained proposers}. Evaluators: {llm_evaluator_list}. Framework: {framework}. Entry point: {entry_point}. Dataset: {dataset_name}. NOTE: Some examples have expected_behavior rubrics in their metadata — fetch example metadata and use rubrics for scoring when available."
 )
 ```
 
````
````diff
@@ -385,17 +460,47 @@ Wait for the evaluator agent to complete before proceeding.
 $EVOLVER_PY $TOOLS/read_results.py \
   --experiments "{comma-separated list of experiment names from non-abstained proposers}" \
   --config .evolver.json \
+  --split held_out \
   --output comparison.json
 ```
 
 Parse `comparison.json`:
-- `comparison.winner` — highest combined score
+- `comparison.winner` — highest combined score **on held-out data** (never seen during optimization)
 - `comparison.champion` — per-task champion (for next iteration's context)
+- `comparison.pareto_front` — non-dominated candidates across evaluators (if >1, report tradeoffs)
 - `comparison.all_candidates` — all scores for reporting
 
+If `comparison.pareto_front` has more than 1 entry, report it:
+```
+Pareto front ({N} non-dominated candidates):
+  v{NNN}-1: {evaluator_scores} (winner by combined score)
+  v{NNN}-3: {evaluator_scores} (different tradeoff)
+```
+
+### 4.5. Constraint Gate
+
+Before merging, validate the winner passes hard constraints:
+
+```bash
+$EVOLVER_PY $TOOLS/constraint_check.py \
+  --config .evolver.json \
+  --worktree-path "{winner_worktree_path}" \
+  --baseline-path "." \
+  --output constraint_result.json
+```
+
+If `all_pass` is false, skip this candidate and try the next-best from `comparison.all_candidates`. If NO candidates pass constraints, log a warning and proceed to next iteration without merging:
+
+```
+WARNING: No candidates passed constraint gates. Skipping merge.
+  growth: {growth_pct}% (limit: 30%)
+  entry_point: {pass/fail}
+  tests: {pass/fail}
+```
+
 ### 5. Merge Winner
 
-If the winner scored higher than the current best:
+If the winner scored higher than the current best AND passed constraint gates:
 
 ```bash
 # Get the winning worktree's branch
````
````diff
@@ -413,6 +518,11 @@ Extract winner metrics for the chart:
 - `per_evaluator` → average each evaluator's scores across per_example from best_results.json
 - `approach` → first line of `## Approach` section from winner's proposal.md
 - `lens` → the `source` field from the winning proposer's lens in lenses.json
+- `code_loc` → count lines of code after merge for growth tracking:
+
+```bash
+CODE_LOC=$(find . -name "*.py" -not -path "./.venv/*" -not -path "./venv/*" -not -path "./__pycache__/*" | xargs wc -l 2>/dev/null | tail -1 | awk '{print $1}')
+```
 
 ```python
 import json
````
````diff
@@ -431,7 +541,8 @@ c['history'].append({
   'total': {winner_total},
   'per_evaluator': {winner_per_evaluator_dict},
   'approach': '{approach_from_proposal_md}',
-  'lens': '{lens_source}'
+  'lens': '{lens_source}',
+  'code_loc': {code_loc}
 })
 json.dump(c, open('.evolver.json', 'w'), indent=2)
 ```
````
@@ -0,0 +1,154 @@
|
|
|
1
|
+
#!/usr/bin/env python3
|
|
2
|
+
"""Constraint checker for evolution proposals.
|
|
3
|
+
|
|
4
|
+
Validates that a candidate proposal doesn't violate hard constraints
|
|
5
|
+
before it's merged. Inspired by Hermes Agent Self-Evolution.
|
|
6
|
+
|
|
7
|
+
Usage:
|
|
8
|
+
python3 constraint_check.py \
|
|
9
|
+
--config .evolver.json \
|
|
10
|
+
--worktree-path /tmp/worktree \
|
|
11
|
+
--baseline-path /path/to/main \
|
|
12
|
+
--output constraint_result.json
|
|
13
|
+
|
|
14
|
+
Stdlib-only — no langsmith dependency.
|
|
15
|
+
"""
|
|
16
|
+
|
|
17
|
+
import argparse
|
|
18
|
+
import json
|
|
19
|
+
import os
|
|
20
|
+
import subprocess
|
|
21
|
+
import sys
|
|
22
|
+
|
|
23
|
+
|
|
24
|
+
def count_loc(directory, extensions=(".py", ".js", ".ts", ".jsx", ".tsx")):
|
|
25
|
+
"""Count lines of code in a directory, excluding venvs and node_modules."""
|
|
26
|
+
total = 0
|
|
27
|
+
skip_dirs = {".venv", "venv", "node_modules", "__pycache__", ".git"}
|
|
28
|
+
for root, dirs, files in os.walk(directory):
|
|
29
|
+
dirs[:] = [d for d in dirs if d not in skip_dirs]
|
|
30
|
+
for f in files:
|
|
31
|
+
if any(f.endswith(ext) for ext in extensions):
|
|
32
|
+
try:
|
|
33
|
+
with open(os.path.join(root, f)) as fh:
|
|
34
|
+
total += sum(1 for _ in fh)
|
|
35
|
+
except (OSError, UnicodeDecodeError):
|
|
36
|
+
pass
|
|
37
|
+
return total
|
|
38
|
+
|
|
39
|
+
|
|
40
|
+
def check_growth(baseline_loc, candidate_loc, max_growth_pct=30):
|
|
41
|
+
"""Check code didn't grow beyond threshold."""
|
|
42
|
+
if baseline_loc == 0:
|
|
43
|
+
return {"pass": True, "reason": "no baseline LOC"}
|
|
44
|
+
growth = ((candidate_loc - baseline_loc) / baseline_loc) * 100
|
|
45
|
+
passed = growth <= max_growth_pct
|
|
46
|
+
return {
|
|
47
|
+
"pass": passed,
|
|
48
|
+
"baseline_loc": baseline_loc,
|
|
49
|
+
"candidate_loc": candidate_loc,
|
|
50
|
+
"growth_pct": round(growth, 1),
|
|
51
|
+
"max_growth_pct": max_growth_pct,
|
|
52
|
+
"reason": f"Code growth {growth:.1f}% {'<=' if passed else '>'} {max_growth_pct}% limit",
|
|
53
|
+
}
|
|
54
|
+
|
|
55
|
+
|
|
56
|
+
def check_entry_point(worktree_path, entry_point):
|
|
57
|
+
"""Check that the entry point is still runnable (syntax check)."""
|
|
58
|
+
parts = entry_point.split()
|
|
59
|
+
script_file = None
|
|
60
|
+
for part in parts:
|
|
61
|
+
if part.endswith((".py", ".js", ".ts", ".sh")):
|
|
62
|
+
script_file = part
|
|
63
|
+
break
|
|
64
|
+
|
|
65
|
+
if not script_file:
|
|
66
|
+
return {"pass": True, "reason": "no script file detected in entry_point"}
|
|
67
|
+
|
|
68
|
+
full_path = os.path.join(worktree_path, script_file)
|
|
69
|
+
if not os.path.exists(full_path):
|
|
70
|
+
return {"pass": False, "reason": f"entry point file missing: {script_file}"}
|
|
71
|
+
|
|
72
|
+
if script_file.endswith(".py"):
|
|
73
|
+
result = subprocess.run(
|
|
74
|
+
["python3", "-m", "py_compile", full_path],
|
|
75
|
+
capture_output=True, text=True,
|
|
76
|
+
)
|
|
77
|
+
if result.returncode != 0:
|
|
78
|
+
return {"pass": False, "reason": f"syntax error: {result.stderr[:200]}"}
|
|
79
|
+
|
|
80
|
+
return {"pass": True, "reason": "entry point exists and has valid syntax"}
|
|
81
|
+
|
|
82
|
+
|
|
83
|
+
def check_tests(worktree_path):
|
|
84
|
+
"""Run test suite if it exists. Returns pass if no tests found."""
|
|
85
|
+
test_dirs = ["tests", "test"]
|
|
86
|
+
has_tests = False
|
|
87
|
+
for td in test_dirs:
|
|
88
|
+
test_path = os.path.join(worktree_path, td)
|
|
89
|
+
if os.path.isdir(test_path):
|
|
90
|
+
for f in os.listdir(test_path):
|
|
91
|
+
if f.startswith("test_") and f.endswith(".py"):
|
|
92
|
+
has_tests = True
|
|
93
|
+
break
|
|
94
|
+
|
|
95
|
+
if not has_tests:
|
|
96
|
+
return {"pass": True, "reason": "no test suite found (skipped)", "skipped": True}
|
|
97
|
+
|
|
98
|
+
try:
|
|
99
|
+
result = subprocess.run(
|
|
100
|
+
["python3", "-m", "pytest", "-q", "--tb=no"],
|
|
101
|
+
capture_output=True, text=True,
|
|
102
|
+
cwd=worktree_path, timeout=120,
|
|
103
|
+
)
|
|
104
|
+
passed = result.returncode == 0
|
|
105
|
+
return {
|
|
106
|
+
"pass": passed,
|
|
107
|
+
"reason": result.stdout.strip()[:200] if passed else result.stderr.strip()[:200],
|
|
108
|
+
"skipped": False,
|
|
109
|
+
}
|
|
110
|
+
except FileNotFoundError:
|
|
111
|
+
return {"pass": True, "reason": "pytest not available (skipped)", "skipped": True}
|
|
112
|
+
except subprocess.TimeoutExpired:
|
|
113
|
+
return {"pass": False, "reason": "test suite timed out after 120s", "skipped": False}
|
|
114
|
+
|
|
115
|
+
|
|
116
|
+
def main():
|
|
117
|
+
parser = argparse.ArgumentParser(description="Check constraints on a proposal")
|
|
118
|
+
parser.add_argument("--config", default=".evolver.json")
|
|
119
|
+
parser.add_argument("--worktree-path", required=True, help="Candidate worktree path")
|
|
120
|
+
parser.add_argument("--baseline-path", default=".", help="Baseline (main) path")
|
|
121
|
+
parser.add_argument("--max-growth", type=int, default=30, help="Max code growth %% (default 30)")
|
|
122
|
+
parser.add_argument("--output", default=None)
|
|
123
|
+
args = parser.parse_args()
|
|
124
|
+
|
|
125
|
+
with open(args.config) as f:
|
|
126
|
+
config = json.load(f)
|
|
127
|
+
|
|
128
|
+
entry_point = config.get("entry_point", "")
|
|
129
|
+
ep_for_check = entry_point.split("python ")[-1].split("python3 ")[-1]
|
|
130
|
+
|
|
131
|
+
results = {
|
|
132
|
+
"growth": check_growth(
|
|
133
|
+
count_loc(args.baseline_path),
|
|
134
|
+
count_loc(args.worktree_path),
|
|
135
|
+
args.max_growth,
|
|
136
|
+
),
|
|
137
|
+
"entry_point": check_entry_point(args.worktree_path, ep_for_check),
|
|
138
|
+
"tests": check_tests(args.worktree_path),
|
|
139
|
+
}
|
|
140
|
+
|
|
141
|
+
all_pass = all(r["pass"] for r in results.values())
|
|
142
|
+
output = {"all_pass": all_pass, "constraints": results}
|
|
143
|
+
|
|
144
|
+
out_str = json.dumps(output, indent=2)
|
|
145
|
+
if args.output:
|
|
146
|
+
with open(args.output, "w") as f:
|
|
147
|
+
f.write(out_str)
|
|
148
|
+
print(out_str)
|
|
149
|
+
|
|
150
|
+
sys.exit(0 if all_pass else 1)
|
|
151
|
+
|
|
152
|
+
|
|
153
|
+
if __name__ == "__main__":
|
|
154
|
+
main()
|
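The growth gate in the new tool is plain percentage math. A standalone sketch of `check_growth` as it appears in the diff, exercised with hypothetical LOC counts:

```python
def check_growth(baseline_loc, candidate_loc, max_growth_pct=30):
    """Check code didn't grow beyond threshold (mirrors constraint_check.py)."""
    if baseline_loc == 0:
        return {"pass": True, "reason": "no baseline LOC"}
    growth = ((candidate_loc - baseline_loc) / baseline_loc) * 100
    passed = growth <= max_growth_pct
    return {
        "pass": passed,
        "growth_pct": round(growth, 1),
        "reason": f"Code growth {growth:.1f}% {'<=' if passed else '>'} {max_growth_pct}% limit",
    }

print(check_growth(1000, 1250)["pass"], check_growth(1000, 1250)["growth_pct"])  # True 25.0
print(check_growth(1000, 1400)["pass"], check_growth(1000, 1400)["growth_pct"])  # False 40.0
```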
package/tools/dataset_health.py CHANGED
````diff
@@ -15,6 +15,14 @@ import os
 import sys
 from datetime import datetime, timezone
 
+# Secret detection (local import from same directory)
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+try:
+    from secret_filter import has_secrets
+except ImportError:
+    def has_secrets(text):
+        return False
+
 
 def ensure_langsmith_api_key():
     """Load API key from langsmith-cli credentials if not in env."""
````
````diff
@@ -344,10 +352,32 @@ def main():
    except Exception:
        pass

+    # Check for secrets in dataset examples
+    secrets_check = {"checked": True, "flagged_count": 0, "flagged_ids": [], "clean": True}
+    for ex in examples:
+        text = str(getattr(ex, 'inputs', '') or '') + str(getattr(ex, 'outputs', '') or '')
+        if has_secrets(text):
+            secrets_check["flagged_count"] += 1
+            secrets_check["flagged_ids"].append(str(ex.id))
+            secrets_check["clean"] = False
+    secrets_check["flagged_ids"] = secrets_check["flagged_ids"][:10]  # Cap at 10
+
     # Compute health score and build report
     health_score = compute_health_score(size_info, difficulty, dead, coverage, splits)
     issues, corrections = build_issues_and_corrections(size_info, difficulty, dead, coverage, splits)
 
+    # Add secret issues
+    if not secrets_check["clean"]:
+        issues.append({
+            "severity": "critical",
+            "message": f"{secrets_check['flagged_count']} example(s) contain potential secrets (API keys, tokens)",
+        })
+        corrections.append({
+            "action": "remove_secrets",
+            "description": f"Remove or redact {secrets_check['flagged_count']} examples with detected secrets",
+            "example_ids": secrets_check["flagged_ids"],
+        })
+
     report = {
         "generated_at": datetime.now(timezone.utc).isoformat(),
         "health_score": health_score,
````
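The check above delegates to `secret_filter.py`, which is not shown in this diff (listed as +97 lines). The stand-in below only illustrates how a `has_secrets(text)` predicate might flag common credential formats; the patterns are assumptions, not the package's actual rules:

```python
import re

# Hypothetical stand-in for secret_filter.has_secrets
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),   # OpenAI-style API keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),   # GitHub personal access tokens
    re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key IDs
]

def has_secrets(text):
    # True if any known credential pattern appears in the text
    return any(p.search(text) for p in SECRET_PATTERNS)

print(has_secrets("my key is sk-" + "a" * 24))  # True
print(has_secrets("What is Kotlin?"))           # False
```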
````diff
@@ -357,6 +387,7 @@ def main():
         "dead_examples": dead,
         "coverage": coverage,
         "splits": splits,
+        "secrets": secrets_check,
         "issues": issues,
         "corrections": corrections,
     }
````
package/tools/evolution_chart.py CHANGED
````diff
@@ -101,7 +101,9 @@ def render_score_table(history, scores, c):
     lines = []
     lines.append(f' {c.B}SCORE PROGRESSION{c.RST}')
     lines.append(f' {c.D}{"─" * W}{c.RST}')
-
+    has_loc = any(h.get('code_loc') for h in history)
+    loc_hdr = f'{"LOC":>6}' if has_loc else ''
+    lines.append(f' {c.D}{"Version":<10}{"Score":>6}{"Δ":>8}{"vs Base":>9}{"Pass":>7}{"Err":>5}{"Tokens":>8}{"Latency":>9}{loc_hdr}{c.RST}')
     lines.append(f' {c.D}{"─" * W}{c.RST}')
 
     for i, h in enumerate(history):
````
````diff
@@ -140,7 +142,19 @@ def render_score_table(history, scores, c):
         tok_str = fmt_tokens(tokens)
         lat_str = f'{latency}ms' if latency else '—'
 
-
+        loc_str = ''
+        if has_loc:
+            loc = h.get('code_loc')
+            if loc:
+                base_loc = history[0].get('code_loc', 0)
+                if base_loc and loc > base_loc * 1.3:
+                    loc_str = f' {c.R}{loc}{c.RST}⚠'
+                else:
+                    loc_str = f' {loc:>5}'
+            else:
+                loc_str = ' —'
+
+        lines.append(f' {v:<10}{s_str:>6} {d_str} {p_str} {pass_str:>5} {e_str:>3} {tok_str:>6} {lat_str:>6}{loc_str} {icon}')
 
     return '\n'.join(lines)
 
````
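The LOC column above warns once a candidate exceeds 1.3x the baseline line count, matching the 30% constraint gate. A reduced sketch of that threshold check (function name and demo values are hypothetical; color codes dropped):

```python
def flag_loc(loc, base_loc):
    # Mirrors the chart logic: warn when LOC grew past 1.3x the baseline
    if not loc:
        return "—"
    if base_loc and loc > base_loc * 1.3:
        return f"{loc}⚠"
    return str(loc)

print(flag_loc(1400, 1000))  # 1400⚠
print(flag_loc(1200, 1000))  # 1200
```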
@@ -0,0 +1,150 @@
|
|
|
1
|
+
#!/usr/bin/env python3
|
|
2
|
+
"""Mine Claude Code session history for eval dataset examples.
|
|
3
|
+
|
|
4
|
+
Reads ~/.claude/ session files to extract real user interactions
|
|
5
|
+
that can be used as evaluation data. Filters for relevance to the
|
|
6
|
+
agent being optimized, detects and skips secrets.
|
|
7
|
+
|
|
8
|
+
Usage:
|
|
9
|
+
python3 mine_sessions.py \
|
|
10
|
+
--agent-description "A ReAct agent that answers questions using tools" \
|
|
11
|
+
--output session_examples.json \
|
|
12
|
+
[--max-examples 50]
|
|
13
|
+
|
|
14
|
+
Stdlib-only except for secret_filter (local import).
|
|
15
|
+
"""
|
|
16
|
+
|
|
17
|
+
import argparse
|
|
18
|
+
import glob
|
|
19
|
+
import json
|
|
20
|
+
import os
|
|
21
|
+
import sys
|
|
22
|
+
|
|
23
|
+
|
|
24
|
+
def find_session_files():
|
|
25
|
+
"""Find Claude Code session history files."""
|
|
26
|
+
candidates = [
|
|
27
|
+
os.path.expanduser("~/.claude/history.jsonl"),
|
|
28
|
+
os.path.expanduser("~/.claude/sessions/*/messages.jsonl"),
|
|
29
|
+
]
|
|
30
|
+
found = []
|
|
31
|
+
for pattern in candidates:
|
|
32
|
+
found.extend(glob.glob(pattern))
|
|
33
|
+
return found
|
|
34
|
+
|
|
35
|
+
|
|
36
|
+
def extract_messages(file_path):
|
|
37
|
+
"""Extract user->assistant message pairs from a session file."""
|
|
38
|
+
pairs = []
|
|
39
|
+
try:
|
|
40
|
+
with open(file_path) as f:
|
|
41
|
+
messages = []
|
|
42
|
+
for line in f:
|
|
43
|
+
line = line.strip()
|
|
44
|
+
if not line:
|
|
45
|
+
continue
|
|
46
|
+
try:
|
|
47
|
+
msg = json.loads(line)
|
|
48
|
+
messages.append(msg)
|
|
49
|
+
except json.JSONDecodeError:
|
|
50
|
+
continue
|
|
51
|
+
|
|
52
|
+
for i in range(len(messages) - 1):
|
|
53
|
+
if (messages[i].get("role") == "user" and
|
|
54
|
+
messages[i + 1].get("role") == "assistant"):
|
|
55
|
+
user_text = messages[i].get("content", "")
|
|
56
|
+
if isinstance(user_text, list):
|
|
57
|
+
user_text = " ".join(
|
|
58
|
+
p.get("text", "") for p in user_text
|
|
59
|
+
if isinstance(p, dict) and p.get("type") == "text"
|
|
60
|
+
)
|
|
61
|
+
asst_text = messages[i + 1].get("content", "")
|
|
62
|
+
if isinstance(asst_text, list):
|
|
63
|
+
asst_text = " ".join(
|
|
64
|
+
p.get("text", "") for p in asst_text
|
|
65
|
+
if isinstance(p, dict) and p.get("type") == "text"
|
|
66
|
+
)
|
|
67
|
+
|
|
68
|
+
if user_text and len(user_text) > 10:
|
|
69
|
+
pairs.append({
|
|
70
|
+
"input": user_text[:500],
|
|
71
|
+
"output_preview": asst_text[:200] if asst_text else "",
|
|
72
|
+
"source_file": os.path.basename(file_path),
|
|
73
|
+
})
|
|
74
|
+
except (OSError, UnicodeDecodeError):
|
|
75
|
+
pass
|
|
76
|
+
return pairs
|
|
77
|
+
|
|
78
|
+
|
|
79
|
+
def filter_relevant(pairs, agent_description, max_examples=50):
|
|
80
|
+
"""Simple keyword-based relevance filter."""
|
|
81
|
+
stop_words = {"a", "an", "the", "is", "are", "was", "were", "that", "this",
|
|
82
|
+
"and", "or", "for", "to", "in", "on", "with", "using"}
|
|
83
|
+
keywords = set(
|
|
84
|
+
w.lower() for w in agent_description.split()
|
|
85
|
+
if len(w) > 3 and w.lower() not in stop_words
|
|
86
|
+
)
|
|
87
|
+
|
|
88
|
+
scored = []
|
|
89
|
+
for pair in pairs:
|
|
90
|
+
input_words = set(pair["input"].lower().split())
|
|
91
|
+
overlap = len(keywords & input_words)
|
|
92
|
+
if overlap >= 1:
|
|
93
|
+
scored.append((overlap, pair))
|
|
94
|
+
|
|
95
|
+
scored.sort(key=lambda x: -x[0])
|
|
96
|
+
return [pair for _, pair in scored[:max_examples]]
|
|
97
|
+
|
|
98
|
+
|
|
99
|
+
def main():
|
|
100
|
+
parser = argparse.ArgumentParser(description="Mine Claude Code sessions for eval data")
|
|
101
|
+
parser.add_argument("--agent-description", required=True, help="Description of the agent being optimized")
|
|
102
|
+
parser.add_argument("--output", default="session_examples.json")
|
|
103
|
+
parser.add_argument("--max-examples", type=int, default=50)
|
|
104
|
+
args = parser.parse_args()
|
|
105
|
+
|
|
106
|
+
sys.path.insert(0, os.path.dirname(__file__))
|
|
107
|
+
try:
|
|
108
|
+
from secret_filter import has_secrets
|
|
109
|
+
except ImportError:
|
|
110
|
+
has_secrets = lambda text: False # noqa: E731
|
|
111
|
+
|
|
112
|
+
session_files = find_session_files()
|
|
113
|
+
if not session_files:
|
|
114
|
+
print("No Claude Code session files found.", file=sys.stderr)
|
|
115
|
+
print(json.dumps({"mined": 0, "output": args.output}))
|
|
116
|
+
sys.exit(0)
|
|
117
|
+
|
|
118
|
+
print(f"Found {len(session_files)} session file(s)", file=sys.stderr)
|
|
119
|
+
|
|
120
|
+
all_pairs = []
|
|
121
|
+
secrets_skipped = 0
|
|
122
|
+
for sf in session_files:
|
|
123
|
+
pairs = extract_messages(sf)
|
|
124
|
+
for p in pairs:
|
|
125
|
+
if has_secrets(p["input"]) or has_secrets(p.get("output_preview", "")):
|
|
126
|
+
secrets_skipped += 1
|
|
127
|
+
continue
|
|
128
|
+
all_pairs.append(p)
|
|
129
|
+
|
|
130
|
+
print(f"Extracted {len(all_pairs)} message pairs ({secrets_skipped} skipped for secrets)", file=sys.stderr)
|
|
131
|
+
|
|
132
|
+
relevant = filter_relevant(all_pairs, args.agent_description, args.max_examples)
|
|
133
|
+
print(f"Filtered to {len(relevant)} relevant examples", file=sys.stderr)
|
|
134
|
+
|
|
135
|
+
examples = []
|
|
136
|
+
for p in relevant:
|
|
137
|
+
examples.append({
|
|
138
|
+
"input": p["input"],
|
|
139
|
+
"metadata": {"source": "session_mining", "source_file": p["source_file"]},
|
|
140
|
+
})
|
|
141
|
+
|
|
142
|
+
output = {"examples": examples, "count": len(examples), "source": "claude_code_sessions"}
|
|
143
|
+
with open(args.output, "w") as f:
|
|
144
|
+
json.dump(output, f, indent=2)
|
|
145
|
+
|
|
146
|
+
print(json.dumps({"mined": len(examples), "output": args.output}))
|
|
147
|
+
|
|
148
|
+
|
|
149
|
+
if __name__ == "__main__":
|
|
150
|
+
main()
|
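The keyword-overlap filter in `mine_sessions.py` can be exercised in isolation. A sketch using the function as written in the diff, fed hypothetical session pairs:

```python
def filter_relevant(pairs, agent_description, max_examples=50):
    """Keyword-overlap relevance filter, as in mine_sessions.py."""
    stop_words = {"a", "an", "the", "is", "are", "was", "were", "that", "this",
                  "and", "or", "for", "to", "in", "on", "with", "using"}
    keywords = set(
        w.lower() for w in agent_description.split()
        if len(w) > 3 and w.lower() not in stop_words
    )
    scored = []
    for pair in pairs:
        overlap = len(keywords & set(pair["input"].lower().split()))
        if overlap >= 1:
            scored.append((overlap, pair))
    scored.sort(key=lambda x: -x[0])
    return [pair for _, pair in scored[:max_examples]]

pairs = [
    {"input": "how do I add a tool to my react agent"},
    {"input": "what's for dinner tonight"},
]
desc = "A ReAct agent that answers questions using tools"
relevant = filter_relevant(pairs, desc)
print([p["input"] for p in relevant])  # ['how do I add a tool to my react agent']
```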
package/tools/read_results.py CHANGED
````diff
@@ -61,7 +61,28 @@ def ensure_langsmith_api_key():
     return False
 
 
-def read_experiment(client, experiment_name):
+def weighted_score(scores, weights=None):
+    """Calculate weighted average of evaluator scores.
+
+    If weights provided, use them. Otherwise flat average.
+    Weights are normalized (don't need to sum to 1).
+    """
+    if not scores:
+        return 0.0
+    if not weights:
+        return sum(scores.values()) / len(scores)
+
+    total_weight = 0
+    weighted_sum = 0
+    for key, val in scores.items():
+        w = weights.get(key, 1.0)
+        weighted_sum += val * w
+        total_weight += w
+
+    return weighted_sum / total_weight if total_weight > 0 else 0.0
+
+
+def read_experiment(client, experiment_name, weights=None):
     """Read results from a single LangSmith experiment."""
     try:
         # List runs for this experiment
````
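A quick sketch of how the new `weighted_score` helper behaves, using the function as written in the diff (the score and weight values are hypothetical):

```python
def weighted_score(scores, weights=None):
    """Weighted average of evaluator scores; flat average when no weights."""
    if not scores:
        return 0.0
    if not weights:
        return sum(scores.values()) / len(scores)
    total_weight = 0
    weighted_sum = 0
    for key, val in scores.items():
        w = weights.get(key, 1.0)   # evaluators without a weight default to 1.0
        weighted_sum += val * w
        total_weight += w
    return weighted_sum / total_weight if total_weight > 0 else 0.0

scores = {"correctness": 0.8, "conciseness": 0.4}
print(round(weighted_score(scores), 3))                        # 0.6
print(round(weighted_score(scores, {"correctness": 3.0}), 3))  # 0.7
```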
````diff
@@ -103,13 +124,17 @@ def read_experiment(client, experiment_name):
        # Read feedback/scores from pre-fetched batch
        feedbacks = fb_map.get(str(run.id), [])
        scores = {}
+        feedback_comments = {}
        for fb in feedbacks:
            if fb.score is not None:
                scores[fb.key] = fb.score
+            if fb.comment:
+                feedback_comments[fb.key] = fb.comment

        per_example[example_id] = {
-            "score":
+            "score": weighted_score(scores, weights),
            "scores": scores,
+            "feedback": feedback_comments,
            "tokens": tokens,
            "latency_ms": latency_ms,
            "error": run.error[:200] if run.error else None,
````
@@ -136,6 +161,50 @@ def read_experiment(client, experiment_name):
|
|
|
136
161
|
return {"experiment": experiment_name, "error": str(e), "combined_score": 0.0}
|
|
137
162
|
|
|
138
163
|
|
|
164
|
+
def pareto_front(candidates):
|
|
165
|
+
"""Find Pareto-optimal candidates (not dominated on any evaluator).
|
|
166
|
+
|
|
167
|
+
A candidate is dominated if another scores >= on ALL evaluators
|
|
168
|
+
and strictly > on at least one.
|
|
169
|
+
"""
|
|
170
|
+
if len(candidates) <= 1:
|
|
171
|
+
return candidates
|
|
172
|
+
|
|
173
|
+
front = []
|
|
174
|
+
for i, ci in enumerate(candidates):
|
|
175
|
+
dominated = False
|
|
176
|
+
ci_scores = ci.get("evaluator_scores", {})
|
|
177
|
+
if not ci_scores:
|
|
178
|
+
front.append(ci)
|
|
179
|
+
continue
|
|
180
|
+
|
|
181
|
+
for j, cj in enumerate(candidates):
|
|
182
|
+
if i == j:
|
|
183
|
+
continue
|
|
184
|
+
cj_scores = cj.get("evaluator_scores", {})
|
|
185
|
+
if not cj_scores:
|
|
186
|
+
continue
|
|
187
|
+
|
|
188
|
+
all_geq = True
|
|
189
|
+
any_gt = False
|
|
190
|
+
for key in ci_scores:
|
|
191
|
+
if key in cj_scores:
|
|
192
|
+
if cj_scores[key] < ci_scores[key]:
|
|
193
|
+
all_geq = False
|
|
194
|
+
break
|
|
195
|
+
if cj_scores[key] > ci_scores[key]:
|
|
196
|
+
any_gt = True
|
|
197
|
+
|
|
198
|
+
if all_geq and any_gt:
|
|
199
|
+
dominated = True
|
|
200
|
+
break
|
|
201
|
+
|
|
202
|
+
if not dominated:
|
|
203
|
+
front.append(ci)
|
|
204
|
+
|
|
205
|
+
return front if front else candidates[:1]
|
|
206
|
+
|
|
207
|
+
|
|
139
208
|
def compare_experiments(results_list):
|
|
140
209
|
"""Compare multiple experiment results and find winner + per-task champion."""
|
|
141
210
|
if not results_list:
|
|
@@ -173,12 +242,27 @@ def compare_experiments(results_list):
|
|
|
173
242
|
"task_wins": task_wins[champion_name],
|
|
174
243
|
}
|
|
175
244
|
|
|
245
|
+
# Compute per-evaluator averages for Pareto analysis
|
|
246
|
+
for result in valid:
|
|
247
|
+
eval_avgs = {}
|
|
248
|
+
for ex_data in result.get("per_example", {}).values():
|
|
249
|
+
for ev_key, ev_score in ex_data.get("scores", {}).items():
|
|
250
|
+
eval_avgs.setdefault(ev_key, []).append(ev_score)
|
|
251
|
+
result["evaluator_scores"] = {k: sum(v) / len(v) for k, v in eval_avgs.items()}
|
|
252
|
+
|
|
253
|
+
front = pareto_front(valid)
|
|
254
|
+
|
|
176
255
|
return {
|
|
177
256
|
"winner": {
|
|
178
257
|
"experiment": winner["experiment"],
|
|
179
258
|
"score": winner["combined_score"],
|
|
180
259
|
},
|
|
181
260
|
"champion": champion,
|
|
261
|
+
"pareto_front": [
|
|
262
|
+
{"experiment": r["experiment"], "score": r["combined_score"],
|
|
263
|
+
"evaluator_scores": r.get("evaluator_scores", {})}
|
|
264
|
+
for r in front
|
|
265
|
+
],
|
|
182
266
|
"all_candidates": [
|
|
183
267
|
{
|
|
184
268
|
"experiment": r["experiment"],
|
|
@@ -231,12 +315,19 @@ def main():
|
|
|
231
315
|
args = parser.parse_args()
|
|
232
316
|
ensure_langsmith_api_key()
|
|
233
317
|
|
|
318
|
+
# Load evaluator weights from config if available
|
|
319
|
+
weights = None
|
|
320
|
+
if os.path.exists(args.config):
|
|
321
|
+
with open(args.config) as f:
|
|
322
|
+
cfg = json.load(f)
|
|
323
|
+
weights = cfg.get("evaluator_weights")
|
|
324
|
+
|
|
234
325
|
from langsmith import Client
|
|
235
326
|
client = Client()
|
|
236
327
|
|
|
237
328
|
if args.experiment:
|
|
238
329
|
# Single experiment
|
|
239
|
-
result = read_experiment(client, args.experiment)
|
|
330
|
+
result = read_experiment(client, args.experiment, weights=weights)
|
|
240
331
|
if not result:
|
|
241
332
|
print(f"No results found for experiment: {args.experiment}", file=sys.stderr)
|
|
242
333
|
sys.exit(1)
|
|
@@ -267,10 +358,25 @@ def main():
|
|
|
267
358
|
experiment_names = [e.strip() for e in args.experiments.split(",")]
|
|
268
359
|
results_list = []
|
|
269
360
|
|
|
361
|
+
# Load split filter if requested
|
|
362
|
+
split_example_ids = None
|
|
363
|
+
if args.split:
|
|
364
|
+
with open(args.config) as f:
|
|
365
|
+
cfg_for_split = json.load(f)
|
|
366
|
+
split_example_ids = set()
|
|
367
|
+
for ex in client.list_examples(dataset_name=cfg_for_split["dataset"], splits=[args.split]):
|
|
368
|
+
split_example_ids.add(str(ex.id))
|
|
369
|
+
|
|
270
370
|
for name in experiment_names:
|
|
271
371
|
print(f"Reading experiment: {name}...", file=sys.stderr)
|
|
272
|
-
result = read_experiment(client, name)
|
|
372
|
+
result = read_experiment(client, name, weights=weights)
|
|
273
373
|
if result:
|
|
374
|
+
# Apply split filter to each experiment
|
|
375
|
+
if split_example_ids is not None and "per_example" in result:
|
|
376
|
+
result["per_example"] = {k: v for k, v in result["per_example"].items() if k in split_example_ids}
|
|
377
|
+
all_scores = [v["score"] for v in result["per_example"].values()]
|
|
378
|
+
result["combined_score"] = sum(all_scores) / len(all_scores) if all_scores else 0.0
|
|
379
|
+
result["num_examples"] = len(result["per_example"])
|
|
274
380
|
results_list.append(result)
|
|
275
381
|
|
|
276
382
|
if not results_list:
|
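The two helpers in the read_results.py diff are self-contained and easy to sanity-check outside LangSmith. A minimal sketch (the functions are copied from the diff; the example experiments and scores are made up):

```python
def weighted_score(scores, weights=None):
    """Weighted average of evaluator scores; flat average when no weights."""
    if not scores:
        return 0.0
    if not weights:
        return sum(scores.values()) / len(scores)
    total_weight = 0
    weighted_sum = 0
    for key, val in scores.items():
        w = weights.get(key, 1.0)  # evaluators absent from weights default to 1.0
        weighted_sum += val * w
        total_weight += w
    return weighted_sum / total_weight if total_weight > 0 else 0.0


def pareto_front(candidates):
    """Keep candidates no other candidate beats on every evaluator."""
    if len(candidates) <= 1:
        return candidates
    front = []
    for i, ci in enumerate(candidates):
        dominated = False
        ci_scores = ci.get("evaluator_scores", {})
        if not ci_scores:
            front.append(ci)
            continue
        for j, cj in enumerate(candidates):
            if i == j:
                continue
            cj_scores = cj.get("evaluator_scores", {})
            if not cj_scores:
                continue
            all_geq = True
            any_gt = False
            for key in ci_scores:
                if key in cj_scores:
                    if cj_scores[key] < ci_scores[key]:
                        all_geq = False
                        break
                    if cj_scores[key] > ci_scores[key]:
                        any_gt = True
            if all_geq and any_gt:  # cj dominates ci
                dominated = True
                break
        if not dominated:
            front.append(ci)
    return front if front else candidates[:1]


scores = {"correctness": 1.0, "conciseness": 0.5}
flat = weighted_score(scores)                          # (1.0 + 0.5) / 2 = 0.75
weighted = weighted_score(scores, {"correctness": 3})  # (3*1.0 + 1*0.5) / 4 = 0.875

candidates = [
    {"experiment": "a", "evaluator_scores": {"correctness": 1.0, "conciseness": 0.5}},
    {"experiment": "b", "evaluator_scores": {"correctness": 0.5, "conciseness": 1.0}},
    {"experiment": "c", "evaluator_scores": {"correctness": 0.4, "conciseness": 0.4}},
]
front = [c["experiment"] for c in pareto_front(candidates)]  # ["a", "b"]: "c" is dominated
```

Note that re-weighting shifts the single combined score, while the Pareto front keeps trade-off candidates that a weighted average would discard.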
package/tools/run_eval.py
CHANGED

@@ -72,7 +72,20 @@ def make_target(entry_point, cwd):

     try:
         cmd = entry_point
-        if "{input}" in cmd:
+
+        # {input_text}: extract plain text from inputs dict (for agents expecting --query "text")
+        if "{input_text}" in cmd:
+            import shlex
+            text = ""
+            for key in ("input", "question", "query", "prompt", "text", "user_input"):
+                if key in inputs and isinstance(inputs[key], str):
+                    text = inputs[key]
+                    break
+            if not text and inputs:
+                first_val = next(iter(inputs.values()), "")
+                text = str(first_val) if not isinstance(first_val, str) else first_val
+            cmd = cmd.replace("{input_text}", shlex.quote(text))
+        elif "{input}" in cmd:
             # Placeholder: replace with path to JSON file
             cmd = cmd.replace("{input}", input_path)
         elif "{input_json}" in cmd:

@@ -167,6 +180,7 @@ def main():
     parser.add_argument("--experiment-prefix", required=True, help="Experiment name prefix (e.g. v001a)")
     parser.add_argument("--timeout", type=int, default=120, help="Per-task timeout in seconds")
     parser.add_argument("--concurrency", type=int, default=None, help="Max concurrent evaluations (default: from config or 1)")
+    parser.add_argument("--no-canary", action="store_true", help="Skip canary preflight check")
     args = parser.parse_args()

     with open(args.config) as f:

@@ -187,6 +201,32 @@ def main():
     llm_evaluators = [k for k in config["evaluators"] if k in ("correctness", "conciseness")]
     code_evaluators = [k for k in config["evaluators"] if k not in ("correctness", "conciseness")]

+    # Canary run: verify agent works before burning through full dataset
+    if not args.no_canary:
+        print("  Canary: running 1 example preflight...", file=sys.stderr)
+        try:
+            canary_examples = list(client.list_examples(dataset_name=config["dataset"], limit=1))
+            if canary_examples:
+                canary_result = target(canary_examples[0].inputs)
+                canary_output = canary_result.get("output", "")
+                canary_error = canary_result.get("error", "")
+                if not canary_output and canary_error:
+                    print(f"  CANARY FAILED: Agent produced no output.", file=sys.stderr)
+                    print(f"  Error: {canary_error}", file=sys.stderr)
+                    print(f"  Fix the agent before running full evaluation.", file=sys.stderr)
+                    output = {
+                        "experiment": None,
+                        "prefix": args.experiment_prefix,
+                        "combined_score": 0.0,
+                        "error": f"Canary failed: {canary_error[:200]}",
+                    }
+                    print(json.dumps(output))
+                    sys.exit(2)
+                else:
+                    print(f"  Canary passed: got output ({len(str(canary_output))} chars)", file=sys.stderr)
+        except Exception as e:
+            print(f"  Canary check failed: {e} (proceeding anyway)", file=sys.stderr)
+
     print(f"Running evaluation: {args.experiment_prefix}")
     print(f"  Dataset: {config['dataset']}")
     print(f"  Worktree: {args.worktree_path}")
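The `{input_text}` handling in the run_eval.py diff is easy to exercise on its own. A sketch of the same extraction and quoting logic, pulled out of `make_target` into a hypothetical `fill_input_text` helper:

```python
import shlex

def fill_input_text(cmd, inputs):
    """Substitute {input_text} with a shell-quoted plain-text field from inputs."""
    text = ""
    # Prefer well-known text fields, in the same order run_eval.py checks them
    for key in ("input", "question", "query", "prompt", "text", "user_input"):
        if key in inputs and isinstance(inputs[key], str):
            text = inputs[key]
            break
    if not text and inputs:
        # Fall back to the first value, stringified if it is not already a str
        first_val = next(iter(inputs.values()), "")
        text = str(first_val) if not isinstance(first_val, str) else first_val
    return cmd.replace("{input_text}", shlex.quote(text))

cmd = fill_input_text("python agent.py --query {input_text}", {"question": "what's new?"})
args = shlex.split(cmd)  # quoting survives the round trip: last arg is "what's new?"
```

`shlex.quote` is what makes this safe to pass through a shell: apostrophes and spaces in the example question arrive at the agent as a single argument.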
package/tools/secret_filter.py
ADDED

@@ -0,0 +1,97 @@
+#!/usr/bin/env python3
+"""Secret detection and filtering for eval datasets.
+
+Detects API keys, tokens, passwords, and other sensitive data in text.
+Used by seed_from_traces.py and dataset_health.py.
+
+Usage:
+    echo "text with sk-ant-api..." | python3 secret_filter.py
+    python3 secret_filter.py < file.txt
+
+Stdlib-only — no external dependencies.
+"""
+
+import re
+import json
+import sys
+
+
+SECRET_PATTERNS = re.compile(
+    r'('
+    r'sk-ant-api\S{20,}'
+    r'|sk-or-v1-\S{20,}'
+    r'|sk-\S{20,}'
+    r'|ghp_\S{20,}'
+    r'|gho_\S{20,}'
+    r'|github_pat_\S{20,}'
+    r'|xoxb-\S{20,}'
+    r'|xapp-\S{20,}'
+    r'|ntn_\S{20,}'
+    r'|AKIA[A-Z0-9]{16}'
+    r'|Bearer\s+[A-Za-z0-9\-._~+/]{20,}'
+    r'|-----BEGIN\s+(RSA\s+)?PRIVATE\s+KEY-----'
+    r')',
+    re.IGNORECASE,
+)
+
+ENV_PATTERNS = re.compile(
+    r'(?:ANTHROPIC_API_KEY|OPENAI_API_KEY|LANGSMITH_API_KEY|LANGCHAIN_API_KEY'
+    r'|AWS_SECRET_ACCESS_KEY|DATABASE_URL|POSTGRES_PASSWORD'
+    r'|SLACK_TOKEN|GITHUB_TOKEN|API_KEY|SECRET_KEY'
+    r')\s*[=:]\s*["\']?\S{10,}',
+    re.IGNORECASE,
+)
+
+ASSIGN_PATTERNS = re.compile(
+    r'(?:password|secret|token|api_key|apikey)\s*[=:]\s*["\']?\S{10,}',
+    re.IGNORECASE,
+)
+
+
+def detect_secrets(text):
+    """Return list of secret matches found in text."""
+    if not text:
+        return []
+    findings = []
+    for pattern, name in [
+        (SECRET_PATTERNS, "secret_key"),
+        (ENV_PATTERNS, "env_variable"),
+        (ASSIGN_PATTERNS, "assignment"),
+    ]:
+        for m in pattern.finditer(text):
+            match_text = m.group()
+            redacted = match_text[:10] + "..." + match_text[-4:] if len(match_text) > 20 else match_text
+            findings.append({
+                "pattern": name,
+                "match": redacted,
+                "position": m.start(),
+            })
+    return findings
+
+
+def has_secrets(text):
+    """Quick boolean check — does text contain any secrets?"""
+    if not text:
+        return False
+    return bool(SECRET_PATTERNS.search(text) or ENV_PATTERNS.search(text) or ASSIGN_PATTERNS.search(text))
+
+
+def redact_secrets(text):
+    """Replace detected secrets with [REDACTED]."""
+    if not text:
+        return text
+    text = SECRET_PATTERNS.sub("[REDACTED]", text)
+    text = ENV_PATTERNS.sub("[REDACTED]", text)
+    text = ASSIGN_PATTERNS.sub("[REDACTED]", text)
+    return text
+
+
+if __name__ == "__main__":
+    text = sys.stdin.read()
+    findings = detect_secrets(text)
+    if findings:
+        print(json.dumps({"has_secrets": True, "count": len(findings), "findings": findings}, indent=2))
+        sys.exit(1)
+    else:
+        print(json.dumps({"has_secrets": False, "count": 0}))
+        sys.exit(0)
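A quick smoke test of the detection approach. This sketch uses only a subset of `SECRET_PATTERNS` (the key-prefix rules); the full module also matches env-variable and assignment forms:

```python
import re

# Subset of SECRET_PATTERNS from secret_filter.py (key-prefix rules only)
SECRET_PATTERNS = re.compile(
    r'(sk-\S{20,}|ghp_\S{20,}|AKIA[A-Z0-9]{16})',
    re.IGNORECASE,
)

def has_secrets(text):
    """True when text contains a recognizable key-shaped token."""
    return bool(text) and bool(SECRET_PATTERNS.search(text))

def redact_secrets(text):
    """Replace any match with [REDACTED], mirroring the full module."""
    return SECRET_PATTERNS.sub("[REDACTED]", text) if text else text

leaky = "call with " + "sk-" + "a" * 24 + " as the key"
clean = redact_secrets(leaky)  # "call with [REDACTED] as the key"
```

The `\S{20,}` length floor is what keeps ordinary prose like "sk-8" from tripping the filter while still catching real-length tokens.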
package/tools/seed_from_traces.py
CHANGED

@@ -22,6 +22,14 @@ import sys
 from collections import Counter
 from datetime import datetime, timezone

+# Secret detection (local import from same directory)
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+try:
+    from secret_filter import has_secrets
+except ImportError:
+    def has_secrets(text):
+        return False
+

 def extract_input(run):
     """Extract user input from a run's inputs field."""

@@ -118,9 +126,16 @@ def analyze_runs(runs):
     token_counts = []
     feedbacks = {"positive": 0, "negative": 0, "none": 0}

+    secrets_filtered = 0
     for run in runs:
         user_input = extract_input(run)
         output = extract_output(run)
+
+        # Skip runs containing secrets (API keys, tokens, passwords)
+        if has_secrets(str(user_input or '')) or has_secrets(str(output or '')):
+            secrets_filtered += 1
+            continue
+
         error = run.get("error")
         tokens = run.get("total_tokens") or 0
         latency_ms = None
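The skip-and-count behavior added to `analyze_runs` can be checked with stub data. A sketch with a stand-in `has_secrets` and hypothetical run dicts (the real code imports `has_secrets` from `secret_filter.py`):

```python
def has_secrets(text):
    # Stand-in for secret_filter.has_secrets: flag anything carrying an sk- token
    return "sk-" in text

def count_clean_runs(runs):
    """Mirror the analyze_runs filter: skip runs whose input or output leaks a secret."""
    secrets_filtered = 0
    kept = 0
    for run in runs:
        user_input = run.get("input")
        output = run.get("output")
        # Same guard as the diff: None becomes '' before scanning
        if has_secrets(str(user_input or '')) or has_secrets(str(output or '')):
            secrets_filtered += 1
            continue
        kept += 1
    return kept, secrets_filtered

runs = [
    {"input": "summarize this doc", "output": "done"},
    {"input": "use key sk-abc123", "output": "ok"},
    {"input": None, "output": "fine"},
]
kept, filtered = count_clean_runs(runs)  # kept=2, filtered=1
```

The `str(x or '')` wrapping matters: mined runs can have `None` inputs, and the filter must treat those as clean rather than crash.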
package/tools/setup.py
CHANGED

@@ -180,9 +180,24 @@ def create_dataset_from_file(client, dataset_name, file_path):
         elif "expected" in item:
             ex["outputs"] = {"expected": item["expected"]}

+        # Include rubric/expected behavior in metadata
+        if "expected_behavior" in item:
+            if "metadata" not in ex:
+                ex["metadata"] = {}
+            ex["metadata"]["expected_behavior"] = item["expected_behavior"]
+
+        # Include difficulty and category in metadata
+        for field in ("difficulty", "category"):
+            if field in item:
+                if "metadata" not in ex:
+                    ex["metadata"] = {}
+                ex["metadata"][field] = item[field]
+
         # Include metadata
-        if "metadata" in item:
+        if "metadata" in item and "metadata" not in ex:
             ex["metadata"] = item["metadata"]
+        elif "metadata" in item:
+            ex["metadata"].update(item["metadata"])

         if "metadata" not in ex:
             ex["metadata"] = {}

@@ -548,6 +563,7 @@ def main():
         "project_dir": project_dir,
         "entry_point": entry_point,
         "evaluators": evaluator_keys,
+        "evaluator_weights": None,
         "optimization_goals": goals,
         "production_project": args.production_project,
         "baseline_experiment": baseline_experiment,
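The metadata handling in the setup.py diff promotes `expected_behavior`, `difficulty`, and `category` into example metadata first, then folds in any explicit `metadata` dict without clobbering them. A sketch of the same merge order (the `build_example_metadata` helper and sample item are hypothetical):

```python
def build_example_metadata(item):
    """Mirror setup.py's merge: promoted fields first, then item['metadata'] merged in."""
    ex = {}
    if "expected_behavior" in item:
        ex.setdefault("metadata", {})["expected_behavior"] = item["expected_behavior"]
    for field in ("difficulty", "category"):
        if field in item:
            ex.setdefault("metadata", {})[field] = item[field]
    # If promoted fields already created the dict, merge instead of replace
    if "metadata" in item and "metadata" not in ex:
        ex["metadata"] = item["metadata"]
    elif "metadata" in item:
        ex["metadata"].update(item["metadata"])
    if "metadata" not in ex:
        ex["metadata"] = {}
    return ex["metadata"]

meta = build_example_metadata({
    "input": "refund a duplicate charge",
    "expected_behavior": "offers refund and apologizes",
    "difficulty": "hard",
    "metadata": {"source": "support-traces"},
})
# All three sources land in one dict:
# {"expected_behavior": "offers refund and apologizes", "difficulty": "hard", "source": "support-traces"}
```

This is the fix the `if ... and "metadata" not in ex` change implements: before it, a top-level `metadata` key on the item would overwrite the just-promoted rubric fields.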