harness-evolver 2.6.1 → 2.8.0

This diff shows the published contents of two package versions as they appear in their public registries, and is provided for informational purposes only.
@@ -40,6 +40,24 @@ These insights are generated from LangSmith traces cross-referenced with per-tas
 
  If trace insights are not available, proceed with manual trace analysis as described in Phase 2.
 
+ ## Production Insights
+
+ If `.harness-evolver/production_seed.json` exists in your `<files_to_read>`, it contains **real production data** from the app's LangSmith project:
+
+ - `categories` — real traffic distribution (which domains/routes get the most queries)
+ - `error_patterns` — actual production errors and their frequency
+ - `negative_feedback_inputs` — queries where users gave thumbs-down
+ - `slow_queries` — high-latency queries that may indicate bottlenecks
+ - `sample_inputs` — real user inputs grouped by category
+
+ Use this data to:
+ 1. **Prioritize changes that fix real production failures** over synthetic test failures
+ 2. **Match the real traffic distribution** — if 60% of production queries are domain A, optimize for domain A
+ 3. **Focus on negative feedback patterns** — these are confirmed bad user experiences
+ 4. **Address latency outliers** — slow queries may need different routing, caching, or model selection
+
+ Production data complements trace_insights.json. Trace insights show what happened in *harness evaluation runs*. Production insights show what happens in *real-world usage*.
+
  ## Context7 — Enrich Your Knowledge
 
  You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
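The field list in the Production Insights section above can be made concrete with a short sketch. This is a hypothetical example of consuming such a seed file: the top-level keys come from the documentation above, but every value (category names, counts, inputs) is invented for illustration.

```python
# Hypothetical production_seed.json contents -- keys are documented above,
# values are invented for illustration.
seed = {
    "categories": {"domain_a": 60, "domain_b": 25, "domain_c": 15},
    "error_patterns": [{"error": "ToolTimeout", "count": 7}],
    "negative_feedback_inputs": ["why is my invoice wrong??"],
    "slow_queries": [{"input": "summarize all 2023 reports", "latency_ms": 42000}],
    "sample_inputs": {"domain_a": ["reset my password", "pwd reset plz"]},
}

# A proposer would rank categories by real traffic so that changes target
# the busiest routes first (guideline 2 above).
ranked = sorted(seed["categories"].items(), key=lambda kv: kv[1], reverse=True)
busiest, share = ranked[0]
```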
@@ -36,6 +36,19 @@ Read the harness source code to understand:
  - What are its likely failure modes?
  - Are there any data files (knowledge bases, docs, etc.) that define the domain?
 
+ ### Phase 1.5: Use Production Traces (if available)
+
+ If your prompt contains a `<production_traces>` block, this is **real data from production LangSmith traces**. This is the most valuable signal you have — real user inputs beat synthetic ones.
+
+ When production traces are available:
+ 1. Read the traffic distribution — generate tasks proportional to real usage (if 60% of queries are domain A, 60% of tasks should cover domain A)
+ 2. Use actual user phrasing as inspiration — real inputs show abbreviations, typos, informal language
+ 3. Base edge cases on real error patterns — the errors listed are genuine failures, not imagined scenarios
+ 4. Prioritize negative feedback traces — these are confirmed bad experiences that MUST be covered
+ 5. Include slow queries as edge cases — high-latency traces may reveal timeout or complexity issues
+
+ **Do NOT just copy production inputs verbatim.** Use them as inspiration to generate VARIATIONS that test the same capabilities.
+
  ### Phase 2: Design Test Distribution
 
  Plan 30 test cases with this distribution:
@@ -44,6 +57,8 @@ Plan 30 test cases with this distribution:
  - **20% Cross-Domain** (6 tasks): inputs spanning multiple categories or requiring nuanced judgment
  - **20% Adversarial** (6 tasks): misleading, ambiguous, or designed to expose weaknesses
 
+ If production traces are available, adjust the distribution to match real traffic patterns instead of keeping it uniform.
+
  Ensure all categories/topics from the harness are covered.
 
  ### Phase 3: Generate Tasks
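Point 1 of Phase 1.5 above (tasks proportional to real usage) amounts to a small allocation routine. A minimal sketch, assuming integer traffic percentages; the category names and shares are invented:

```python
# Allocate a fixed task budget proportionally to observed traffic shares.
# Integer division first, then hand rounding leftovers to the largest
# categories so the total still adds up to the budget.
TOTAL_TASKS = 30
traffic_pct = {"domain_a": 60, "domain_b": 25, "domain_c": 15}  # invented shares

counts = {cat: pct * TOTAL_TASKS // 100 for cat, pct in traffic_pct.items()}
leftover = TOTAL_TASKS - sum(counts.values())
for cat, _ in sorted(traffic_pct.items(), key=lambda kv: kv[1], reverse=True)[:leftover]:
    counts[cat] += 1
```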
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "harness-evolver",
-   "version": "2.6.1",
+   "version": "2.8.0",
    "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
    "author": "Raphael Valdetaro",
    "license": "MIT",
@@ -34,6 +34,27 @@ For each iteration:
  python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
  ```
 
+ ### 1.4. Gather Production Insights (first iteration only)
+
+ On the **first iteration**, if the project has a production LangSmith project configured but no production seed yet, fetch it:
+
+ ```bash
+ PROD_PROJECT=$(python3 -c "
+ import json, os
+ c = json.load(open('.harness-evolver/config.json'))
+ print(c.get('eval', {}).get('production_project', ''))
+ " 2>/dev/null)
+ if [ -n "$PROD_PROJECT" ] && [ ! -f ".harness-evolver/production_seed.json" ] && [ -n "$LANGSMITH_API_KEY" ]; then
+   python3 $TOOLS/seed_from_traces.py \
+     --project "$PROD_PROJECT" \
+     --output-md .harness-evolver/production_seed.md \
+     --output-json .harness-evolver/production_seed.json \
+     --limit 100 2>/dev/null
+ fi
+ ```
+
+ The `production_seed.json` is included in all proposers' `<files_to_read>` so they have real-world context about how the agent is actually used in production.
+
  ### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
 
  **Run these commands unconditionally after EVERY evaluation** (including baseline). Do NOT guess project names — discover them.
@@ -255,6 +276,7 @@ Agent(
  - .harness-evolver/langsmith_stats.json (if exists)
  - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/trace_insights.json (if exists)
+ - .harness-evolver/production_seed.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -295,6 +317,7 @@ Agent(
  - .harness-evolver/langsmith_diagnosis.json (if exists)
  - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/trace_insights.json (if exists)
+ - .harness-evolver/production_seed.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -334,6 +357,7 @@ Agent(
  - .harness-evolver/langsmith_diagnosis.json (if exists)
  - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/trace_insights.json (if exists)
+ - .harness-evolver/production_seed.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -377,6 +401,7 @@ Agent(
  - .harness-evolver/harnesses/{best_version}/scores.json
  - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/trace_insights.json (if exists)
+ - .harness-evolver/production_seed.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -438,6 +463,7 @@ Agent(
  - .harness-evolver/harnesses/{best_version}/scores.json
  - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/trace_insights.json (if exists)
+ - .harness-evolver/production_seed.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -80,11 +80,21 @@ Agent(
  - /home/rp/Desktop/test-crewai/README.md
  </files_to_read>
 
+ <production_traces>
+ {IF .harness-evolver/production_seed.md EXISTS, paste its full contents here.
+ This file contains real production inputs, traffic distribution, error patterns,
+ and user feedback from LangSmith. Use it to generate REALISTIC test cases that
+ match actual usage patterns instead of synthetic ones.
+
+ If the file does not exist, omit this entire block.}
+ </production_traces>
+
  <output>
  Create directory tasks/ (at project root) with 30 files: task_001.json through task_030.json.
  Format: {"id": "task_001", "input": "...", "metadata": {"difficulty": "easy|medium|hard", "type": "standard|edge|cross_domain|adversarial"}}
  No "expected" field needed — the judge subagent will score outputs.
  Distribution: 40% standard, 20% edge, 20% cross-domain, 20% adversarial.
+ If production traces are available, match the real traffic distribution instead of the uniform split.
  </output>
  )
  ```
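The `{IF ... EXISTS}` placeholder in the template above is conditional logic the orchestrator has to perform when assembling the prompt. A minimal sketch of that splice; the helper name and default path are ours, not part of the package:

```python
from pathlib import Path

def production_traces_block(seed_path=".harness-evolver/production_seed.md"):
    """Return a filled <production_traces> block, or an empty string so the
    whole block is omitted when the seed file does not exist."""
    p = Path(seed_path)
    if not p.is_file():
        return ""
    return f"<production_traces>\n{p.read_text()}\n</production_traces>\n"
```

The testgen prompt is then just the base prompt concatenated with `production_traces_block()`.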
@@ -93,16 +103,60 @@ Wait for `## TESTGEN COMPLETE`. If the subagent fails or returns with no tasks,
 
  Print: "Generated {N} test cases from code analysis."
 
+ If `.harness-evolver/production_seed.md` exists, also print:
+ "Tasks enriched with production trace data from LangSmith."
+
  ## Phase 3: Run Init
 
+ First, check if the project has a LangSmith production project configured:
+
+ ```bash
+ # Auto-detect from env vars or .env
+ PROD_PROJECT=$(python3 -c "
+ import os
+ for v in ('LANGCHAIN_PROJECT', 'LANGSMITH_PROJECT'):
+     p = os.environ.get(v, '')
+     if p: print(p); exit()
+ for f in ('.env', '.env.local'):
+     if os.path.exists(f):
+         for line in open(f):
+             line = line.strip()
+             if '=' in line and not line.startswith('#'):
+                 k, _, val = line.partition('=')
+                 if k.strip() in ('LANGCHAIN_PROJECT', 'LANGSMITH_PROJECT'):
+                     print(val.strip().strip('\"').strip(\"'\"))
+                     exit()
+ " 2>/dev/null)
+ ```
+
  ```bash
  python3 $TOOLS/init.py [directory] \
    --harness harness.py --eval eval.py --tasks tasks/ \
-   --tools-dir $TOOLS
+   --tools-dir $TOOLS \
+   ${PROD_PROJECT:+--langsmith-project "$PROD_PROJECT"}
  ```
 
  Add `--harness-config config.json` if a config exists.
 
+ For **LLM-powered agents** that make real API calls (LangGraph, CrewAI, etc.) and take
+ more than 30 seconds per invocation, increase the validation timeout:
+
+ ```bash
+ python3 $TOOLS/init.py [directory] \
+   --harness harness.py --eval eval.py --tasks tasks/ \
+   --tools-dir $TOOLS \
+   --validation-timeout 120
+ ```
+
+ If validation keeps timing out but you've verified the harness works manually, skip it:
+
+ ```bash
+ python3 $TOOLS/init.py [directory] \
+   --harness harness.py --eval eval.py --tasks tasks/ \
+   --tools-dir $TOOLS \
+   --skip-validation
+ ```
+
  ## After Init — Report
 
  - What was detected vs created
@@ -132,3 +186,6 @@ This is advisory only — do not spawn the architect agent.
  - The `expected` field is never shown to the harness — only the eval script sees it.
  - If `.harness-evolver/` already exists, warn before overwriting.
  - If no Python files exist in CWD, the user is probably in the wrong directory.
+ - **Monorepo / venv mismatch**: In monorepos with dedicated venvs per app, the system `python3` may differ from the project's Python version. The harness wrapper should re-exec with the correct venv Python. The tools now use `sys.executable` instead of a hardcoded `python3`.
+ - **Stale site-packages**: If the project uses editable installs (`pip install -e .`), packages in `site-packages/` may have stale copies of data files (e.g. registry YAMLs). Run `uv pip install -e . --force-reinstall --no-deps` to sync.
+ - **Validation timeout**: LLM agents making real API calls typically take 15-60s per invocation. Use `--validation-timeout 120` or `--skip-validation` to handle this.
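The monorepo note above says the harness wrapper should re-exec with the correct venv Python. One way to sketch that pattern; the venv path below is an invented example, not something the package ships:

```python
import os
import sys

# Invented example path -- adjust to the app's actual venv.
VENV_PYTHON = os.path.join("apps", "my-agent", ".venv", "bin", "python")

def needs_reexec(current=None, venv=VENV_PYTHON):
    """True when the venv interpreter exists and we are not already running it."""
    current = current or sys.executable
    return os.path.isfile(venv) and os.path.realpath(current) != os.path.realpath(venv)

if needs_reexec():
    # Replace this process with the venv interpreter, keeping script and args.
    os.execv(VENV_PYTHON, [VENV_PYTHON, *sys.argv])
```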
@@ -472,12 +472,60 @@ def analyze_scores(summary_path):
 
  # --- Main ---
 
+ def analyze_multiple(file_paths):
+     """Analyze multiple Python files and merge their signals.
+
+     Useful in monorepo setups where the harness is a thin wrapper that
+     delegates to the actual agent code. Pass the harness AND the main
+     agent source files for a comprehensive topology classification.
+     """
+     merged = {
+         "llm_call_count": 0,
+         "has_loop_around_llm": False,
+         "has_tool_definitions": False,
+         "has_retrieval": False,
+         "has_graph_framework": False,
+         "has_parallel_execution": False,
+         "has_error_handling": False,
+         "code_lines": 0,
+         "function_count": 0,
+         "class_count": 0,
+         "files_analyzed": [],
+     }
+
+     for path in file_paths:
+         if not os.path.isfile(path):
+             continue
+         try:
+             signals = analyze_code(path)
+         except Exception:
+             continue
+
+         merged["llm_call_count"] += signals.get("llm_call_count", 0)
+         merged["code_lines"] += signals.get("code_lines", 0)
+         merged["function_count"] += signals.get("function_count", 0)
+         merged["class_count"] += signals.get("class_count", 0)
+         merged["files_analyzed"].append(os.path.basename(path))
+
+         for bool_key in ["has_loop_around_llm", "has_tool_definitions", "has_retrieval",
+                          "has_graph_framework", "has_parallel_execution", "has_error_handling"]:
+             if signals.get(bool_key):
+                 merged[bool_key] = True
+
+     merged["estimated_topology"] = _estimate_topology(merged)
+     return merged
+
+
  def main():
      parser = argparse.ArgumentParser(
          description="Analyze harness architecture and produce signals for the architect agent",
-         usage="analyze_architecture.py --harness PATH [--traces-dir PATH] [--summary PATH] [-o output.json]",
+         usage="analyze_architecture.py --harness PATH [--source-files PATH ...] "
+               "[--traces-dir PATH] [--summary PATH] [-o output.json]",
      )
      parser.add_argument("--harness", required=True, help="Path to harness Python file")
+     parser.add_argument("--source-files", nargs="*", default=None,
+                         help="Additional source files to analyze (e.g. the actual agent code). "
+                              "Useful when the harness is a thin wrapper around a larger system.")
      parser.add_argument("--traces-dir", default=None, help="Path to traces directory")
      parser.add_argument("--summary", default=None, help="Path to summary.json")
      parser.add_argument("-o", "--output", default=None, help="Output JSON path")
@@ -487,8 +535,14 @@ def main():
          print(json.dumps({"error": f"Harness file not found: {args.harness}"}))
          sys.exit(1)
 
+     if args.source_files:
+         all_files = [args.harness] + [f for f in args.source_files if os.path.isfile(f)]
+         code_signals = analyze_multiple(all_files)
+     else:
+         code_signals = analyze_code(args.harness)
+
      result = {
-         "code_signals": analyze_code(args.harness),
+         "code_signals": code_signals,
          "trace_signals": None,
          "score_signals": None,
      }
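The merge rule inside `analyze_multiple` (numeric signals add up across files, boolean capability flags OR together) can be seen in isolation; the per-file dicts below are invented stand-ins for `analyze_code()` output:

```python
signals_per_file = [
    {"llm_call_count": 2, "code_lines": 120, "has_retrieval": False, "has_tool_definitions": True},
    {"llm_call_count": 1, "code_lines": 300, "has_retrieval": True, "has_tool_definitions": False},
]

merged = {"llm_call_count": 0, "code_lines": 0, "has_retrieval": False, "has_tool_definitions": False}
for signals in signals_per_file:
    # Counters accumulate across files...
    merged["llm_call_count"] += signals.get("llm_call_count", 0)
    merged["code_lines"] += signals.get("code_lines", 0)
    # ...while capability flags only ever flip to True.
    for flag in ("has_retrieval", "has_tool_definitions"):
        if signals.get(flag):
            merged[flag] = True
```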
package/tools/evaluate.py CHANGED
@@ -2,7 +2,7 @@
  """Evaluation orchestrator for Harness Evolver.
 
  Commands:
-     validate --harness PATH [--config PATH]
+     validate --harness PATH [--config PATH] [--timeout SECONDS]
      run --harness PATH --tasks-dir PATH --eval PATH --traces-dir PATH --scores PATH
          [--config PATH] [--timeout SECONDS]
 
@@ -20,9 +20,23 @@ import tempfile
  import time
 
 
+ def _resolve_python():
+     """Resolve the Python interpreter to use for subprocesses.
+
+     Prefers the current interpreter (sys.executable) over a hardcoded 'python3'.
+     This is critical in monorepo setups where the harness may need a specific
+     venv Python (e.g. Python 3.12) while the system 'python3' is a different
+     version (e.g. 3.14) with incompatible site-packages.
+     """
+     exe = sys.executable
+     if exe and os.path.isfile(exe):
+         return exe
+     return "python3"
+
+
  def _run_harness_on_task(harness, config, task_input_path, output_path, task_traces_dir, timeout, env=None):
      """Run the harness on a single task. Returns (success, elapsed_ms, stdout, stderr)."""
-     cmd = ["python3", harness, "--input", task_input_path, "--output", output_path]
+     cmd = [_resolve_python(), harness, "--input", task_input_path, "--output", output_path]
      if task_traces_dir:
          extra_dir = os.path.join(task_traces_dir, "extra")
          os.makedirs(extra_dir, exist_ok=True)
@@ -48,6 +62,7 @@ def _run_harness_on_task(harness, config, task_input_path, output_path, task_tra
  def cmd_validate(args):
      harness = args.harness
      config = getattr(args, "config", None)
+     timeout = getattr(args, "timeout", 30) or 30
 
      if not os.path.exists(harness):
          print(f"FAIL: harness not found: {harness}", file=sys.stderr)
@@ -61,11 +76,17 @@ def cmd_validate(args):
          json.dump(dummy_task, f)
 
      success, elapsed, stdout, stderr = _run_harness_on_task(
-         harness, config, input_path, output_path, None, timeout=30,
+         harness, config, input_path, output_path, None, timeout=timeout,
      )
 
      if not success:
-         print(f"FAIL: harness exited with error.\nstderr: {stderr}", file=sys.stderr)
+         hint = ""
+         if "TIMEOUT" in stderr:
+             hint = (f"\nHint: validation timed out after {timeout}s. "
+                     "For LLM-powered agents that make real API calls, "
+                     "use --timeout to increase the limit: "
+                     f"evaluate.py validate --harness {harness} --timeout 120")
+         print(f"FAIL: harness exited with error.\nstderr: {stderr}{hint}", file=sys.stderr)
          sys.exit(1)
 
      if not os.path.exists(output_path):
@@ -171,7 +192,7 @@ def cmd_run(args):
          f.write("\n".join(all_stderr))
 
      eval_cmd = [
-         "python3", eval_script,
+         _resolve_python(), eval_script,
          "--results-dir", results_dir,
          "--tasks-dir", tasks_dir,
          "--scores", scores_path,
@@ -195,6 +216,9 @@ def main():
      p_val = sub.add_parser("validate")
      p_val.add_argument("--harness", required=True)
      p_val.add_argument("--config", default=None)
+     p_val.add_argument("--timeout", type=int, default=30,
+                        help="Validation timeout in seconds (default: 30). "
+                             "Increase for LLM-powered agents that make real API calls.")
 
      p_run = sub.add_parser("run")
      p_run.add_argument("--harness", required=True)
package/tools/init.py CHANGED
@@ -124,6 +124,40 @@ def _detect_langsmith():
      return {"enabled": False}
 
 
+ def _detect_langsmith_project(search_dir="."):
+     """Auto-detect the app's existing LangSmith project name.
+
+     Checks (in order):
+     1. LANGCHAIN_PROJECT env var (standard LangChain convention)
+     2. LANGSMITH_PROJECT env var (alternative)
+     3. .env file in the project directory
+     """
+     for var in ("LANGCHAIN_PROJECT", "LANGSMITH_PROJECT"):
+         project = os.environ.get(var)
+         if project:
+             return project
+
+     # Parse .env file
+     for env_name in (".env", ".env.local"):
+         env_path = os.path.join(search_dir, env_name)
+         if os.path.exists(env_path):
+             try:
+                 with open(env_path) as f:
+                     for line in f:
+                         line = line.strip()
+                         if line.startswith("#") or "=" not in line:
+                             continue
+                         key, _, val = line.partition("=")
+                         key = key.strip()
+                         val = val.strip().strip("'\"")
+                         if key in ("LANGCHAIN_PROJECT", "LANGSMITH_PROJECT") and val:
+                             return val
+             except OSError:
+                 pass
+
+     return None
+
+
  def _check_langsmith_cli():
      """Check if langsmith-cli is installed."""
      try:
@@ -134,6 +168,19 @@ def _check_langsmith_cli():
      return False
 
 
+ def _resolve_python():
+     """Resolve the Python interpreter for subprocesses.
+
+     Uses the current interpreter (sys.executable) instead of a hardcoded 'python3'.
+     This prevents version mismatches in monorepo setups where the harness may
+     need a specific venv Python different from the system python3.
+     """
+     exe = sys.executable
+     if exe and os.path.isfile(exe):
+         return exe
+     return "python3"
+
+
  def _detect_stack(harness_path):
      """Detect technology stack from harness imports."""
      detect_stack_py = os.path.join(os.path.dirname(__file__), "detect_stack.py")
@@ -141,7 +188,7 @@ def _detect_stack(harness_path):
          return {}
      try:
          r = subprocess.run(
-             ["python3", detect_stack_py, harness_path],
+             [_resolve_python(), detect_stack_py, harness_path],
              capture_output=True, text=True, timeout=30,
          )
          if r.returncode == 0 and r.stdout.strip():
@@ -183,6 +230,15 @@ def main():
      parser.add_argument("--base-dir", default=None, help="Path for .harness-evolver/")
      parser.add_argument("--harness-config", default=None, help="Path to harness config.json")
      parser.add_argument("--tools-dir", default=None, help="Path to tools directory")
+     parser.add_argument("--validation-timeout", type=int, default=30,
+                         help="Timeout for harness validation in seconds (default: 30). "
+                              "Increase for LLM-powered agents that make real API calls.")
+     parser.add_argument("--skip-validation", action="store_true",
+                         help="Skip harness validation step. Use when you know the harness "
+                              "works but validation times out (e.g. real LLM agent calls).")
+     parser.add_argument("--langsmith-project", default=None,
+                         help="Existing LangSmith project name with production traces. "
+                              "Auto-detected from LANGCHAIN_PROJECT / LANGSMITH_PROJECT env vars or .env file.")
      args = parser.parse_args()
 
      # Auto-detect missing args
@@ -261,6 +317,7 @@ def main():
              "args": ["--results-dir", "{results_dir}", "--tasks-dir", "{tasks_dir}",
                       "--scores", "{scores}"],
              "langsmith": _detect_langsmith(),
+             "production_project": args.langsmith_project or _detect_langsmith_project(search_dir),
          },
          "evolution": {
              "max_iterations": 10,
@@ -309,7 +366,7 @@ def main():
      if os.path.exists(detect_stack_py):
          try:
              r = subprocess.run(
-                 ["python3", detect_stack_py, harness_dir],
+                 [_resolve_python(), detect_stack_py, harness_dir],
                  capture_output=True, text=True, timeout=30,
              )
              if r.returncode == 0 and r.stdout.strip():
@@ -338,7 +395,7 @@ def main():
      if os.path.exists(analyze_py):
          try:
              r = subprocess.run(
-                 ["python3", analyze_py, "--harness", args.harness],
+                 [_resolve_python(), analyze_py, "--harness", args.harness],
                  capture_output=True, text=True, timeout=30,
              )
              if r.returncode == 0 and r.stdout.strip():
@@ -356,31 +413,65 @@ def main():
      except Exception:
          pass
 
+     # 4.5 Fetch production traces seed (if LangSmith production project detected)
+     prod_project = config["eval"].get("production_project")
+     if prod_project and os.environ.get("LANGSMITH_API_KEY"):
+         seed_py = os.path.join(tools, "seed_from_traces.py")
+         if os.path.exists(seed_py):
+             print(f"Fetching production traces from LangSmith project '{prod_project}'...")
+             try:
+                 r = subprocess.run(
+                     [_resolve_python(), seed_py,
+                      "--project", prod_project,
+                      "--output-md", os.path.join(base, "production_seed.md"),
+                      "--output-json", os.path.join(base, "production_seed.json"),
+                      "--limit", "100"],
+                     capture_output=True, text=True, timeout=60,
+                 )
+                 if r.returncode == 0:
+                     print(r.stdout.strip())
+                 else:
+                     print(f"  Could not fetch production traces: {r.stderr.strip()[:200]}")
+             except Exception as e:
+                 print(f"  Production trace fetch failed: {e}")
+     elif prod_project:
+         print(f"Production LangSmith project detected: {prod_project}")
+         print("  Set LANGSMITH_API_KEY to auto-fetch production traces during init.")
+
      # 5. Validate baseline harness
-     print("Validating baseline harness...")
-     val_args = ["python3", evaluate_py, "validate",
-                 "--harness", os.path.join(base, "baseline", "harness.py")]
      config_path = os.path.join(base, "baseline", "config.json")
-     if os.path.exists(config_path):
-         val_args.extend(["--config", config_path])
-     r = subprocess.run(val_args, capture_output=True, text=True)
-     if r.returncode != 0:
-         print(f"FAIL: baseline harness validation failed.\n{r.stderr}", file=sys.stderr)
-         sys.exit(1)
-     print(r.stdout.strip())
+     if args.skip_validation:
+         print("Skipping baseline validation (--skip-validation).")
+     else:
+         print(f"Validating baseline harness (timeout: {args.validation_timeout}s)...")
+         val_args = [_resolve_python(), evaluate_py, "validate",
+                     "--harness", os.path.join(base, "baseline", "harness.py"),
+                     "--timeout", str(args.validation_timeout)]
+         if os.path.exists(config_path):
+             val_args.extend(["--config", config_path])
+         r = subprocess.run(val_args, capture_output=True, text=True)
+         if r.returncode != 0:
+             hint = ""
+             if "TIMEOUT" in r.stderr:
+                 hint = (f"\n\nHint: The harness timed out after {args.validation_timeout}s. "
+                         "This is common for LLM-powered agents that make real API calls.\n"
+                         "Try: --validation-timeout 120 (or --skip-validation to bypass)")
+             print(f"FAIL: baseline harness validation failed.\n{r.stderr}{hint}", file=sys.stderr)
+             sys.exit(1)
+         print(r.stdout.strip())
 
      # 6. Evaluate baseline
      print("Evaluating baseline harness...")
      baseline_traces = tempfile.mkdtemp()
      baseline_scores = os.path.join(base, "baseline_scores.json")
      eval_args = [
-         "python3", evaluate_py, "run",
+         _resolve_python(), evaluate_py, "run",
          "--harness", os.path.join(base, "baseline", "harness.py"),
          "--tasks-dir", os.path.join(base, "eval", "tasks"),
          "--eval", os.path.join(base, "eval", "eval.py"),
          "--traces-dir", baseline_traces,
          "--scores", baseline_scores,
-         "--timeout", "60",
+         "--timeout", str(max(args.validation_timeout, 60)),
      ]
      if os.path.exists(config_path):
          eval_args.extend(["--config", config_path])
@@ -399,7 +490,7 @@ def main():
      # 7. Initialize state with baseline score
      print(f"Baseline score: {baseline_score:.2f}")
      r = subprocess.run(
-         ["python3", state_py, "init",
+         [_resolve_python(), state_py, "init",
           "--base-dir", base,
           "--baseline-score", str(baseline_score)],
          capture_output=True, text=True,
@@ -0,0 +1,454 @@
1
+ #!/usr/bin/env python3
2
+ """Fetch and summarize production LangSmith traces for Harness Evolver.
3
+
4
+ Queries the LangSmith REST API directly (urllib, stdlib-only) to fetch
5
+ production traces and produce:
6
+ 1. A markdown seed file for the testgen agent (production_seed.md)
7
+ 2. A JSON summary for programmatic use (production_seed.json)
8
+
9
+ Usage:
10
+ python3 seed_from_traces.py \
11
+ --project ceppem-langgraph \
12
+ --output-md .harness-evolver/production_seed.md \
13
+ --output-json .harness-evolver/production_seed.json \
14
+ [--api-key-env LANGSMITH_API_KEY] \
15
+ [--limit 100]
16
+
17
+ Stdlib-only. No external dependencies (no langsmith-cli needed).
18
+ """
19
+
20
+ import argparse
21
+ import json
22
+ import os
23
+ import sys
24
+ import urllib.parse
25
+ import urllib.request
26
+ from collections import Counter
27
+ from datetime import datetime, timezone
28
+
29
+ LANGSMITH_API_BASE = "https://api.smith.langchain.com/api/v1"
30
+
31
+
32
+ def langsmith_request(endpoint, api_key, method="GET", body=None, params=None):
33
+ """Make a request to the LangSmith REST API."""
34
+ url = f"{LANGSMITH_API_BASE}/{endpoint}"
35
+ if params:
36
+ url += "?" + urllib.parse.urlencode(params)
37
+
38
+ headers = {
39
+ "x-api-key": api_key,
40
+ "Accept": "application/json",
41
+ }
42
+
43
+ data = None
44
+ if body is not None:
45
+ headers["Content-Type"] = "application/json"
46
+ data = json.dumps(body).encode("utf-8")
47
+
48
+ req = urllib.request.Request(url, data=data, headers=headers, method=method)
49
+ try:
50
+ with urllib.request.urlopen(req, timeout=30) as resp:
51
+ return json.loads(resp.read())
52
+ except urllib.error.HTTPError as e:
53
+ body_text = ""
54
+ try:
55
+ body_text = e.read().decode("utf-8", errors="replace")[:500]
56
+ except Exception:
57
+ pass
58
+ print(f"LangSmith API error {e.code}: {body_text}", file=sys.stderr)
59
+ return None
60
+ except Exception as e:
61
+ print(f"LangSmith API request failed: {e}", file=sys.stderr)
62
+ return None
63
+
64
+
65
+ def fetch_runs(project_name, api_key, limit=100):
66
+ """Fetch recent root runs from a LangSmith project."""
67
+ # Try POST /runs/query first (newer API)
68
+ body = {
69
+ "project_name": project_name,
70
+ "is_root": True,
71
+ "limit": limit,
72
+ }
73
+ result = langsmith_request("runs/query", api_key, method="POST", body=body)
74
+ if result and isinstance(result, dict):
75
+ return result.get("runs", result.get("results", []))
76
+ if result and isinstance(result, list):
77
+ return result
78
+
79
+ # Fallback: GET /runs with query params
80
+ params = {
81
+ "project_name": project_name,
82
+ "is_root": "true",
83
+ "limit": str(limit),
84
+ }
85
+ result = langsmith_request("runs", api_key, params=params)
86
+ if result and isinstance(result, list):
87
+ return result
88
+ if result and isinstance(result, dict):
89
+ return result.get("runs", result.get("results", []))
90
+
91
+ return []
92
+
93
+
94
+ def extract_input(run):
95
+ """Extract user input from a run's inputs field."""
96
+ inputs = run.get("inputs", {})
97
+ if not inputs:
98
+ return None
99
+ if isinstance(inputs, str):
100
+ return inputs
101
+
102
+ # Direct field
103
+ for key in ("input", "question", "query", "prompt", "text", "user_input"):
104
+ if key in inputs and isinstance(inputs[key], str):
105
+ return inputs[key]
106
+
107
+ # LangChain messages format
108
+ messages = inputs.get("messages") or inputs.get("input")
109
+ if isinstance(messages, list):
110
+ if messages and isinstance(messages[0], list):
111
+ messages = messages[0]
112
+ for msg in messages:
113
+ if isinstance(msg, dict):
114
+ if msg.get("type") in ("human", "HumanMessage") or msg.get("role") == "user":
115
+ content = msg.get("content", "")
116
+ if isinstance(content, str) and content:
117
+ return content
118
+ if isinstance(content, list):
119
+ for part in content:
120
+ if isinstance(part, dict) and part.get("type") == "text":
121
+ return part.get("text", "")
122
+ elif isinstance(msg, str) and msg:
123
+ return msg
124
+
125
+ return None
126
+
127
+
128
+ def extract_output(run):
129
+ """Extract the output/response from a run."""
130
+ outputs = run.get("outputs", {})
131
+ if not outputs:
132
+ return None
133
+ if isinstance(outputs, str):
134
+ return outputs
135
+
136
+ for key in ("output", "answer", "result", "response", "text"):
137
+ if key in outputs and isinstance(outputs[key], str):
138
+ return outputs[key]
139
+
140
+ # LangChain messages format
141
+ messages = outputs.get("messages") or outputs.get("output")
142
+ if isinstance(messages, list):
143
+ if messages and isinstance(messages[0], list):
144
+ messages = messages[0]
145
+ for msg in reversed(messages):
146
+ if isinstance(msg, dict):
147
+ if msg.get("type") in ("ai", "AIMessage", "assistant") or msg.get("role") == "assistant":
148
+ content = msg.get("content", "")
149
+ if isinstance(content, str) and content:
150
+ return content
151
+ elif isinstance(msg, str) and msg:
152
+ return msg
153
+
154
+ return None
155
+
156
+
157
+ def get_feedback(run):
+     """Extract feedback from a run."""
+     fb = run.get("feedback_stats") or {}
+     if isinstance(fb, dict):
+         pos = fb.get("thumbs_up", 0) or fb.get("positive", 0) or 0
+         neg = fb.get("thumbs_down", 0) or fb.get("negative", 0) or 0
+         if neg > 0:
+             return "negative"
+         if pos > 0:
+             return "positive"
+     return None
+
+
+ def categorize_run(run):
+     """Categorize a run by its name/type."""
+     name = run.get("name", "unknown")
+     # Use the top-level run name as the category
+     return name
+
+
+ def analyze_runs(runs):
+     """Analyze a batch of runs and produce structured insights."""
+     if not runs:
+         return None
+
+     processed = []
+     categories = Counter()
+     errors = []
+     latencies = []
+     token_counts = []
+     feedbacks = {"positive": 0, "negative": 0, "none": 0}
+
+     for run in runs:
+         user_input = extract_input(run)
+         output = extract_output(run)
+         error = run.get("error")
+         tokens = run.get("total_tokens") or 0
+         latency_ms = None
+         feedback = get_feedback(run)
+
+         # Calculate latency from start/end times
+         start = run.get("start_time") or run.get("start_dt")
+         end = run.get("end_time") or run.get("end_dt")
+         if isinstance(start, str) and isinstance(end, str):
+             try:
+                 from datetime import datetime as dt
+                 s = dt.fromisoformat(start.replace("Z", "+00:00"))
+                 e = dt.fromisoformat(end.replace("Z", "+00:00"))
+                 latency_ms = int((e - s).total_seconds() * 1000)
+             except Exception:
+                 pass
+         elif run.get("latency"):
+             latency_ms = int(run["latency"] * 1000) if isinstance(run["latency"], float) else run["latency"]
+
+         category = categorize_run(run)
+         categories[category] += 1
+
+         entry = {
+             "input": (user_input or "")[:500],
+             "output": (output or "")[:300],
+             "category": category,
+             "tokens": tokens,
+             "latency_ms": latency_ms,
+             "error": (error or "")[:200] if error else None,
+             "feedback": feedback,
+         }
+         processed.append(entry)
+
+         if error:
+             errors.append({"error": error[:200], "input": (user_input or "")[:200], "category": category})
+         if latency_ms:
+             latencies.append(latency_ms)
+         if tokens:
+             token_counts.append(tokens)
+
+         if feedback == "positive":
+             feedbacks["positive"] += 1
+         elif feedback == "negative":
+             feedbacks["negative"] += 1
+         else:
+             feedbacks["none"] += 1
+
+     # Compute statistics
+     stats = {
+         "total_traces": len(runs),
+         "with_input": sum(1 for p in processed if p["input"]),
+         "with_error": len(errors),
+         "error_rate": len(errors) / max(len(runs), 1),
+         "feedback": feedbacks,
+     }
+
+     if latencies:
+         latencies.sort()
+         stats["latency"] = {
+             "avg_ms": int(sum(latencies) / len(latencies)),
+             "p50_ms": latencies[len(latencies) // 2],
+             "p95_ms": latencies[int(len(latencies) * 0.95)] if len(latencies) >= 20 else latencies[-1],
+             "max_ms": latencies[-1],
+         }
+
+     if token_counts:
+         stats["tokens"] = {
+             "avg": int(sum(token_counts) / len(token_counts)),
+             "max": max(token_counts),
+             "total": sum(token_counts),
+         }
+
+     # Group by category
+     by_category = {}
+     for entry in processed:
+         cat = entry["category"]
+         by_category.setdefault(cat, []).append(entry)
+
+     # Error patterns
+     error_patterns = Counter()
+     for e in errors:
+         # Normalize each error to its first 60 chars
+         pattern = e["error"][:60]
+         error_patterns[pattern] += 1
+
+     return {
+         "stats": stats,
+         "categories": dict(categories.most_common()),
+         "by_category": by_category,
+         "error_patterns": dict(error_patterns.most_common(10)),
+         "errors": errors[:20],
+         "processed": processed,
+     }
+
+
+ def generate_markdown_seed(analysis, project_name):
+     """Generate a markdown seed file for the testgen agent."""
+     stats = analysis["stats"]
+     lines = [
+         f"# Production Trace Analysis: {project_name}",
+         "",
+         f"*{stats['total_traces']} traces analyzed*",
+         "",
+         "## Key Metrics",
+         "",
+         f"- **Error rate**: {stats['error_rate']:.1%}",
+     ]
+
+     if "latency" in stats:
+         lat = stats["latency"]
+         lines.append(f"- **Latency**: {lat['avg_ms']}ms avg, {lat['p50_ms']}ms p50, {lat['p95_ms']}ms p95")
+
+     if "tokens" in stats:
+         tok = stats["tokens"]
+         lines.append(f"- **Tokens**: {tok['avg']} avg, {tok['max']} max")
+
+     fb = stats["feedback"]
+     total_fb = fb["positive"] + fb["negative"]
+     if total_fb > 0:
+         lines.append(f"- **User feedback**: {fb['positive']}/{total_fb} positive ({fb['positive']/total_fb:.0%})")
+
+     # Traffic distribution
+     lines.extend(["", "## Traffic Distribution", ""])
+     total = stats["total_traces"]
+     for cat, count in sorted(analysis["categories"].items(), key=lambda x: -x[1]):
+         pct = count / max(total, 1) * 100
+         lines.append(f"- **{cat}**: {count} traces ({pct:.0f}%)")
+
+     # Sample inputs by category
+     lines.extend(["", "## Sample Inputs by Category", ""])
+     for cat, entries in sorted(analysis["by_category"].items(), key=lambda x: -len(x[1])):
+         lines.append(f"### {cat} ({len(entries)} traces)")
+         lines.append("")
+         # Show up to 8 sample inputs per category, skipping entries with no input
+         shown = 0
+         for entry in entries:
+             if shown >= 8:
+                 break
+             if not entry["input"]:
+                 continue
+             status = "ERROR" if entry["error"] else "ok"
+             tok_str = f", {entry['tokens']}tok" if entry["tokens"] else ""
+             lat_str = f", {entry['latency_ms']}ms" if entry["latency_ms"] else ""
+             fb_str = ""
+             if entry["feedback"] == "negative":
+                 fb_str = " [NEGATIVE FEEDBACK]"
+             elif entry["feedback"] == "positive":
+                 fb_str = " [+]"
+             lines.append(f'- "{entry["input"][:150]}" ({status}{tok_str}{lat_str}){fb_str}')
+             shown += 1
+         lines.append("")
+
+     # Error patterns
+     if analysis["error_patterns"]:
+         lines.extend(["## Error Patterns", ""])
+         for pattern, count in analysis["error_patterns"].items():
+             lines.append(f"- **{pattern}**: {count} occurrences")
+         lines.append("")
+
+     # Negative feedback traces
+     neg_traces = [e for e in analysis["processed"] if e["feedback"] == "negative" and e["input"]]
+     if neg_traces:
+         lines.extend(["## Traces with Negative Feedback (high priority)", ""])
+         for entry in neg_traces[:10]:
+             lines.append(f'- "{entry["input"][:200]}" → category: {entry["category"]}')
+         lines.append("")
+
+     # Guidance for testgen
+     lines.extend([
+         "## Guidance for Test Generation",
+         "",
+         "Use the above data to generate test cases that:",
+         "1. **Match the real traffic distribution** — generate more tasks for high-traffic categories",
+         "2. **Include actual user phrasing** — real inputs show how users actually communicate (informal, abbreviations, typos)",
+         "3. **Cover real error patterns** — the errors above are genuine failure modes, not imagined scenarios",
+         "4. **Prioritize negative feedback traces** — these are confirmed bad experiences",
+         "5. **Include slow queries as edge cases** — high-latency traces may reveal timeout or complexity issues",
+     ])
+
+     return "\n".join(lines)
+
+
+ def generate_json_summary(analysis, project_name):
+     """Generate a JSON summary for programmatic use."""
+     return {
+         "project": project_name,
+         "generated_at": datetime.now(timezone.utc).isoformat(),
+         "stats": analysis["stats"],
+         "categories": analysis["categories"],
+         "error_patterns": analysis["error_patterns"],
+         "sample_inputs": {
+             cat: [e["input"] for e in entries if e["input"]][:10]
+             for cat, entries in analysis["by_category"].items()
+         },
+         "negative_feedback_inputs": [
+             e["input"] for e in analysis["processed"]
+             if e["feedback"] == "negative" and e["input"]
+         ][:20],
+         "slow_queries": [
+             {"input": e["input"][:200], "latency_ms": e["latency_ms"], "category": e["category"]}
+             for e in sorted(analysis["processed"], key=lambda x: -(x["latency_ms"] or 0))
+             if e["latency_ms"] and e["input"]
+         ][:10],
+     }
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Fetch and summarize production LangSmith traces")
+     parser.add_argument("--project", required=True, help="LangSmith project name")
+     parser.add_argument("--api-key-env", default="LANGSMITH_API_KEY",
+                         help="Env var containing API key (default: LANGSMITH_API_KEY)")
+     parser.add_argument("--limit", type=int, default=100, help="Max traces to fetch (default: 100)")
+     parser.add_argument("--output-md", required=True, help="Output path for markdown seed")
+     parser.add_argument("--output-json", required=True, help="Output path for JSON summary")
+     args = parser.parse_args()
+
+     api_key = os.environ.get(args.api_key_env, "")
+     if not api_key:
+         print(f"No API key found in ${args.api_key_env} — cannot fetch production traces", file=sys.stderr)
+         sys.exit(1)
+
+     print(f"Fetching up to {args.limit} traces from LangSmith project '{args.project}'...")
+     runs = fetch_runs(args.project, api_key, args.limit)
+
+     if not runs:
+         print("No traces found. The project may be empty or the name may be wrong.")
+         # Write empty files so downstream steps don't break
+         for path in (args.output_md, args.output_json):
+             os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
+         with open(args.output_md, "w") as f:
+             f.write(f"# Production Trace Analysis: {args.project}\n\nNo traces found.\n")
+         with open(args.output_json, "w") as f:
+             json.dump({"project": args.project, "stats": {"total_traces": 0}}, f, indent=2)
+         return
+
+     print(f"Fetched {len(runs)} traces. Analyzing...")
+     analysis = analyze_runs(runs)
+
+     if not analysis:
+         print("Analysis failed — no processable traces")
+         return
+
+     # Write markdown seed
+     os.makedirs(os.path.dirname(args.output_md) or ".", exist_ok=True)
+     md = generate_markdown_seed(analysis, args.project)
+     with open(args.output_md, "w") as f:
+         f.write(md)
+
+     # Write JSON summary
+     os.makedirs(os.path.dirname(args.output_json) or ".", exist_ok=True)
+     summary = generate_json_summary(analysis, args.project)
+     with open(args.output_json, "w") as f:
+         json.dump(summary, f, indent=2, ensure_ascii=False)
+
+     stats = analysis["stats"]
+     cats = len(analysis["categories"])
+     errs = stats["with_error"]
+     print("Production seed generated:")
+     print(f"  {stats['total_traces']} traces, {cats} categories, {errs} errors ({stats['error_rate']:.1%})")
+     print(f"  {args.output_md}")
+     print(f"  {args.output_json}")
+
+
+ if __name__ == "__main__":
+     main()
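
The start/end-time latency calculation used inside `analyze_runs` can be exercised in isolation. A minimal sketch with made-up ISO-8601 timestamps (the `Z` suffix is normalized to `+00:00` before parsing, exactly as the script does):

```python
from datetime import datetime

# Hypothetical run timestamps, 2.5 seconds apart
start = "2024-05-01T12:00:00Z"
end = "2024-05-01T12:00:02.500000Z"

# Normalize the trailing "Z" so datetime.fromisoformat accepts the string
s = datetime.fromisoformat(start.replace("Z", "+00:00"))
e = datetime.fromisoformat(end.replace("Z", "+00:00"))
latency_ms = int((e - s).total_seconds() * 1000)
print(latency_ms)  # 2500
```

The `.replace("Z", "+00:00")` step matters on Python versions before 3.11, where `datetime.fromisoformat` does not accept the `Z` suffix directly.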