
harness-evolver 3.1.1 → 3.2.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (13)

package/.claude-plugin/plugin.json +20 -0
package/README.md +50 -9
package/agents/evolver-evaluator.md +2 -2
package/agents/evolver-proposer.md +2 -1
package/hooks/hooks.json +15 -0
package/hooks/session-start.sh +71 -0
package/package.json +4 -2
package/skills/evolve/SKILL.md +4 -3
package/skills/setup/SKILL.md +11 -5
package/tools/read_results.py +14 -1
package/tools/run_eval.py +33 -6
package/tools/setup.py +2 -0
package/tools/trace_insights.py +37 -0

package/.claude-plugin/plugin.json ADDED

@@ -0,0 +1,20 @@
+{
+  "name": "harness-evolver",
+  "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
+  "version": "3.2.1",
+  "author": {
+    "name": "Raphael Valdetaro"
+  },
+  "homepage": "https://github.com/raphaelchristi/harness-evolver",
+  "repository": "https://github.com/raphaelchristi/harness-evolver",
+  "license": "MIT",
+  "keywords": [
+    "langsmith",
+    "optimization",
+    "evolution",
+    "llm",
+    "agent",
+    "evaluator",
+    "langsmith-cli"
+  ]
+}

package/README.md CHANGED

@@ -19,11 +19,24 @@ Inspired by [Meta-Harness](https://yoonholee.com/meta-harness/) (Lee et al., 202
 ## Install
+### Claude Code Plugin (recommended)
+```
+/plugin marketplace add raphaelchristi/harness-evolver-marketplace
+/plugin install harness-evolver
+```
+Updates are automatic. Python dependencies (langsmith, langsmith-cli) are installed on first session start via hook.
+### npx (first-time setup or non-Claude Code runtimes)
 ```bash
 npx harness-evolver@latest
 ```
-> Works with Claude Code, Cursor, Codex, and Windsurf. Requires LangSmith account + API key.
+Interactive installer that configures LangSmith API key, creates Python venv, and installs all dependencies. Works with Claude Code, Cursor, Codex, and Windsurf.
+> **Both install paths work together.** Use npx for initial setup (API key, venv), then the plugin marketplace handles updates automatically.
 ---
@@ -58,6 +71,10 @@ claude
 <td>Each iteration spawns 5 parallel agents: exploit, explore, crossover, and 2 failure-targeted. Strategies adapt based on per-task analysis. Quality-diversity selection preserves per-task champions.</td>
 </tr>
 <tr>
+<td><b>Agent-Based Evaluation</b></td>
+<td>The evaluator agent reads experiment outputs via langsmith-cli, judges correctness using the same Claude model powering the other agents, and writes scores back. No OpenAI API key or openevals dependency needed.</td>
+</tr>
+<tr>
 <td><b>Production Traces</b></td>
 <td>Auto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization.</td>
 </tr>
@@ -89,10 +106,10 @@ claude
 | Agent | Role | Color |
 |---|---|---|
 | **Proposer** | Modifies agent code in isolated worktrees based on trace analysis | Green |
+| **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
 | **Architect** | Recommends multi-agent topology changes | Blue |
 | **Critic** | Validates evaluator quality, detects gaming | Red |
 | **TestGen** | Generates test inputs for LangSmith datasets | Cyan |
-| **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
 ---
@@ -118,19 +135,43 @@ claude
 ---
+## Architecture
+```
+Plugin hook (SessionStart)
+  └→ Creates venv, installs langsmith + langsmith-cli, exports env vars
+Skills (markdown)
+  ├── /evolver:setup    → explores project, runs setup.py
+  ├── /evolver:evolve   → orchestrates the evolution loop
+  ├── /evolver:status   → reads .evolver.json + LangSmith
+  └── /evolver:deploy   → tags and pushes
+Agents (markdown)
+  ├── Proposer (x5)     → modifies code in git worktrees
+  ├── Evaluator          → LLM-as-judge via langsmith-cli
+  ├── Critic             → detects evaluator gaming
+  ├── Architect          → recommends topology changes
+  └── TestGen            → generates test inputs
+Tools (Python + langsmith SDK)
+  ├── setup.py           → creates datasets, configures evaluators
+  ├── run_eval.py        → runs target against dataset
+  ├── read_results.py    → compares experiments
+  ├── trace_insights.py  → clusters errors from traces
+  └── seed_from_traces.py → imports production traces
+```
+---
 ## Requirements
 - **LangSmith account** + `LANGSMITH_API_KEY`
-- **Python 3.10+** with `langsmith` package
-- **langsmith-cli** (`uv tool install langsmith-cli`) — required for evaluator agent
+- **Python 3.10+**
 - **Git** (for worktree-based isolation)
 - **Claude Code** (or Cursor/Codex/Windsurf)
-```bash
-export LANGSMITH_API_KEY="lsv2_pt_..."
-pip install langsmith
-uv tool install langsmith-cli
-```
+Dependencies (`langsmith`, `langsmith-cli`) are installed automatically by the plugin hook or the npx installer.
 ---

package/agents/evolver-evaluator.md CHANGED

@@ -37,7 +37,7 @@ You interact with LangSmith exclusively through `langsmith-cli`. Always use `--j
 langsmith-cli --json runs list \
     --project "{experiment_name}" \
     --fields id,inputs,outputs,error,reference_example_id \
-    --is-root \
+    --is-root true \
     --limit 200
 ```
@@ -72,7 +72,7 @@ Fetch all runs from the experiment. Save the output to a file for reference:
 langsmith-cli --json runs list \
     --project "{experiment_name}" \
     --fields id,inputs,outputs,error,reference_example_id \
-    --is-root --limit 200 \
+    --is-root true --limit 200 \
     --output experiment_runs.jsonl
 ```

package/agents/evolver-proposer.md CHANGED

@@ -97,9 +97,10 @@ Ask about the SPECIFIC API you're going to use or change.
 1. **Commit all changes** with a descriptive message:
    ```bash
-   git add -A
+   git add -A -- ':!.venv' ':!venv' ':!node_modules'
    git commit -m "evolver: {brief description of changes}"
    ```
+   **CRITICAL**: Never commit `.venv`, `venv`, or `node_modules`. Symlinks to these in worktrees will break the main branch if merged.
 2. **Write proposal.md** explaining:
    - What you changed and why

package/hooks/hooks.json ADDED

@@ -0,0 +1,15 @@
+{
+  "description": "Harness Evolver — ensures Python deps and env vars are available each session",
+  "hooks": {
+    "SessionStart": [
+      {
+        "hooks": [
+          {
+            "type": "command",
+            "command": "bash \"${CLAUDE_PLUGIN_ROOT}/hooks/session-start.sh\""
+          }
+        ]
+      }
+    ]
+  }
+}

package/hooks/session-start.sh ADDED

@@ -0,0 +1,71 @@
+#!/usr/bin/env bash
+# Harness Evolver — SessionStart hook
+# Ensures Python venv, langsmith, langsmith-cli, and env vars are ready.
+# Runs silently on every session start. Installs deps only if missing.
+set -euo pipefail
+# Resolve paths — plugin root is set by Claude Code
+PLUGIN_ROOT="${CLAUDE_PLUGIN_ROOT:-}"
+PLUGIN_DATA="${CLAUDE_PLUGIN_DATA:-}"
+# Fallback: if running outside plugin system (npx install), use legacy paths
+if [ -z "$PLUGIN_ROOT" ]; then
+    PLUGIN_ROOT="$HOME/.evolver"
+    PLUGIN_DATA="$HOME/.evolver"
+fi
+TOOLS_DIR="$PLUGIN_ROOT/tools"
+VENV_DIR="$PLUGIN_DATA/venv"
+VENV_PY="$VENV_DIR/bin/python"
+# --- 1. Create venv if missing ---
+if [ ! -f "$VENV_PY" ]; then
+    if command -v uv >/dev/null 2>&1; then
+        uv venv "$VENV_DIR" >/dev/null 2>&1
+    else
+        python3 -m venv "$VENV_DIR" >/dev/null 2>&1
+    fi
+fi
+# --- 2. Install langsmith if missing ---
+if [ -f "$VENV_PY" ]; then
+    "$VENV_PY" -c "import langsmith" 2>/dev/null || {
+        if command -v uv >/dev/null 2>&1; then
+            uv pip install --python "$VENV_PY" langsmith >/dev/null 2>&1
+        else
+            "$VENV_DIR/bin/pip" install --upgrade langsmith >/dev/null 2>&1 || \
+            "$VENV_PY" -m pip install --upgrade langsmith >/dev/null 2>&1
+        fi
+    }
+fi
+# --- 3. Install langsmith-cli if missing ---
+command -v langsmith-cli >/dev/null 2>&1 || {
+    if command -v uv >/dev/null 2>&1; then
+        uv tool install langsmith-cli >/dev/null 2>&1
+    else
+        pip install langsmith-cli >/dev/null 2>&1 || pip3 install langsmith-cli >/dev/null 2>&1
+    fi
+} || true
+# --- 4. Load API key from credentials file if not in env ---
+if [ -z "${LANGSMITH_API_KEY:-}" ]; then
+    if [ "$(uname)" = "Darwin" ]; then
+        CREDS="$HOME/Library/Application Support/langsmith-cli/credentials"
+    else
+        CREDS="$HOME/.config/langsmith-cli/credentials"
+    fi
+    if [ -f "$CREDS" ]; then
+        KEY=$(grep '^LANGSMITH_API_KEY=' "$CREDS" 2>/dev/null | head -1 | cut -d= -f2-)
+        if [ -n "$KEY" ]; then
+            echo "export LANGSMITH_API_KEY=\"$KEY\"" >> "$CLAUDE_ENV_FILE"
+        fi
+    fi
+fi
+# --- 5. Export env vars for skills ---
+if [ -n "${CLAUDE_ENV_FILE:-}" ]; then
+    echo "export EVOLVER_TOOLS=\"$TOOLS_DIR\"" >> "$CLAUDE_ENV_FILE"
+    echo "export EVOLVER_PY=\"$VENV_PY\"" >> "$CLAUDE_ENV_FILE"
+fi

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "harness-evolver",
-  "version": "3.1.1",
+  "version": "3.2.1",
   "description": "LangSmith-native autonomous agent optimization for Claude Code",
   "author": "Raphael Valdetaro",
   "license": "MIT",
@@ -24,6 +24,8 @@
     "bin/",
     "skills/",
     "agents/",
-    "tools/"
+    "tools/",
+    "hooks/",
+    ".claude-plugin/"
   ]
 }

package/skills/evolve/SKILL.md CHANGED

@@ -16,8 +16,9 @@ Run the autonomous propose-evaluate-iterate loop using LangSmith as the evaluati
 ## Resolve Tool Path and Python
 ```bash
-TOOLS=$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver/tools")
-EVOLVER_PY=$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")
+# Prefer env vars set by plugin hook; fallback to legacy npx paths
+TOOLS="${EVOLVER_TOOLS:-$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver/tools")}"
+EVOLVER_PY="${EVOLVER_PY:-$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")}"
 ```
 Use `$EVOLVER_PY` instead of `python3` for ALL tool invocations.
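
The precedence in the shell snippet above (hook-exported env var first, then project-local tools, then the legacy `$HOME/.evolver` install) can be sketched in Python. This is only an illustration of the lookup order, not code from the package; `resolve_tool_paths` and its parameters are hypothetical names.

```python
from pathlib import Path


def resolve_tool_paths(env, cwd, home):
    """Return (tools_dir, python_exe) with the same precedence as the shell snippet.

    `env` is a mapping such as os.environ; `cwd` and `home` are directory paths.
    Hypothetical helper for illustration only.
    """
    # 1. A non-empty env var exported by the SessionStart hook wins,
    #    matching the shell's ${VAR:-default} semantics.
    tools = env.get("EVOLVER_TOOLS")
    if not tools:
        # 2. Project-local tools directory, else the legacy install under home.
        local = Path(cwd) / ".evolver" / "tools"
        tools = ".evolver/tools" if local.is_dir() else str(Path(home) / ".evolver" / "tools")

    python = env.get("EVOLVER_PY")
    if not python:
        # 3. Legacy venv interpreter if present, else whatever python3 is on PATH.
        venv_py = Path(home) / ".evolver" / "venv" / "bin" / "python"
        python = str(venv_py) if venv_py.is_file() else "python3"
    return tools, python
```

Passing the environment and directories as arguments keeps the lookup order easy to exercise without touching the real home directory.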
@@ -234,7 +235,7 @@ Agent(
     Entry point: {entry_point}
     For each experiment:
-    1. Read all runs via: langsmith-cli --json runs list --project "{experiment_name}" --fields id,inputs,outputs,error --is-root --limit 200
+    1. Read all runs via: langsmith-cli --json runs list --project "{experiment_name}" --fields id,inputs,outputs,error --is-root true --limit 200
     2. Judge each run's output against the input
     3. Write scores via: langsmith-cli --json feedback create {run_id} --key {evaluator} --score {0.0|1.0} --comment "{reason}" --source model
     </context>

package/skills/setup/SKILL.md CHANGED

@@ -38,8 +38,9 @@ The tools auto-load the key from the credentials file, but the env var takes pre
 ## Resolve Tool Path and Python
 ```bash
-TOOLS=$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver/tools")
-EVOLVER_PY=$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")
+# Prefer env vars set by plugin hook; fallback to legacy npx paths
+TOOLS="${EVOLVER_TOOLS:-$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver/tools")}"
+EVOLVER_PY="${EVOLVER_PY:-$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")}"
 ```
 Use `$EVOLVER_PY` instead of `python3` for ALL tool invocations. This ensures the venv with langsmith is used.
@@ -60,9 +61,14 @@ Look for:
 To identify the **framework**, read the entry point file and its immediate imports. The proposer agents will use Context7 MCP for detailed documentation lookup — you don't need to detect every library, just identify the main framework (LangGraph, CrewAI, OpenAI Agents SDK, etc.) from the imports you see.
-Identify the **run command** — how to execute the agent:
-- `python main.py` (if it accepts `--input` flag)
-- The command in the project's README, Makefile, or scripts/
+Identify the **run command** — how to execute the agent. Use `{input}` as a placeholder for the JSON file path:
+- `python main.py {input}` — agent reads JSON file from positional arg
+- `python main.py --input {input}` — agent reads JSON file from `--input` flag
+- `python main.py --query {input_json}` — agent receives inline JSON string
+The runner writes `{"input": "user question..."}` to a temp `.json` file and replaces `{input}` with the file path. If the entry point already contains `--input` (without placeholder), the runner appends the file path as the next argument.
+If no placeholder and no `--input` flag detected, the runner appends `--input <path> --output <path>`.
 ## Phase 2: Confirm Detection (interactive)
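
The four run-command cases described in this hunk can be sketched as a single function. It mirrors the branch order shown in the `run_eval.py` and `setup.py` hunks of this diff; the function name `build_command` and its standalone form are assumptions made for illustration.

```python
def build_command(entry_point, input_path, input_json, output_path):
    """Expand an entry point into the command line the runner would execute.

    Hypothetical standalone version of the branch order shown in this diff.
    """
    if "{input}" in entry_point:
        # Placeholder for the path of the temp JSON file.
        return entry_point.replace("{input}", input_path)
    if "{input_json}" in entry_point:
        # Placeholder for the inline JSON string.
        return entry_point.replace("{input_json}", input_json)
    if "--input" in entry_point or "-i " in entry_point:
        # The flag is already present, so only the file path is appended.
        return f"{entry_point} {input_path}"
    # No placeholder and no input flag: append both flags.
    return f"{entry_point} --input {input_path} --output {output_path}"
```

Because the third branch is a plain substring test, an entry point containing a longer flag such as `--input-dir` would also match it, which is worth keeping in mind when choosing a run command.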

package/tools/read_results.py CHANGED

@@ -26,7 +26,7 @@ import sys
 def ensure_langsmith_api_key():
-    """Load LANGSMITH_API_KEY from credentials file if not in env."""
+    """Load LANGSMITH_API_KEY from credentials file or .env if not in env."""
     if os.environ.get("LANGSMITH_API_KEY"):
         return True
     if platform.system() == "Darwin":
@@ -45,6 +45,19 @@ def ensure_langsmith_api_key():
                             return True
         except OSError:
             pass
+    # Also check .env in current directory
+    if os.path.exists(".env"):
+        try:
+            with open(".env") as f:
+                for line in f:
+                    line = line.strip()
+                    if line.startswith("LANGSMITH_API_KEY=") and not line.startswith("#"):
+                        key = line.split("=", 1)[1].strip().strip("'\"")
+                        if key:
+                            os.environ["LANGSMITH_API_KEY"] = key
+                            return True
+        except OSError:
+            pass
     return False
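
The `.env` parsing added in this hunk can be exercised in isolation with a small sketch. It applies the same prefix test and quote stripping to a list of lines; `parse_api_key` is a hypothetical name, and the sample key values in the usage note are placeholders. A commented line starts with `#`, so the prefix test alone already excludes it.

```python
def parse_api_key(lines):
    """Return the first non-empty LANGSMITH_API_KEY value found, or None.

    Hypothetical helper that repeats the line-level logic shown in the hunk.
    """
    for raw in lines:
        line = raw.strip()
        # Only lines that begin with the exact variable name are considered,
        # so "# LANGSMITH_API_KEY=..." and "export LANGSMITH_API_KEY=..." are skipped.
        if line.startswith("LANGSMITH_API_KEY="):
            # Split once so "=" inside the value is preserved, then drop quotes.
            key = line.split("=", 1)[1].strip().strip("'\"")
            if key:
                return key
    return None
```

For instance, `parse_api_key(['# LANGSMITH_API_KEY=old', 'LANGSMITH_API_KEY="example-key"'])` yields `example-key`, while a line written as `export LANGSMITH_API_KEY=...` is not picked up by this logic.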

package/tools/run_eval.py CHANGED

@@ -73,10 +73,16 @@ def make_target(entry_point, cwd):
         try:
             cmd = entry_point
             if "{input}" in cmd:
+                # Placeholder: replace with path to JSON file
                 cmd = cmd.replace("{input}", input_path)
             elif "{input_json}" in cmd:
+                # Placeholder: replace with inline JSON string
                 cmd = cmd.replace("{input_json}", input_json)
+            elif "--input" in cmd or "-i " in cmd:
+                # Entry point already has --input flag — pass the file path as next arg
+                cmd = f"{cmd} {input_path}"
             else:
+                # Default: append --input and --output flags
                 cmd = f"{cmd} --input {input_path} --output {output_path}"
             env = os.environ.copy()
@@ -197,17 +203,38 @@ def main():
         experiment_name = results.experiment_name
         # Calculate mean score from code-based evaluators only
+        # langsmith>=0.7.x returns dicts, older versions return dataclasses
         scores = []
         per_example = {}
         for result in results:
             example_scores = []
-            if result.evaluation_results and result.evaluation_results.get("results"):
-                for er in result.evaluation_results["results"]:
-                    if er.get("score") is not None:
-                        example_scores.append(er["score"])
-                        scores.append(er["score"])
-            example_id = str(result.example.id) if result.example else "unknown"
+            # Handle both dict and object results (SDK version compat)
+            if isinstance(result, dict):
+                eval_results = result.get("evaluation_results", {})
+                if isinstance(eval_results, dict):
+                    eval_list = eval_results.get("results", [])
+                else:
+                    eval_list = getattr(eval_results, "results", []) or []
+                example_obj = result.get("example")
+                example_id = str(example_obj.get("id", "unknown") if isinstance(example_obj, dict) else getattr(example_obj, "id", "unknown"))
+            else:
+                eval_results = getattr(result, "evaluation_results", None)
+                if isinstance(eval_results, dict):
+                    eval_list = eval_results.get("results", [])
+                elif eval_results:
+                    eval_list = getattr(eval_results, "results", []) or []
+                else:
+                    eval_list = []
+                example_obj = getattr(result, "example", None)
+                example_id = str(getattr(example_obj, "id", "unknown") if example_obj else "unknown")
+            for er in eval_list:
+                score_val = er.get("score") if isinstance(er, dict) else getattr(er, "score", None)
+                if score_val is not None:
+                    example_scores.append(score_val)
+                    scores.append(score_val)
             per_example[example_id] = {
                 "score": sum(example_scores) / len(example_scores) if example_scores else 0.0,
                 "num_evaluators": len(example_scores),
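
The dict-or-object handling in this hunk can also be expressed with one accessor that works on both shapes. The sketch below is an alternative formulation for illustration, not the package's code; `field` and `extract_scores` are hypothetical names.

```python
def field(obj, name, default=None):
    """Read `name` from a dict key or an object attribute; None yields default."""
    if obj is None:
        return default
    if isinstance(obj, dict):
        return obj.get(name, default)
    return getattr(obj, name, default)


def extract_scores(result):
    """Return (example_id, scores) for one evaluation result row.

    Accepts either a dict-shaped row or an attribute-style object.
    """
    # evaluation_results -> results, falling back to an empty list.
    eval_list = field(field(result, "evaluation_results"), "results") or []
    # example -> id, falling back to "unknown" when the example is missing.
    example_id = str(field(field(result, "example"), "id", "unknown"))
    # Keep only evaluator results that actually carry a score.
    scores = [s for s in (field(er, "score") for er in eval_list) if s is not None]
    return example_id, scores
```

Routing every lookup through one accessor removes the duplicated dict and object branches, at the cost of hiding which shape a given SDK version actually returns.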

package/tools/setup.py CHANGED

@@ -267,6 +267,8 @@ def make_target(entry_point, cwd=None):
                 cmd = cmd.replace("{input}", input_path)
             elif "{input_json}" in cmd:
                 cmd = cmd.replace("{input_json}", input_json)
+            elif "--input" in cmd or "-i " in cmd:
+                cmd = f"{cmd} {input_path}"
             else:
                 cmd = f"{cmd} --input {input_path} --output {output_path}"

package/tools/trace_insights.py CHANGED

@@ -23,10 +23,46 @@ Requires: pip install langsmith (for SDK mode)
 import argparse
 import json
 import os
+import platform
 import sys
 from datetime import datetime, timezone
+def ensure_langsmith_api_key():
+    """Load LANGSMITH_API_KEY from credentials file or .env if not in env."""
+    if os.environ.get("LANGSMITH_API_KEY"):
+        return True
+    if platform.system() == "Darwin":
+        creds_path = os.path.expanduser("~/Library/Application Support/langsmith-cli/credentials")
+    else:
+        creds_path = os.path.expanduser("~/.config/langsmith-cli/credentials")
+    if os.path.exists(creds_path):
+        try:
+            with open(creds_path) as f:
+                for line in f:
+                    line = line.strip()
+                    if line.startswith("LANGSMITH_API_KEY="):
+                        key = line.split("=", 1)[1].strip()
+                        if key:
+                            os.environ["LANGSMITH_API_KEY"] = key
+                            return True
+        except OSError:
+            pass
+    if os.path.exists(".env"):
+        try:
+            with open(".env") as f:
+                for line in f:
+                    line = line.strip()
+                    if line.startswith("LANGSMITH_API_KEY=") and not line.startswith("#"):
+                        key = line.split("=", 1)[1].strip().strip("'\"")
+                        if key:
+                            os.environ["LANGSMITH_API_KEY"] = key
+                            return True
+        except OSError:
+            pass
+    return False
 def load_json(path):
     """Load JSON file, return None if missing or invalid."""
     if not path or not os.path.exists(path):
@@ -260,6 +296,7 @@ def identify_top_issues(error_clusters, response_analysis, score_cross_ref):
 def fetch_runs_from_langsmith(project_name, experiment_name=None, limit=50):
     """Fetch runs directly from LangSmith SDK (v3 mode)."""
     try:
+        ensure_langsmith_api_key()
         from langsmith import Client
         client = Client()