npm - harness-evolver - Versions diffs - 0.7.1 → 0.9.0 - Mend

harness-evolver 0.7.1 → 0.9.0


Files changed (3)

package/README.md CHANGED

@@ -4,46 +4,102 @@ End-to-end optimization of LLM agent harnesses, inspired by [Meta-Harness](https
 **The harness is the 80% factor.** Changing just the scaffolding around a fixed LLM can produce a [6x performance gap](https://arxiv.org/abs/2603.28052) on the same benchmark. Harness Evolver automates the search for better harnesses using an autonomous propose-evaluate-iterate loop with full execution traces as feedback.
-## Why
+## Install
-Manual harness engineering is slow and doesn't scale. Existing optimizers work in prompt-space (OPRO, TextGrad, GEPA) or use compressed summaries. Meta-Harness showed that **code-space search with full diagnostic context** (10M+ tokens of traces) outperforms all of them by 10+ points.
+```bash
+npx harness-evolver@latest
+```
-Harness Evolver brings that approach to any domain as a Claude Code plugin.
+Select your runtime (Claude Code, Cursor, Codex, Windsurf) and scope (global/local). Then **restart your AI coding agent** for the skills to appear.
-## Install
+## Prerequisites
+### API Keys (set in your shell before launching Claude Code)
+The harness you're evolving may call LLM APIs. Set the keys your harness needs:
 ```bash
-# Via npx (recommended)
-npx harness-evolver@latest
+# Required: at least one LLM provider
+export ANTHROPIC_API_KEY="sk-ant-..."       # For Claude-based harnesses
+export OPENAI_API_KEY="sk-..."              # For OpenAI-based harnesses
+export GEMINI_API_KEY="AIza..."             # For Gemini-based harnesses
+export OPENROUTER_API_KEY="sk-or-..."       # For OpenRouter (multi-model)
+# Optional: enhanced tracing
+export LANGSMITH_API_KEY="lsv2_pt_..."      # Auto-enables LangSmith tracing
+```
+The plugin auto-detects which keys are available during `/harness-evolver:init` and shows them. The proposer agent knows which APIs are available and uses them accordingly.
+**No API key needed for the example** — the classifier example uses keyword matching (mock mode), no LLM calls.
+### Optional: Enhanced Integrations
+```bash
+# LangSmith — rich trace analysis for the proposer
+uv tool install langsmith-cli && langsmith-cli auth login
+# Context7 — up-to-date library documentation for the proposer
+claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
-# Or as a Claude Code plugin
-/plugin install harness-evolver
+# LangChain Docs — LangChain/LangGraph-specific documentation
+claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
 ```
 ## Quick Start
+### Try the Example (no API key needed)
 ```bash
-# 1. Copy the example into a working directory
+# 1. Copy the example
 cp -r ~/.harness-evolver/examples/classifier ./my-classifier
 cd my-classifier
-# 2. Initialize (validates harness, evaluates baseline)
-/harness-evolve-init --harness harness.py --eval eval.py --tasks tasks/
+# 2. Open Claude Code
+claude
+# 3. Initialize — auto-detects harness.py, eval.py, tasks/
+/harness-evolver:init
+# 4. Run the evolution loop
+/harness-evolver:evolve --iterations 3
-# 3. Run the evolution loop
-/harness-evolve --iterations 5
+# 5. Check progress
+/harness-evolver:status
+```
+### Use with Your Own Project
+```bash
+cd my-llm-project
+claude
-# 4. Check progress anytime
-/harness-evolve-status
+# Init scans your project, identifies the entry point,
+# and helps create harness wrapper + eval + tasks if missing
+/harness-evolver:init
+# Run optimization
+/harness-evolver:evolve --iterations 10
 ```
-The classifier example runs in mock mode (no API key needed) and demonstrates the full loop in under 2 minutes.
+The init skill adapts to your project — if you have `graph.py` instead of `harness.py`, it creates a thin wrapper. If you don't have an eval script, it helps you write one.
+## Available Commands
+| Command | What it does |
+|---|---|
+| `/harness-evolver:init` | Scan project, create harness/eval/tasks, run baseline |
+| `/harness-evolver:evolve` | Run the autonomous optimization loop |
+| `/harness-evolver:status` | Show progress (scores, iterations, stagnation) |
+| `/harness-evolver:compare` | Diff two versions with per-task analysis |
+| `/harness-evolver:diagnose` | Deep trace analysis of a specific version |
+| `/harness-evolver:deploy` | Copy the best harness back to your project |
 ## How It Works
 ```
                     ┌─────────────────────────────┐
-                    │     /harness-evolve          │
+                    │   /harness-evolver:evolve    │
                     │     (orchestrator skill)     │
                     └──────────┬──────────────────┘
                                │
@@ -63,10 +119,10 @@ The classifier example runs in mock mode (no API key needed) and demonstrates th
         scores.json
 ```
-1. **Propose** — A proposer agent (Claude Code subagent) reads all prior candidates' code, execution traces, and scores. It diagnoses failure modes via counterfactual analysis and writes a new harness.
-2. **Evaluate** — The harness runs against every task. Traces are captured per-task (input, output, stdout, stderr, timing). The user's eval script scores the results.
+1. **Propose** — A proposer agent reads all prior candidates' code, execution traces, and scores. Diagnoses failure modes via counterfactual analysis and writes a new harness.
+2. **Evaluate** — The harness runs against every task. Traces are captured per-task (input, output, stdout, stderr, timing). Your eval script scores the results.
 3. **Update** — State files are updated with the new score, parent lineage, and regression detection.
-4. **Repeat** — The loop continues until N iterations, stagnation (3 rounds without >1% improvement), or a target score is reached.
+4. **Repeat** — Until N iterations, stagnation (3 rounds without >1% improvement), or target score reached.
 ## The Harness Contract
@@ -78,8 +134,8 @@ python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--
 - `--input`: JSON with `{id, input, metadata}` (never sees expected answers)
 - `--output`: JSON with `{id, output}`
-- `--traces-dir`: optional directory for the harness to write rich traces
-- `--config`: optional JSON with evolvable parameters (model, temperature, etc.)
+- `--traces-dir`: optional directory for rich traces
+- `--config`: optional JSON with evolvable parameters
 The eval script is also any executable:
@@ -87,165 +143,104 @@ The eval script is also any executable:
 python3 eval.py --results-dir results/ --tasks-dir tasks/ --scores scores.json
 ```
-This means Harness Evolver works with **any language, any framework, any domain**.
+Works with **any language, any framework, any domain**.
-## Project Structure
+## Project Structure (after init)
 ```
-.harness-evolver/                     # Created in your project by /harness-evolve-init
-├── config.json                       # Project config (harness cmd, eval cmd, evolution params)
+.harness-evolver/                     # Created by /harness-evolver:init
+├── config.json                       # Project config (harness cmd, eval, API keys detected)
 ├── summary.json                      # Source of truth (versions, scores, parents)
-├── STATE.md                          # Human-readable status (generated)
+├── STATE.md                          # Human-readable status
 ├── PROPOSER_HISTORY.md               # Log of all proposals and outcomes
-├── baseline/                         # Original harness (read-only reference)
-│   ├── harness.py
-│   └── config.json
+├── baseline/                         # Original harness (read-only)
+│   └── harness.py
 ├── eval/
-│   ├── eval.py                       # Scoring script
-│   └── tasks/                        # Test cases (JSON files)
+│   ├── eval.py                       # Your scoring script
+│   └── tasks/                        # Test cases
 └── harnesses/
     └── v001/
-        ├── harness.py                # Candidate code
-        ├── config.json               # Evolvable parameters
-        ├── proposal.md               # Proposer's reasoning
-        ├── scores.json               # Evaluation results
+        ├── harness.py                # Evolved candidate
+        ├── proposal.md               # Why this version was created
+        ├── scores.json               # How it scored
         └── traces/                   # Full execution traces
             ├── stdout.log
             ├── stderr.log
             ├── timing.json
             └── task_001/
-                ├── input.json        # What the harness received
-                └── output.json       # What the harness returned
+                ├── input.json
+                └── output.json
 ```
-## Plugin Architecture
+## The Proposer
-Three-layer design inspired by [GSD](https://github.com/gsd-build/get-shit-done):
+The core of the system. 4-phase workflow from the Meta-Harness paper:
-```
-Layer 1: Skills + Agents (markdown)     → AI orchestration
-Layer 2: Tools (Python stdlib-only)     → Deterministic operations
-Layer 3: Installer (Node.js)            → Distribution via npx
-```
+| Phase | What it does |
+|---|---|
+| **Orient** | Read `summary.json` + `PROPOSER_HISTORY.md`. Pick 2-3 versions to investigate. |
+| **Diagnose** | Deep trace analysis. grep for errors, diff versions, counterfactual diagnosis. |
+| **Propose** | Write new harness. Prefer additive changes after regressions. |
+| **Document** | Write `proposal.md` with evidence. Update history. |
-| Component | Files | Purpose |
-|---|---|---|
-| **Skills** | `skills/harness-evolve-init/`, `skills/harness-evolve/`, `skills/harness-evolve-status/` | Slash commands that orchestrate the loop |
-| **Agent** | `agents/harness-evolver-proposer.md` | The proposer — 4-phase workflow (orient, diagnose, propose, document) with 6 rules |
-| **Tools** | `tools/evaluate.py`, `tools/state.py`, `tools/init.py`, `tools/detect_stack.py`, `tools/trace_logger.py` | CLI tools called via subprocess — zero LLM tokens spent on deterministic work |
-| **Installer** | `bin/install.js`, `package.json` | Copies skills/agents/tools to the right locations |
-| **Example** | `examples/classifier/` | 10-task medical classifier with mock mode |
+**7 rules:** evidence-based changes, conservative after regression, don't repeat mistakes, one hypothesis at a time, maintain interface, prefer readability, use available API keys from environment.
 ## Integrations
-### LangSmith (optional)
-If `LANGSMITH_API_KEY` is set, the plugin automatically:
-- Enables `LANGCHAIN_TRACING_V2` for auto-tracing of LangChain/LangGraph harnesses
-- Detects [langsmith-cli](https://github.com/gigaverse-app/langsmith-cli) for the proposer to query traces directly
+### LangSmith (optional, recommended for LangChain/LangGraph harnesses)
 ```bash
-# Setup
 export LANGSMITH_API_KEY=lsv2_...
 uv tool install langsmith-cli && langsmith-cli auth login
-# The proposer can then do:
-langsmith-cli --json runs list --project harness-evolver-v003 --failed --fields id,name,error
-langsmith-cli --json runs stats --project harness-evolver-v003
 ```
-No custom API client — the proposer uses `langsmith-cli` like it uses `grep` and `diff`.
-### Context7 (optional)
-The plugin detects the harness's technology stack via AST analysis (17 libraries supported) and instructs the proposer to consult current documentation before proposing API changes.
+When detected, the plugin:
+- Sets `LANGCHAIN_TRACING_V2=true` automatically — all LLM calls are traced
+- The proposer queries traces directly via `langsmith-cli`:
 ```bash
-# Setup
-claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
-# The proposer automatically:
-# 1. Reads config.json → stack.detected (e.g., LangChain, ChromaDB)
-# 2. Queries Context7 for current docs before writing code
-# 3. Annotates proposal.md with "API verified via Context7"
+langsmith-cli --json runs list --project harness-evolver-v003 --failed --fields id,name,error
+langsmith-cli --json runs stats --project harness-evolver-v003
 ```
-Without Context7, the proposer uses model knowledge and annotates "API not verified against current docs."
-### LangChain Docs MCP (optional)
+### Context7 (optional, recommended for any library-heavy harness)
 ```bash
-claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
+claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
 ```
-Complements Context7 with LangChain/LangGraph/LangSmith-specific documentation search.
-## The Proposer
-The proposer agent is the core of the system. It follows a 4-phase workflow derived from the Meta-Harness paper:
-| Phase | Context % | What it does |
-|---|---|---|
-| **Orient** | ~6% | Read `summary.json` and `PROPOSER_HISTORY.md`. Decide which 2-3 versions to investigate. |
-| **Diagnose** | ~80% | Deep trace analysis on selected versions. grep for errors, diff between good/bad versions, counterfactual diagnosis. |
-| **Propose** | ~10% | Write new `harness.py` + `config.json`. Prefer additive changes after regressions. |
-| **Document** | ~4% | Write `proposal.md` with evidence. Append to `PROPOSER_HISTORY.md`. |
-**6 rules:**
-1. Every change motivated by evidence (cite task ID, trace line, or score delta)
-2. After regression, prefer additive changes
-3. Don't repeat past mistakes (read PROPOSER_HISTORY.md)
-4. One hypothesis at a time when possible
-5. Maintain the CLI interface
-6. Prefer readable harnesses over defensive ones
-## Supported Libraries (Stack Detection)
-The AST-based stack detector recognizes 17 libraries:
-| Category | Libraries |
-|---|---|
-| **AI Frameworks** | LangChain, LangGraph, LlamaIndex, OpenAI, Anthropic, DSPy, CrewAI, AutoGen |
-| **Vector Stores** | ChromaDB, Pinecone, Qdrant, Weaviate |
-| **Web** | FastAPI, Flask, Pydantic |
-| **Data** | Pandas, NumPy |
+The plugin detects your stack via AST analysis (17 libraries: LangChain, LangGraph, OpenAI, Anthropic, ChromaDB, FastAPI, etc.) and instructs the proposer to consult current docs before proposing API changes.
 ## Development
 ```bash
-# Run all tests (41 tests, stdlib-only, no pip install needed)
+# Run all tests (41 tests, stdlib-only)
 python3 -m unittest discover -s tests -v
-# Test the example manually
-cd examples/classifier
-python3 harness.py --input tasks/task_001.json --output /tmp/result.json --config config.json
-cat /tmp/result.json
+# Test example manually
+python3 examples/classifier/harness.py --input examples/classifier/tasks/task_001.json --output /tmp/result.json --config examples/classifier/config.json
-# Run the installer locally
+# Install locally for development
 node bin/install.js
 ```
-## Comparison with Related Work
+## Comparison
-| | Meta-Harness (paper) | A-Evolve | ECC /evolve | **Harness Evolver** |
+| | Meta-Harness | A-Evolve | ECC | **Harness Evolver** |
 |---|---|---|---|---|
 | **Format** | Paper artifact | Framework (Docker) | Plugin (passive) | **Plugin (active)** |
-| **Search space** | Code-space | Code-space | Prompt-space | **Code-space** |
-| **Context/iter** | 10M tokens | Variable | N/A | **Full filesystem** |
+| **Search** | Code-space | Code-space | Prompt-space | **Code-space** |
 | **Domain** | TerminalBench-2 | Coding benchmarks | Dev workflow | **Any domain** |
-| **Install** | Manual Python | Docker CLI | `/plugin install` | **`npx` or `/plugin install`** |
-| **LangSmith** | No | No | No | **Yes (langsmith-cli)** |
-| **Context7** | No | No | No | **Yes (MCP)** |
+| **Install** | Manual Python | Docker CLI | `/plugin install` | **`npx`** |
+| **LangSmith** | No | No | No | **Yes** |
+| **Context7** | No | No | No | **Yes** |
 ## References
-- [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
-- [GSD (Get Shit Done)](https://github.com/gsd-build/get-shit-done) — CLI architecture inspiration
-- [LangSmith CLI](https://github.com/gigaverse-app/langsmith-cli) — Trace analysis for the proposer
-- [Context7](https://github.com/upstash/context7) — Documentation lookup via MCP
+- [Meta-Harness paper (arxiv 2603.28052)](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
 - [Design Spec](docs/specs/2026-03-31-harness-evolver-design.md)
-- [LangSmith Integration Spec](docs/specs/2026-03-31-langsmith-integration.md)
-- [Context7 Integration Spec](docs/specs/2026-03-31-context7-integration.md)
+- [LangSmith Integration](docs/specs/2026-03-31-langsmith-integration.md)
+- [Context7 Integration](docs/specs/2026-03-31-context7-integration.md)
 ## License

package/bin/install.js CHANGED

@@ -70,30 +70,43 @@ function checkPython() {
   }
 }
+function checkCommand(cmd) {
+  try {
+    execSync(cmd, { stdio: "pipe" });
+    return true;
+  } catch {
+    return false;
+  }
+}
 function installForRuntime(runtimeDir, scope) {
   const baseDir = scope === "local"
     ? path.join(process.cwd(), runtimeDir)
     : path.join(HOME, runtimeDir);
-  const commandsDir = path.join(baseDir, "commands", "harness-evolver");
+  const skillsDir = path.join(baseDir, "skills");
   const agentsDir = path.join(baseDir, "agents");
-  // Skills → commands/harness-evolver/ as flat .md files
-  // Claude Code expects commands/name.md, not commands/name/SKILL.md
+  // Skills → ~/.claude/skills/<skill-name>/SKILL.md (proper skills format)
   const skillsSource = path.join(PLUGIN_ROOT, "skills");
   if (fs.existsSync(skillsSource)) {
-    fs.mkdirSync(commandsDir, { recursive: true });
     for (const skill of fs.readdirSync(skillsSource, { withFileTypes: true })) {
       if (skill.isDirectory()) {
-        const skillMd = path.join(skillsSource, skill.name, "SKILL.md");
-        if (fs.existsSync(skillMd)) {
-          fs.copyFileSync(skillMd, path.join(commandsDir, skill.name + ".md"));
-          console.log(`  ${GREEN}✓${RESET} Installed command: harness-evolver:${skill.name}`);
-        }
+        const src = path.join(skillsSource, skill.name);
+        const dest = path.join(skillsDir, "harness-evolver:" + skill.name);
+        copyDir(src, dest);
+        console.log(`  ${GREEN}✓${RESET} Installed skill: harness-evolver:${skill.name}`);
       }
     }
   }
+  // Cleanup old commands/ install (from previous versions)
+  const oldCommandsDir = path.join(baseDir, "commands", "harness-evolver");
+  if (fs.existsSync(oldCommandsDir)) {
+    fs.rmSync(oldCommandsDir, { recursive: true, force: true });
+    console.log(`  ${GREEN}✓${RESET} Cleaned up old commands/ directory`);
+  }
   // Agents → agents/
   const agentsSource = path.join(PLUGIN_ROOT, "agents");
   if (fs.existsSync(agentsSource)) {
@@ -219,7 +232,90 @@ async function main() {
   fs.writeFileSync(versionPath, VERSION);
   console.log(`  ${GREEN}✓${RESET} VERSION ${VERSION}`);
-  console.log(`\n  ${GREEN}Done!${RESET} Run ${BRIGHT_MAGENTA}/reload-plugins${RESET} in Claude Code, then ${BRIGHT_MAGENTA}/harness-evolver:init${RESET}`);
+  console.log(`\n  ${GREEN}Done!${RESET} Restart Claude Code, then run ${BRIGHT_MAGENTA}/harness-evolver:init${RESET}\n`);
+  // Optional integrations
+  console.log(`  ${YELLOW}Install optional integrations?${RESET}\n`);
+  console.log(`  These enhance the proposer with rich traces and up-to-date documentation.\n`);
+  // LangSmith CLI
+  const hasLangsmithCli = checkCommand("langsmith-cli --version");
+  if (hasLangsmithCli) {
+    console.log(`  ${GREEN}✓${RESET} langsmith-cli already installed`);
+  } else {
+    console.log(`  ${BOLD}LangSmith CLI${RESET} — rich trace analysis (error rates, latency, token usage)`);
+    console.log(`    ${DIM}uv tool install langsmith-cli && langsmith-cli auth login${RESET}`);
+    const lsAnswer = await ask(rl, `\n  ${YELLOW}Install langsmith-cli? [y/N]:${RESET} `);
+    if (lsAnswer.trim().toLowerCase() === "y") {
+      console.log(`\n  Installing langsmith-cli...`);
+      try {
+        execSync("uv tool install langsmith-cli", { stdio: "inherit" });
+        console.log(`\n  ${GREEN}✓${RESET} langsmith-cli installed`);
+        console.log(`  ${YELLOW}Run ${BOLD}langsmith-cli auth login${RESET}${YELLOW} to authenticate with your LangSmith API key.${RESET}\n`);
+      } catch {
+        console.log(`\n  ${RED}Failed.${RESET} Install manually: uv tool install langsmith-cli\n`);
+      }
+    }
+  }
+  // Context7 MCP
+  const hasContext7 = (() => {
+    try {
+      for (const p of [path.join(HOME, ".claude", "settings.json"), path.join(HOME, ".claude.json")]) {
+        if (fs.existsSync(p)) {
+          const s = JSON.parse(fs.readFileSync(p, "utf8"));
+          if (s.mcpServers && (s.mcpServers.context7 || s.mcpServers.Context7)) return true;
+        }
+      }
+    } catch {}
+    return false;
+  })();
+  if (hasContext7) {
+    console.log(`  ${GREEN}✓${RESET} Context7 MCP already configured`);
+  } else {
+    console.log(`\n  ${BOLD}Context7 MCP${RESET} — up-to-date library documentation (LangChain, OpenAI, etc.)`);
+    console.log(`    ${DIM}claude mcp add context7 -- npx -y @upstash/context7-mcp@latest${RESET}`);
+    const c7Answer = await ask(rl, `\n  ${YELLOW}Install Context7 MCP? [y/N]:${RESET} `);
+    if (c7Answer.trim().toLowerCase() === "y") {
+      console.log(`\n  Installing Context7 MCP...`);
+      try {
+        execSync("claude mcp add context7 -- npx -y @upstash/context7-mcp@latest", { stdio: "inherit" });
+        console.log(`\n  ${GREEN}✓${RESET} Context7 MCP configured`);
+      } catch {
+        console.log(`\n  ${RED}Failed.${RESET} Install manually: claude mcp add context7 -- npx -y @upstash/context7-mcp@latest\n`);
+      }
+    }
+  }
+  // LangChain Docs MCP
+  const hasLcDocs = (() => {
+    try {
+      for (const p of [path.join(HOME, ".claude", "settings.json"), path.join(HOME, ".claude.json")]) {
+        if (fs.existsSync(p)) {
+          const s = JSON.parse(fs.readFileSync(p, "utf8"));
+          if (s.mcpServers && (s.mcpServers["docs-langchain"] || s.mcpServers["LangChain Docs"])) return true;
+        }
+      }
+    } catch {}
+    return false;
+  })();
+  if (hasLcDocs) {
+    console.log(`  ${GREEN}✓${RESET} LangChain Docs MCP already configured`);
+  } else {
+    console.log(`\n  ${BOLD}LangChain Docs MCP${RESET} — LangChain/LangGraph/LangSmith documentation search`);
+    console.log(`    ${DIM}claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp${RESET}`);
+    const lcAnswer = await ask(rl, `\n  ${YELLOW}Install LangChain Docs MCP? [y/N]:${RESET} `);
+    if (lcAnswer.trim().toLowerCase() === "y") {
+      console.log(`\n  Installing LangChain Docs MCP...`);
+      try {
+        execSync("claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp", { stdio: "inherit" });
+        console.log(`\n  ${GREEN}✓${RESET} LangChain Docs MCP configured`);
+      } catch {
+        console.log(`\n  ${RED}Failed.${RESET} Install manually: claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp\n`);
+      }
+    }
+  }
   console.log(`\n  ${DIM}Quick start with example:${RESET}`);
   console.log(`    cp -r ~/.harness-evolver/examples/classifier ./my-project`);
   console.log(`    cd my-project && claude`);

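The two MCP checks added in this hunk (`hasContext7` and `hasLcDocs`) repeat the same scan of `~/.claude/settings.json` and `~/.claude.json`, differing only in the server names they look for. One possible way to factor that out is sketched below. The names `hasMcpServer` and `defaultConfigPaths` are hypothetical and do not exist in the package. The sketch also differs from the diff in one respect: the `try` sits inside the loop, so a malformed first file does not prevent the second from being checked.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// The two config files the installer in the diff inspects.
function defaultConfigPaths(home = os.homedir()) {
  return [
    path.join(home, ".claude", "settings.json"),
    path.join(home, ".claude.json"),
  ];
}

// Hypothetical helper: returns true if any config file lists one of
// `names` as a key under its top-level "mcpServers" object.
function hasMcpServer(names, configPaths = defaultConfigPaths()) {
  for (const p of configPaths) {
    try {
      if (!fs.existsSync(p)) continue;
      const servers = JSON.parse(fs.readFileSync(p, "utf8")).mcpServers;
      if (servers && names.some((n) => servers[n])) return true;
    } catch {
      // Unreadable or malformed file: skip it and keep checking the others.
    }
  }
  return false;
}
```

With a helper like this, the two checks would reduce to `hasMcpServer(["context7", "Context7"])` and `hasMcpServer(["docs-langchain", "LangChain Docs"])`. Accepting the paths as a parameter also lets the check run against temporary files in a test instead of the real home directory.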
package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "harness-evolver",
-  "version": "0.7.1",
+  "version": "0.9.0",
   "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
   "author": "Raphael Valdetaro",
   "license": "MIT",