npm - harness-evolver - Versions diffs - 4.3.0 → 4.3.1 - Mend

harness-evolver 4.3.0 → 4.3.1

Files changed (5)

package/.claude-plugin/plugin.json +1 -1
package/README.md +15 -9
package/package.json +1 -1
package/skills/evolve/SKILL.md +28 -33
package/tools/run_eval.py +6 -1

package/.claude-plugin/plugin.json CHANGED

@@ -1,7 +1,7 @@
 {
   "name": "harness-evolver",
   "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
-  "version": "4.3.0",
+  "version": "4.3.1",
   "author": {
     "name": "Raphael Valdetaro"
   },

package/README.md CHANGED

@@ -91,8 +91,12 @@ claude
 <td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which approaches win, which failures recur, and promotes insights after 2+ occurrences.</td>
 </tr>
 <tr>
+<td><b>Dataset Health</b></td>
+<td>Pre-flight dataset quality check: size adequacy, difficulty distribution, dead example detection, production coverage analysis, train/held-out splits. Auto-corrects issues before evolution starts.</td>
+</tr>
+<tr>
 <td><b>Smart Gating</b></td>
-<td>Three-gate iteration triggers (score plateau, cost budget, convergence detection) replace blind N-iteration loops. State validation ensures config hasn't diverged from LangSmith.</td>
+<td>Claude assesses gate conditions directly — score plateau, target reached, diminishing returns. No hardcoded thresholds. State validation ensures config hasn't diverged from LangSmith.</td>
 </tr>
 <tr>
 <td><b>Background Mode</b></td>
@@ -107,6 +111,7 @@ claude
 | Command | What it does |
 |---|---|
 | `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |
+| `/evolver:health` | Check dataset quality (size, difficulty, coverage, splits), auto-correct issues |
 | `/evolver:evolve` | Run the optimization loop (dynamic self-organizing proposers in worktrees) |
 | `/evolver:status` | Show progress, scores, history |
 | `/evolver:deploy` | Tag, push, clean up temporary files |
@@ -132,10 +137,11 @@ claude
 /evolver:evolve
   |
   +- 0.5  Validate state (skeptical memory — check .evolver.json vs LangSmith)
+  +- 0.6  /evolver:health — dataset quality check + auto-correct
   +- 1.   Read state (.evolver.json + LangSmith experiments)
   +- 1.5  Gather trace insights (cluster errors, tokens, latency)
-  +- 1.8  Analyze per-task failures
-  +- 1.8a Synthesize strategy document + dynamic lenses (investigation questions)
+  +- 1.8  Analyze per-task failures (train split only — proposers don't see held-out)
+  +- 1.8a Claude generates strategy.md + lenses.json from analysis data
   +- 1.9  Prepare shared proposer context (KV cache-optimized prefix)
   +- 2.   Spawn N self-organizing proposers in parallel (each in a git worktree)
   +- 3.   Run target for each candidate (code-based evaluators)
@@ -144,10 +150,10 @@ claude
   +- 5.   Merge winning worktree into main branch
   +- 5.5  Regression tracking (auto-add guard examples to dataset)
   +- 6.   Report results
-  +- 6.2  Consolidate evolution memory (orient/gather/consolidate/prune)
+  +- 6.2  Consolidator agent updates evolution memory (runs in background)
   +- 6.5  Auto-trigger Active Critic (detect + fix evaluator gaming)
   +- 7.   Auto-trigger ULTRAPLAN Architect (opus model, deep analysis)
-  +- 8.   Three-gate check (score plateau, cost budget, convergence)
+  +- 8.   Claude assesses gate conditions (plateau, target, diminishing returns)
 ```
 ---
@@ -159,7 +165,8 @@ Plugin hook (SessionStart)
   └→ Creates venv, installs langsmith + langsmith-cli, exports env vars
 Skills (markdown)
-  ├── /evolver:setup    → explores project, runs setup.py
+  ├── /evolver:setup    → explores project, smart defaults, runs setup.py
+  ├── /evolver:health   → dataset quality check + auto-correct
   ├── /evolver:evolve   → orchestrates the evolution loop
   ├── /evolver:status   → reads .evolver.json + LangSmith
   └── /evolver:deploy   → tags and pushes
@@ -179,10 +186,8 @@ Tools (Python + langsmith SDK)
   ├── trace_insights.py     → clusters errors from traces
   ├── seed_from_traces.py   → imports production traces
   ├── validate_state.py     → validates config vs LangSmith state
-  ├── iteration_gate.py     → three-gate iteration triggers
+  ├── dataset_health.py     → dataset quality diagnostic (size, difficulty, coverage, splits)
   ├── regression_tracker.py → tracks regressions, adds guard examples
-  ├── consolidate.py        → cross-iteration memory consolidation
-  ├── synthesize_strategy.py→ generates strategy document + investigation lenses
   ├── add_evaluator.py      → programmatically adds evaluators
   └── adversarial_inject.py → detects memorization, injects adversarial tests
 ```
@@ -221,6 +226,7 @@ LangSmith traces **any** AI framework. The evolver works with all of them:
 - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI
 - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
 - [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation) — LangChain
+- [Harnessing Claude's Intelligence](https://claude.com/blog/harnessing-claudes-intelligence) — Martin, Anthropic, 2026
 - [Traces Start the Agent Improvement Loop](https://www.langchain.com/conceptual-guides/traces-start-agent-improvement-loop) — LangChain
 ---

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "harness-evolver",
-  "version": "4.3.0",
+  "version": "4.3.1",
   "description": "LangSmith-native autonomous agent optimization for Claude Code",
   "author": "Raphael Valdetaro",
   "license": "MIT",

package/skills/evolve/SKILL.md CHANGED

@@ -156,44 +156,36 @@ For each iteration:
 python3 -c "import json; c=json.load(open('.evolver.json')); print(f'v{c[\"iterations\"]+1:03d}')"
 ```
-### 1.5. Gather Trace Insights
+### 1.5. Gather Analysis Data (Parallel)
-Read the best experiment from config. If null (no baseline was run), skip trace insights for this iteration — proposers will work blind on the first pass:
+Read the best experiment from config. If null (no baseline was run), skip data gathering — proposers will work from code analysis only:
 ```bash
 BEST=$(python3 -c "import json; b=json.load(open('.evolver.json')).get('best_experiment'); print(b if b else '')")
+PROD=$(python3 -c "import json; c=json.load(open('.evolver.json')); print(c.get('production_project',''))")
 if [ -n "$BEST" ]; then
+    # Run all data gathering in parallel — these are independent API calls
     $EVOLVER_PY $TOOLS/trace_insights.py \
         --from-experiment "$BEST" \
-        --output trace_insights.json 2>/dev/null
-fi
-```
+        --output trace_insights.json 2>/dev/null &
-If a production project is configured, also gather production insights:
+    $EVOLVER_PY $TOOLS/read_results.py \
+        --experiment "$BEST" \
+        --config .evolver.json \
+        --split train \
+        --output best_results.json 2>/dev/null &
+fi
-```bash
-PROD=$(python3 -c "import json; c=json.load(open('.evolver.json')); print(c.get('production_project',''))")
 if [ -n "$PROD" ] && [ ! -f "production_seed.json" ]; then
     $EVOLVER_PY $TOOLS/seed_from_traces.py \
         --project "$PROD" \
         --output-md production_seed.md \
         --output-json production_seed.json \
-        --limit 100 2>/dev/null
+        --limit 100 2>/dev/null &
 fi
-```
-### 1.8. Analyze Per-Task Failures
-If `$BEST` is set (not the first iteration without baseline), read results and cluster failures:
-```bash
-if [ -n "$BEST" ]; then
-    $EVOLVER_PY $TOOLS/read_results.py \
-        --experiment "$BEST" \
-        --config .evolver.json \
-        --split train \
-        --output best_results.json 2>/dev/null
-fi
+wait  # Wait for all data gathering to complete
 ```
 If `best_results.json` exists, parse it to find failing examples (score < 0.7). Group by metadata or error pattern.
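
The failure-selection step described in the context line above (score < 0.7, grouped by metadata) could look roughly like the following. This is a minimal sketch, not code from the package: the diff does not show the schema of `best_results.json`, so the row fields `example_id`, `score`, and `metadata`, the `category` key, and the helper name `group_failures` are all hypothetical.

```python
from collections import defaultdict

FAIL_THRESHOLD = 0.7  # taken from the SKILL.md text: "score < 0.7"


def group_failures(results, key="category"):
    """Group the ids of failing rows by one metadata field.

    `results` is assumed (hypothetically) to be a list of dicts with
    "example_id", "score", and an optional "metadata" dict.
    """
    groups = defaultdict(list)
    for row in results:
        if row["score"] < FAIL_THRESHOLD:
            # Rows without metadata, or without the key, land in "unknown".
            label = (row.get("metadata") or {}).get(key, "unknown")
            groups[label].append(row["example_id"])
    return dict(groups)


if __name__ == "__main__":
    sample = [
        {"example_id": "a", "score": 0.2, "metadata": {"category": "math"}},
        {"example_id": "b", "score": 0.9, "metadata": {"category": "math"}},
        {"example_id": "c", "score": 0.5},
    ]
    print(group_failures(sample))
```

A score of exactly 0.7 is not treated as a failure here, which matches the strict `<` in the text.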
@@ -338,20 +330,23 @@ done
 Only run evaluation (Step 3) for proposers that committed changes (not abstained, not stuck).
-### 3. Run Target for Each Candidate
+### 3. Run Target for Each Candidate (Parallel)
-For each worktree that has changes (proposer committed something):
+Run evaluations for ALL candidates simultaneously — they're independent:
 ```bash
-# If PROJECT_DIR is set, resolve paths into the worktree subdirectory
-WORKTREE_PROJECT="{worktree_path}"
-[ -n "$PROJECT_DIR" ] && WORKTREE_PROJECT="{worktree_path}/{PROJECT_DIR}"
-$EVOLVER_PY $TOOLS/run_eval.py \
-    --config "$WORKTREE_PROJECT/.evolver.json" \
-    --worktree-path "$WORKTREE_PROJECT" \
-    --experiment-prefix v{NNN}-{lens_id} \
-    --timeout 120
+# Launch all evaluations in parallel
+for WORKTREE in {worktree_paths_with_commits}; do
+    WORKTREE_PROJECT="$WORKTREE"
+    [ -n "$PROJECT_DIR" ] && WORKTREE_PROJECT="$WORKTREE/$PROJECT_DIR"
+    $EVOLVER_PY $TOOLS/run_eval.py \
+        --config "$WORKTREE_PROJECT/.evolver.json" \
+        --worktree-path "$WORKTREE_PROJECT" \
+        --experiment-prefix v{NNN}-{lens_id} \
+        --timeout 120 &
+done
+wait  # Wait for all evaluations to complete
 ```
 Each candidate becomes a separate LangSmith experiment. This step runs the agent and applies code-based evaluators (has_output, token_efficiency) only.
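
Both rewritten steps (1.5 and 3) use the same shell pattern: start each job with `&`, then block on a bare `wait`. A bare `wait` returns 0 regardless of how the background jobs exited, so a failed evaluation is not visible from the exit status alone. The sketch below shows the same fan-out in Python while keeping each exit code. It illustrates the pattern only; `run_all` is a hypothetical helper, and the demo commands are stand-ins for the `run_eval.py` invocations.

```python
import subprocess
import sys


def run_all(commands):
    """Start every command at once, then block until all have exited.

    Returns the exit codes in the same order as `commands`.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]  # like `cmd &`
    return [proc.wait() for proc in procs]  # like `wait`, but keeps each code


if __name__ == "__main__":
    # Stand-in commands that exit with a chosen status code.
    demo = [
        [sys.executable, "-c", f"import sys; sys.exit({code})"]
        for code in (0, 0, 3)
    ]
    codes = run_all(demo)
    failed = [i for i, code in enumerate(codes) if code != 0]
    print(f"exit codes: {codes}, failed indexes: {failed}")
```

If the candidates run in parallel and each `run_eval.py` also uses a concurrency above 1, the number of simultaneous evaluations is presumably about the product of the two, which may matter for API rate limits.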

package/tools/run_eval.py CHANGED Viewed

@@ -166,11 +166,14 @@ def main():
     parser.add_argument("--worktree-path", required=True, help="Path to the candidate's worktree")
     parser.add_argument("--experiment-prefix", required=True, help="Experiment name prefix (e.g. v001a)")
     parser.add_argument("--timeout", type=int, default=120, help="Per-task timeout in seconds")
+    parser.add_argument("--concurrency", type=int, default=None, help="Max concurrent evaluations (default: from config or 1)")
     args = parser.parse_args()
     with open(args.config) as f:
         config = json.load(f)
+    concurrency = args.concurrency or config.get("eval_concurrency", 1)
     os.environ["EVAL_TASK_TIMEOUT"] = str(args.timeout)
     ensure_langsmith_api_key()
@@ -188,6 +191,8 @@ def main():
     print(f"  Dataset: {config['dataset']}")
     print(f"  Worktree: {args.worktree_path}")
     print(f"  Code evaluators: {['has_output'] + code_evaluators}")
+    if concurrency > 1:
+        print(f"  Concurrency: {concurrency} parallel evaluations")
     if llm_evaluators:
         print(f"  Pending LLM evaluators (agent): {llm_evaluators}")
@@ -197,7 +202,7 @@ def main():
             data=config["dataset"],
             evaluators=evaluators,
             experiment_prefix=args.experiment_prefix,
-            max_concurrency=1,
+            max_concurrency=concurrency,
         )
         experiment_name = results.experiment_name
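
The precedence introduced by `args.concurrency or config.get("eval_concurrency", 1)` can be checked in isolation. The sketch below reuses the flag name and config key from the diff; the helper names `parse_concurrency` and `resolve_concurrency` are hypothetical and exist only for this illustration.

```python
import argparse


def parse_concurrency(argv):
    """Parse only the --concurrency flag, with the same type and default as the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--concurrency", type=int, default=None)
    return parser.parse_args(argv).concurrency


def resolve_concurrency(cli_value, config):
    """Apply the diff's rule: CLI value first, then config, then 1."""
    # `or` treats None and 0 alike as "not set", so `--concurrency 0`
    # also falls back to the config value instead of being passed through.
    return cli_value or config.get("eval_concurrency", 1)


if __name__ == "__main__":
    config = {"eval_concurrency": 4}
    for argv in ([], ["--concurrency", "2"], ["--concurrency", "0"]):
        print(argv, "->", resolve_concurrency(parse_concurrency(argv), config))
```

Because the fallback default is 1, configs without an `eval_concurrency` key keep the behavior of 4.3.0, where `max_concurrency` was hardcoded to 1.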