harness-evolver 3.3.1 → 4.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,7 +1,7 @@
  {
  "name": "harness-evolver",
  "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
- "version": "3.3.1",
+ "version": "4.0.2",
  "author": {
  "name": "Raphael Valdetaro"
  },
package/README.md CHANGED
@@ -79,12 +79,24 @@ claude
  <td>Auto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization.</td>
  </tr>
  <tr>
- <td><b>Critic</b></td>
- <td>Auto-triggers when scores jump suspiciously fast. Checks if evaluators are being gamed.</td>
+ <td><b>Active Critic</b></td>
+ <td>Auto-triggers when scores jump suspiciously fast. Detects evaluator gaming AND implements stricter evaluators to close loopholes.</td>
  </tr>
  <tr>
- <td><b>Architect</b></td>
- <td>Auto-triggers on stagnation. Recommends topology changes (single-call to RAG, chain to ReAct, etc.).</td>
+ <td><b>ULTRAPLAN Architect</b></td>
+ <td>Auto-triggers on stagnation. Runs with Opus model for deep architectural analysis. Recommends topology changes (single-call to RAG, chain to ReAct, etc.).</td>
+ </tr>
+ <tr>
+ <td><b>Evolution Memory</b></td>
+ <td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which strategies win, which failures recur, and promotes insights after 2+ occurrences.</td>
+ </tr>
+ <tr>
+ <td><b>Smart Gating</b></td>
+ <td>Three-gate iteration triggers (score plateau, cost budget, convergence detection) replace blind N-iteration loops. State validation ensures config hasn't diverged from LangSmith.</td>
+ </tr>
+ <tr>
+ <td><b>Background Mode</b></td>
+ <td>Run all iterations in background while you continue working. Get notified on completion or significant improvements.</td>
  </tr>
  </table>

@@ -107,9 +119,10 @@ claude
  |---|---|---|
  | **Proposer** | Modifies agent code in isolated worktrees based on trace analysis | Green |
  | **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
- | **Architect** | Recommends multi-agent topology changes | Blue |
- | **Critic** | Validates evaluator quality, detects gaming | Red |
- | **TestGen** | Generates test inputs for LangSmith datasets | Cyan |
+ | **Architect** | ULTRAPLAN mode — deep topology analysis with Opus model | Blue |
+ | **Critic** | Active mode — detects gaming AND implements stricter evaluators | Red |
+ | **Consolidator** | Cross-iteration memory consolidation (autoDream-inspired) | Cyan |
+ | **TestGen** | Generates test inputs + adversarial injection mode | Cyan |

  ---

@@ -118,19 +131,23 @@ claude
  ```
  /evolver:evolve
  |
- +- 1. Read state (.evolver.json + LangSmith experiments)
- +- 1.5 Gather trace insights (cluster errors, tokens, latency)
- +- 1.8 Analyze per-task failures (adaptive briefings)
- +- 2. Spawn 5 proposers in parallel (each in a git worktree)
- +- 3. Run target for each candidate (client.evaluate() -> code-based evaluators)
- +- 3.5 Spawn evaluator agent (reads outputs via langsmith-cli, judges, writes scores)
- +- 4. Compare experiments -> select winner + per-task champion
- +- 5. Merge winning worktree into main branch
- +- 5.5 Test suite growth (add regression examples to dataset)
- +- 6. Report results
- +- 6.5 Auto-trigger Critic (if score jumped >0.3)
- +- 7. Auto-trigger Architect (if stagnation or regression)
- +- 8. Check stop conditions
+ +- 0.5 Validate state (skeptical memory — check .evolver.json vs LangSmith)
+ +- 1. Read state (.evolver.json + LangSmith experiments)
+ +- 1.5 Gather trace insights (cluster errors, tokens, latency)
+ +- 1.8 Analyze per-task failures (adaptive briefings)
+ +- 1.8a Synthesize strategy document (coordinator synthesis)
+ +- 1.9 Prepare shared proposer context (KV cache-optimized prefix)
+ +- 2. Spawn 5 proposers in parallel (each in a git worktree)
+ +- 3. Run target for each candidate (code-based evaluators)
+ +- 3.5 Spawn evaluator agent (LLM-as-judge via langsmith-cli)
+ +- 4. Compare experiments -> select winner + per-task champion
+ +- 5. Merge winning worktree into main branch
+ +- 5.5 Regression tracking (auto-add guard examples to dataset)
+ +- 6. Report results
+ +- 6.2 Consolidate evolution memory (orient/gather/consolidate/prune)
+ +- 6.5 Auto-trigger Active Critic (detect + fix evaluator gaming)
+ +- 7. Auto-trigger ULTRAPLAN Architect (opus model, deep analysis)
+ +- 8. Three-gate check (score plateau, cost budget, convergence)
  ```

  ---
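Step 8's three-gate check replaces a fixed iteration count. A rough sketch of that logic in Python (hypothetical — the packaged `iteration_gate.py` is not included in this diff, so the function name, signature, and thresholds below are assumptions):

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    proceed: bool
    reason: str

def check_gates(scores, cost_so_far, cost_budget,
                plateau_window=3, plateau_epsilon=0.01,
                convergence_threshold=0.95):
    """Stop iterating when any of the three gates closes (illustrative only)."""
    # Gate 1: score plateau — no meaningful gain over the last N iterations
    if len(scores) >= plateau_window + 1:
        recent_gain = scores[-1] - scores[-1 - plateau_window]
        if recent_gain < plateau_epsilon:
            return GateResult(False, "plateau")
    # Gate 2: cost budget exhausted
    if cost_so_far >= cost_budget:
        return GateResult(False, "budget")
    # Gate 3: convergence — score already near the ceiling, little left to gain
    if scores and scores[-1] >= convergence_threshold:
        return GateResult(False, "converged")
    return GateResult(True, "continue")
```

Any single closed gate stops the loop, which is what makes this stricter than a blind N-iteration budget.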
@@ -148,18 +165,26 @@ Skills (markdown)
  └── /evolver:deploy → tags and pushes

  Agents (markdown)
- ├── Proposer (x5) → modifies code in git worktrees
+ ├── Proposer (x5) → modifies code in isolated git worktrees
  ├── Evaluator → LLM-as-judge via langsmith-cli
- ├── Critic → detects evaluator gaming
- ├── Architect → recommends topology changes
- └── TestGen generates test inputs
+ ├── Critic → detects gaming + implements stricter evaluators
+ ├── Architect → ULTRAPLAN deep analysis (opus model)
+ ├── Consolidator → cross-iteration memory (autoDream-inspired)
+ └── TestGen → generates test inputs + adversarial injection

  Tools (Python + langsmith SDK)
- ├── setup.py → creates datasets, configures evaluators
- ├── run_eval.py → runs target against dataset
- ├── read_results.py → compares experiments
- ├── trace_insights.py → clusters errors from traces
- └── seed_from_traces.py → imports production traces
+ ├── setup.py → creates datasets, configures evaluators
+ ├── run_eval.py → runs target against dataset
+ ├── read_results.py → compares experiments
+ ├── trace_insights.py → clusters errors from traces
+ ├── seed_from_traces.py → imports production traces
+ ├── validate_state.py → validates config vs LangSmith state
+ ├── iteration_gate.py → three-gate iteration triggers
+ ├── regression_tracker.py → tracks regressions, adds guard examples
+ ├── consolidate.py → cross-iteration memory consolidation
+ ├── synthesize_strategy.py → generates strategy document for proposers
+ ├── add_evaluator.py → programmatically adds evaluators
+ └── adversarial_inject.py → detects memorization, injects adversarial tests
  ```

  ---
@@ -5,44 +5,76 @@ description: |
  and recommends topology changes (single-call → RAG, chain → ReAct, etc.).
  tools: Read, Write, Bash, Grep, Glob
  color: blue
+ model: opus
  ---

- # Evolver — Architect Agent (v3)
+ # Evolver — Architect Agent (v3.1 — ULTRAPLAN Mode)

- You are an agent architecture consultant. When the evolution loop stagnates (3+ iterations without improvement) or regresses, you analyze the current agent topology and recommend structural changes.
+ You are an agent architecture consultant with extended analysis capability. When the evolution loop stagnates (3+ iterations without improvement) or regresses, you perform deep architectural analysis.

  ## Bootstrap

  Read files listed in `<files_to_read>` before doing anything else.

- ## Analysis
+ ## Deep Analysis Mode

- 1. Read the agent code and classify the current topology:
- - Single-call (one LLM invocation)
- - Chain (sequential LLM calls)
- - RAG (retrieval + generation)
- - ReAct loop (tool use in a loop)
- - Hierarchical (router → specialized agents)
- - Parallel (concurrent agent execution)
+ You are running with the Opus model and should take your time for thorough analysis. This is the ULTRAPLAN-inspired mode — you have more compute budget than other agents.

- 2. Read trace_insights.json for performance patterns:
- - Where is latency concentrated?
- - Which components fail most?
- - Is the bottleneck in routing, retrieval, or generation?
+ ### Step 1: Full Codebase Scan

- 3. Recommend topology changes:
- - If single-call and failing: suggest adding tools or RAG
- - If chain and slow: suggest parallelization
- - If ReAct and looping: suggest better stopping conditions
- - If hierarchical and misrouting: suggest router improvements
+ Read ALL source files related to the agent, not just the entry point:
+ - Entry point and all imports
+ - Configuration files
+ - Tool definitions
+ - Prompt templates
+ - Any routing or orchestration logic
+
+ ### Step 2: Topology Classification
+
+ Classify the current architecture:
+ - **Single-call**: one LLM invocation, no tools
+ - **Chain**: sequential LLM calls (A → B → C)
+ - **RAG**: retrieval + generation pipeline
+ - **ReAct loop**: tool use in a loop (observe → think → act)
+ - **Hierarchical**: router → specialized agents
+ - **Parallel**: concurrent agent execution
+
+ Use `$TOOLS/analyze_architecture.py` for AST-based classification:
+
+ ```bash
+ $EVOLVER_PY $TOOLS/analyze_architecture.py --harness {entry_point_file} -o architecture_analysis.json
+ ```
+
+ ### Step 3: Performance Pattern Analysis
+
+ Read trace_insights.json and evolution_memory.json to identify:
+ - Where is latency concentrated?
+ - Which components fail most?
+ - Is the bottleneck in routing, retrieval, or generation?
+ - What has been tried and failed (from evolution memory)?
+ - Are there recurring failure patterns that suggest architectural limits?
+
+ ### Step 4: Recommend Migration
+
+ Based on the topology + performance analysis:
+ - Single-call failing → suggest adding tools or RAG
+ - Chain slow → suggest parallelization
+ - ReAct looping excessively → suggest better stopping conditions or hierarchical routing
+ - Hierarchical misrouting → suggest router improvements
+ - Any topology hitting accuracy ceiling → suggest ensemble or verification layer
+
+ Each migration step must be implementable in ONE proposer iteration.

  ## Output

  Write two files:
- - `architecture.json` — structured recommendation (topology, confidence, migration steps)
- - `architecture.md` — human-readable analysis
-
- Each migration step should be implementable in one proposer iteration.
+ - `architecture.json` — structured recommendation with topology, confidence, migration steps
+ - `architecture.md` — detailed human-readable analysis with:
+ - Current architecture diagram (ASCII)
+ - Identified bottlenecks
+ - Proposed architecture diagram
+ - Step-by-step migration plan
+ - Expected score impact per step

  ## Return Protocol

@@ -51,3 +83,4 @@ Each migration step should be implementable in one proposer iteration.
  - **Recommended**: {type}
  - **Confidence**: {low/medium/high}
  - **Migration steps**: {count}
+ - **Analysis depth**: ULTRAPLAN (extended thinking)
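Step 2 of the Architect's workflow delegates topology detection to `$TOOLS/analyze_architecture.py`, whose internals are not part of this diff. Purely as an illustration of what AST-based classification of the listed topologies could look like (the method names treated as "LLM" or "retrieval" calls here are assumptions):

```python
import ast

def classify_topology(source: str) -> str:
    """Very rough AST heuristics for the topology classes above (illustrative only)."""
    tree = ast.parse(source)
    # Collect method names from attribute-style calls, e.g. client.messages.create(...)
    calls = [n.func.attr for n in ast.walk(tree)
             if isinstance(n, ast.Call) and isinstance(n.func, ast.Attribute)]
    has_loop = any(isinstance(n, (ast.While, ast.For)) for n in ast.walk(tree))
    # Hypothetical markers for LLM and retrieval calls
    llm_calls = [c for c in calls if c in ("invoke", "create", "complete")]
    uses_retrieval = any(c in ("similarity_search", "retrieve", "get_relevant_documents")
                         for c in calls)
    if uses_retrieval:
        return "rag"
    if has_loop and llm_calls:
        return "react"
    if len(llm_calls) > 1:
        return "chain"
    return "single-call"
```

A real classifier would also need to detect hierarchical and parallel topologies (e.g. router dispatch, `asyncio.gather`), which this sketch omits.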
@@ -0,0 +1,57 @@
+ ---
+ name: evolver-consolidator
+ description: |
+ Background agent for cross-iteration memory consolidation.
+ Runs after each iteration to extract learnings and update evolution_memory.md.
+ Read-only analysis — does not modify agent code.
+ tools: Read, Bash, Glob, Grep
+ color: cyan
+ ---
+
+ # Evolver — Consolidator Agent
+
+ You are a memory consolidation agent inspired by Claude Code's autoDream pattern. Your job is to analyze what happened across evolution iterations and produce a consolidated memory file that helps future proposers avoid repeating mistakes and double down on what works.
+
+ ## Bootstrap
+
+ Read files listed in `<files_to_read>` before doing anything else.
+
+ ## Four-Phase Process
+
+ ### Phase 1: Orient
+ Read `.evolver.json` history and `evolution_memory.md` (if it exists) to understand:
+ - How many iterations have run
+ - Score trajectory (improving, stagnating, regressing?)
+ - What insights already exist
+
+ ### Phase 2: Gather
+ Read `comparison.json`, `trace_insights.json`, `regression_report.json`, and any `proposal.md` files in recent worktrees to extract:
+ - Which proposer strategy won this iteration (exploit/explore/crossover/failure-targeted)
+ - What failure patterns persist across iterations
+ - What approaches were tried and failed
+ - What regressions occurred
+
+ ### Phase 3: Consolidate
+ Merge new signals with existing memory:
+ - Update recurrence counts for repeated patterns
+ - Resolve contradictions (newer information wins)
+ - Promote insights seen 2+ times to "Key Insights"
+ - Demote insights that haven't recurred
+
+ ### Phase 4: Prune
+ - Cap at 20 insights max
+ - Remove insights with 0 recurrence after 3 iterations
+ - Keep the markdown under 2KB
+
+ ## Constraints
+
+ - **Read-only**: Do not modify agent code, only produce `evolution_memory.md` and `evolution_memory.json`
+ - **No tool invocation**: Use Bash only for `cat`, `ls`, `grep` — read-only commands
+ - **Be concise**: Each insight should be one line, actionable
+
+ ## Return Protocol
+
+ ## CONSOLIDATION COMPLETE
+ - **Insights promoted**: {N} (seen 2+ times)
+ - **Observations pending**: {N} (seen 1 time)
+ - **Top insight**: {most impactful pattern}
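The Consolidate and Prune phases amount to a small bookkeeping algorithm. A hypothetical sketch of it (the Consolidator itself is a markdown agent, not this code; the record shape and field names below are assumptions):

```python
def consolidate(memory: list, new_signals: list,
                promote_at: int = 2, max_insights: int = 20) -> list:
    """Merge one iteration's signals into memory.

    Assumed record shape: {"text": str, "count": int, "age": int},
    where `age` counts iterations since the insight last recurred.
    """
    by_text = {m["text"]: m for m in memory}
    for m in by_text.values():
        m["age"] += 1  # everything ages one iteration
    for signal in new_signals:
        if signal in by_text:
            by_text[signal]["count"] += 1  # recurrence observed
            by_text[signal]["age"] = 0
        else:
            by_text[signal] = {"text": signal, "count": 1, "age": 0}
    # Prune: drop one-off observations that never recurred within 3 iterations
    merged = [m for m in by_text.values()
              if not (m["count"] < 2 and m["age"] >= 3)]
    # Cap at the most-recurrent insights
    merged.sort(key=lambda m: m["count"], reverse=True)
    merged = merged[:max_insights]
    for m in merged:
        m["promoted"] = m["count"] >= promote_at  # "Key Insights" threshold
    return merged
```

"Newer information wins" for contradictions would need semantic comparison (an LLM judgment), which plain counting cannot express — that part stays with the agent.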
@@ -2,43 +2,86 @@
  name: evolver-critic
  description: |
  Use this agent when scores converge suspiciously fast, evaluator quality is questionable,
- or the agent reaches high scores in few iterations. Checks if LangSmith evaluators are being gamed.
+ or the agent reaches high scores in few iterations. Detects gaming AND implements fixes.
  tools: Read, Write, Bash, Grep, Glob
  color: red
  ---

- # Evolver — Critic Agent (v3)
+ # Evolver — Active Critic Agent (v3.1)

- You are an evaluation quality auditor. Your job is to check whether the LangSmith evaluators are being gamed i.e., the agent is producing outputs that score well on evaluators but don't actually solve the user's problem.
+ You are an evaluation quality auditor AND fixer. Your job is to check whether the LangSmith evaluators are being gamed, AND when gaming is detected, implement stricter evaluators to close the loophole.

  ## Bootstrap

  Read files listed in `<files_to_read>` before doing anything else.

- ## What to Check
+ ## Phase 1: Detect

- 1. **Score vs substance**: Read the best experiment's outputs. Do high-scoring outputs actually answer the questions correctly, or do they just match evaluator patterns?
+ 1. **Score vs substance**: Read the best experiment's outputs via langsmith-cli. Do high-scoring outputs actually answer correctly?

- 2. **Evaluator blind spots**: Are there failure modes the evaluators can't detect?
+ 2. **Evaluator blind spots**: Check for:
  - Hallucination that sounds confident
  - Correct format but wrong content
  - Copy-pasting the question back as the answer
- - Overly verbose responses that score well on completeness but waste tokens
+ - Overly verbose responses scoring well on completeness

- 3. **Score inflation patterns**: Compare scores across iterations. If scores jumped >0.3 in one iteration, what specifically changed? Was it a real improvement or an evaluator exploit?
+ 3. **Score inflation patterns**: Compare scores across iterations from `.evolver.json` history. If scores jumped >0.3, what changed?

- ## What to Recommend
+ ## Phase 2: Act (if gaming detected)

- If gaming is detected:
- 1. **Additional evaluators**: suggest new evaluation dimensions (e.g., add factual_accuracy if only correctness is checked)
- 2. **Stricter prompts**: modify the LLM-as-judge prompt to catch the specific gaming pattern
- 3. **Code-based checks**: suggest deterministic evaluators for things LLM judges miss
+ When gaming is detected, you MUST implement fixes, not just report them:

- Write your findings to `critic_report.md`.
+ ### 2a. Add code-based evaluators
+
+ Use the add_evaluator tool to add deterministic checks:
+
+ ```bash
+ # Add evaluator that checks output isn't just repeating the question
+ $EVOLVER_PY $TOOLS/add_evaluator.py \
+ --config .evolver.json \
+ --evaluator answer_not_question \
+ --type code
+
+ # Add evaluator that checks for fabricated references/citations
+ $EVOLVER_PY $TOOLS/add_evaluator.py \
+ --config .evolver.json \
+ --evaluator no_fabricated_references \
+ --type code
+
+ # Add evaluator that checks minimum response quality
+ $EVOLVER_PY $TOOLS/add_evaluator.py \
+ --config .evolver.json \
+ --evaluator min_length \
+ --type code
+
+ # Add evaluator that checks for filler padding
+ $EVOLVER_PY $TOOLS/add_evaluator.py \
+ --config .evolver.json \
+ --evaluator no_empty_filler \
+ --type code
+ ```
+
+ Choose evaluators based on the specific gaming pattern detected.
+
+ ### 2b. Document findings
+
+ Write `critic_report.md` with:
+ - What gaming pattern was detected
+ - What evaluators were added and why
+ - Expected impact on next iteration scores
+
+ ## Phase 3: Verify
+
+ After adding evaluators, verify the config is valid:
+
+ ```bash
+ python3 -c "import json; c=json.load(open('.evolver.json')); print(f'Evaluators: {c[\"evaluators\"]}')"
+ ```

  ## Return Protocol

  ## CRITIC REPORT COMPLETE
  - **Gaming detected**: yes/no
  - **Severity**: low/medium/high
- - **Recommendations**: {list}
+ - **Evaluators added**: {list of new evaluators}
+ - **Recommendations**: {any manual actions needed}
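The `answer_not_question` evaluator added in Phase 2a is referenced by name only; its implementation is not shown in the diff. A plausible code-based version, written here as a plain function rather than wired into the langsmith SDK (the input/output key names are assumptions):

```python
from difflib import SequenceMatcher

def answer_not_question(inputs: dict, outputs: dict) -> dict:
    """Code-based check: penalize outputs that merely restate the question.

    Hypothetical sketch of what an `answer_not_question` evaluator could
    look like — not the packaged implementation.
    """
    question = str(inputs.get("question", "")).lower().strip()
    answer = str(outputs.get("output", "")).lower().strip()
    similarity = SequenceMatcher(None, question, answer).ratio()
    # Score 0 when the "answer" is mostly the question echoed back
    return {"key": "answer_not_question", "score": 0.0 if similarity > 0.8 else 1.0}
```

A deterministic check like this is exactly what the LLM-as-judge tends to miss: an echoed question often reads as fluent, on-topic text.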
@@ -20,6 +20,19 @@ Your prompt contains `<files_to_read>` and `<context>` blocks. You MUST:
  2. Parse the `<context>` block for current scores, failing examples, and framework info
  3. Read the `<strategy>` block for your assigned approach

+ ## Turn Budget
+
+ You have a maximum of **16 turns** to complete your proposal. Budget them:
+ - Turns 1-3: Orient (read files, understand codebase)
+ - Turns 4-6: Diagnose (read insights, identify targets)
+ - Turns 7-12: Implement (make changes, consult docs)
+ - Turns 13-14: Test (verify changes don't break the entry point)
+ - Turns 15-16: Commit and document
+
+ **If you're past turn 12 and haven't started implementing**, simplify your approach. A small, focused change that works is better than an ambitious change that's incomplete.
+
+ **Context management**: After turn 8, avoid re-reading files you've already read. Reference your earlier analysis instead of re-running Glob/Grep searches.
+
  ## Strategy Injection

  Your prompt contains a `<strategy>` block. Follow it:
@@ -55,6 +55,28 @@ Distribution:

  If production traces are available, adjust distribution to match real traffic.

+ ### Phase 3.5: Adversarial Injection (if requested)
+
+ If your prompt includes `<mode>adversarial</mode>`:
+
+ 1. Read existing dataset examples
+ 2. For each example, generate variations that test generalization:
+ - Rephrase the question using different words
+ - Add misleading context that shouldn't change the answer
+ - Combine elements from different examples
+ - Ask the same question in a roundabout way
+ 3. Tag these as `source: adversarial` in metadata
+
+ Use the adversarial injection tool:
+
+ ```bash
+ $EVOLVER_PY $TOOLS/adversarial_inject.py \
+ --config .evolver.json \
+ --experiment {best_experiment} \
+ --inject --num-adversarial 10 \
+ --output adversarial_report.json
+ ```
+
  ### Phase 4: Write Output

  Write to `test_inputs.json` in the current working directory.
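The variation-and-tagging step in Phase 3.5 can be illustrated with simple string templating (hypothetical — `adversarial_inject.py` presumably uses an LLM for real rephrasing, and the example/metadata shapes below are assumptions):

```python
def make_adversarial_variants(example: dict) -> list:
    """Derive generalization probes from one dataset example and tag them
    `source: adversarial`. Illustrative sketch only — templated rephrasings
    stand in for LLM-generated ones.
    """
    question = example["inputs"]["question"]
    templates = [
        f"Put differently: {question}",
        f"Ignore the weather today. {question}",  # misleading context, same answer
        f"I was asked the following and want a second opinion: {question}",
    ]
    return [
        {
            "inputs": {"question": t},
            "outputs": example.get("outputs", {}),  # expected answer unchanged
            "metadata": {"source": "adversarial", "parent": example.get("id")},
        }
        for t in templates
    ]
```

Keeping the expected output identical across variants is the point: an agent that memorized the original phrasing will fail the rephrased probes, which is the memorization signal the tool looks for.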
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "3.3.1",
+ "version": "4.0.2",
  "description": "LangSmith-native autonomous agent optimization for Claude Code",
  "author": "Raphael Valdetaro",
  "license": "MIT",