harness-evolver 4.0.3 → 4.2.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +1 -1
- package/README.md +11 -10
- package/agents/evolver-proposer.md +45 -47
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +173 -65
- package/tools/__pycache__/adversarial_inject.cpython-313.pyc +0 -0
- package/tools/__pycache__/regression_tracker.cpython-313.pyc +0 -0
- package/tools/__pycache__/setup.cpython-313.pyc +0 -0
- package/tools/adversarial_inject.py +8 -3
- package/tools/consolidate.py +7 -15
- package/tools/dataset_health.py +385 -0
- package/tools/read_results.py +21 -2
- package/tools/regression_tracker.py +17 -4
- package/tools/setup.py +23 -0
- package/tools/synthesize_strategy.py +138 -2
- package/tools/trace_insights.py +7 -1
package/.claude-plugin/plugin.json
CHANGED

@@ -1,7 +1,7 @@
 {
   "name": "harness-evolver",
   "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
-  "version": "4.0.3",
+  "version": "4.2.0",
   "author": {
     "name": "Raphael Valdetaro"
   },
package/README.md
CHANGED

@@ -67,8 +67,8 @@ claude
 <td>Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically.</td>
 </tr>
 <tr>
-<td><b>
-<td>Each iteration
+<td><b>Self-Organizing Proposers</b></td>
+<td>Each iteration generates dynamic investigation lenses from failure data, architecture analysis, production traces, and evolution memory. Proposers self-organize their approach — no fixed strategies. They can self-abstain when their contribution would be redundant. Inspired by <a href="https://arxiv.org/abs/2603.28990">Dochkina (2026)</a>.</td>
 </tr>
 <tr>
 <td><b>Agent-Based Evaluation</b></td>

@@ -88,7 +88,7 @@ claude
 </tr>
 <tr>
 <td><b>Evolution Memory</b></td>
-<td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which
+<td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which approaches win, which failures recur, and promotes insights after 2+ occurrences.</td>
 </tr>
 <tr>
 <td><b>Smart Gating</b></td>

@@ -107,7 +107,7 @@ claude
 | Command | What it does |
 |---|---|
 | `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |
-| `/evolver:evolve` | Run the optimization loop (
+| `/evolver:evolve` | Run the optimization loop (dynamic self-organizing proposers in worktrees) |
 | `/evolver:status` | Show progress, scores, history |
 | `/evolver:deploy` | Tag, push, clean up temporary files |

@@ -117,7 +117,7 @@ claude

 | Agent | Role | Color |
 |---|---|---|
-| **Proposer** |
+| **Proposer** | Self-organizing — investigates a data-driven lens, decides own approach, may abstain | Green |
 | **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
 | **Architect** | ULTRAPLAN mode — deep topology analysis with Opus model | Blue |
 | **Critic** | Active — detects gaming AND implements stricter evaluators | Red |

@@ -134,10 +134,10 @@ claude
 +- 0.5 Validate state (skeptical memory — check .evolver.json vs LangSmith)
 +- 1. Read state (.evolver.json + LangSmith experiments)
 +- 1.5 Gather trace insights (cluster errors, tokens, latency)
-+- 1.8 Analyze per-task failures
-+- 1.8a Synthesize strategy document (
++- 1.8 Analyze per-task failures
++- 1.8a Synthesize strategy document + dynamic lenses (investigation questions)
 +- 1.9 Prepare shared proposer context (KV cache-optimized prefix)
-+- 2. Spawn
++- 2. Spawn N self-organizing proposers in parallel (each in a git worktree)
 +- 3. Run target for each candidate (code-based evaluators)
 +- 3.5 Spawn evaluator agent (LLM-as-judge via langsmith-cli)
 +- 4. Compare experiments -> select winner + per-task champion

@@ -165,7 +165,7 @@ Skills (markdown)
 └── /evolver:deploy → tags and pushes

 Agents (markdown)
-├── Proposer (
+├── Proposer (xN) → self-organizing, lens-driven, isolated git worktrees
 ├── Evaluator → LLM-as-judge via langsmith-cli
 ├── Critic → detects gaming + implements stricter evaluators
 ├── Architect → ULTRAPLAN deep analysis (opus model)

@@ -182,7 +182,7 @@ Tools (Python + langsmith SDK)
 ├── iteration_gate.py → three-gate iteration triggers
 ├── regression_tracker.py → tracks regressions, adds guard examples
 ├── consolidate.py → cross-iteration memory consolidation
-├── synthesize_strategy.py→ generates strategy document
+├── synthesize_strategy.py→ generates strategy document + investigation lenses
 ├── add_evaluator.py → programmatically adds evaluators
 └── adversarial_inject.py → detects memorization, injects adversarial tests
 ```

@@ -217,6 +217,7 @@ LangSmith traces **any** AI framework. The evolver works with all of them:
 ## References

 - [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
+- [Drop the Hierarchy and Roles: How Self-Organizing LLM Agents Outperform Designed Structures](https://arxiv.org/abs/2603.28990) — Dochkina, 2026
 - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI
 - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
 - [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation) — LangChain
package/agents/evolver-proposer.md
CHANGED

@@ -1,75 +1,56 @@
 ---
 name: evolver-proposer
 description: |
-
-
-
+  Self-organizing agent optimizer. Investigates a data-driven lens (question),
+  decides its own approach, and modifies real code in an isolated git worktree.
+  May self-abstain if it cannot add meaningful value.
 tools: Read, Write, Edit, Bash, Glob, Grep
 color: green
 permissionMode: acceptEdits
 ---

-# Evolver — Proposer
+# Evolver — Self-Organizing Proposer (v4)

-You are an LLM agent optimizer. Your job is to
+You are an LLM agent optimizer. Your job is to improve the user's agent code to score higher on the evaluation dataset. You work in an **isolated git worktree** — you can modify any file freely without affecting the main branch.

 ## Bootstrap

-Your prompt contains `<files_to_read
+Your prompt contains `<files_to_read>`, `<context>`, and `<lens>` blocks. You MUST:
 1. Read every file listed in `<files_to_read>` using the Read tool
 2. Parse the `<context>` block for current scores, failing examples, and framework info
-3. Read the `<
+3. Read the `<lens>` block — this is your investigation starting point

 ## Turn Budget

-You have a maximum of **16 turns
--
--
--
-- Turns 13-14: Test (verify changes don't break the entry point)
-- Turns 15-16: Commit and document
+You have a maximum of **16 turns**. You decide how to allocate them. General guidance:
+- Spend early turns reading context and investigating your lens question
+- Spend middle turns implementing changes and consulting documentation
+- Reserve final turns for committing and writing proposal.md

 **If you're past turn 12 and haven't started implementing**, simplify your approach. A small, focused change that works is better than an ambitious change that's incomplete.

 **Context management**: After turn 8, avoid re-reading files you've already read. Reference your earlier analysis instead of re-running Glob/Grep searches.

-##
+## Lens Protocol

-Your prompt contains a `<
-- **exploitation**: Conservative fix on current best. Focus on specific failing examples.
-- **exploration**: Bold, fundamentally different approach. Change algorithms, prompts, routing.
-- **crossover**: Combine strengths from previous iterations. Check git log for recent changes.
-- **failure-targeted**: Fix SPECIFIC failing examples listed in the strategy. Analyze WHY they fail.
-- **creative**: Try something unexpected — different libraries, architecture, algorithms.
-- **efficiency**: Same quality but fewer tokens, faster latency, simpler code.
+Your prompt contains a `<lens>` block with an **investigation question**. This is your starting point, not your mandate.

-
+1. **Investigate** — dig into the data relevant to the lens question (trace insights, failing examples, code)
+2. **Hypothesize** — form your own theory about what to change
+3. **Decide** — choose your approach freely. You may end up solving something completely different from what the lens asks. That's fine.
+4. **Implement or Abstain** — if you can add meaningful value, implement and commit. If not, abstain.

-
-
-### Phase 1: Orient
-
-Read .evolver.json to understand:
-- What framework is this? (LangGraph, CrewAI, OpenAI SDK, etc.)
-- What's the entry point?
-- What evaluators are active? (correctness, conciseness, latency, etc.)
-- What's the current best score?
+You are NOT constrained to the lens topic. The lens gives you a starting perspective. Your actual approach is yours to decide.

-
+## Your Workflow

-
-- Which examples are failing and why?
-- What error patterns exist?
-- Are there token/latency issues?
+There are no fixed phases. Use your judgment to allocate turns. A typical flow:

-
-- What do real user inputs look like?
-- What are the common error patterns in production?
-- Which query types get the most traffic?
+**Orient** — Read .evolver.json, strategy.md, evolution_memory.md. Understand the framework, entry point, evaluators, current score, and what has been tried before.

-
+**Investigate** — Read trace_insights.json and best_results.json. Understand which examples fail and why. If production_seed.json exists, understand real-world usage patterns. Focus on data relevant to your lens question.

-Based on
+**Decide** — Based on investigation, decide what to change. Consider:
 - **Prompts**: system prompts, few-shot examples, output format instructions
 - **Routing**: how queries are dispatched to different handlers
 - **Tools**: tool definitions, tool selection logic

@@ -77,7 +58,23 @@ Based on your strategy and diagnosis, modify the code:
 - **Error handling**: retry logic, fallback strategies, timeout handling
 - **Model selection**: which model for which task

-
+## Self-Abstention
+
+If after investigating your lens you conclude you cannot add meaningful value, you may **abstain**. This is a valued contribution — it saves evaluation tokens and signals confidence that the current code handles the lens topic adequately.
+
+To abstain, skip implementation and write only a `proposal.md`:
+
+```
+## ABSTAIN
+- **Lens**: {the question you investigated}
+- **Finding**: {what you discovered during investigation}
+- **Reason**: {why you're abstaining}
+- **Suggested focus**: {optional — what future iterations should look at}
+```
+
+Then end with the return protocol using `ABSTAIN` as your approach.
+
+### Consult Documentation (MANDATORY)

 **Before writing ANY code**, you MUST consult Context7 for every library you'll be modifying or using. This is NOT optional.

@@ -106,7 +103,7 @@ Ask about the SPECIFIC API you're going to use or change.

 **If Context7 MCP is not available:** Note in proposal.md "API patterns not verified against current docs — verify before deploying."

-###
+### Commit and Document

 1. **Commit all changes** with a descriptive message:
 ```bash

@@ -143,7 +140,7 @@ Prioritize changes that fix real production failures over synthetic test failure
 ## Rules

 1. **Read before writing** — understand the code before changing it
-2. **
+2. **Focused changes** — change what's needed based on your investigation. Don't scatter changes across unrelated files.
 3. **Don't break the interface** — the agent must still be runnable with the same command
 4. **Commit your changes** — uncommitted changes are lost when the worktree is cleaned up
 5. **Write proposal.md** — the evolve skill reads this to understand what you did

@@ -153,8 +150,9 @@ Prioritize changes that fix real production failures over synthetic test failure
 When done, end your response with:

 ## PROPOSAL COMPLETE
-- **Version**: v{NNN}{
-- **
+- **Version**: v{NNN}-{id}
+- **Lens**: {the investigation question}
+- **Approach**: {what you chose to do and why — free text, your own words}
 - **Changes**: {brief list of files changed}
 - **Expected impact**: {which evaluators/examples should improve}
 - **Files modified**: {count}
package/package.json
CHANGED

package/skills/evolve/SKILL.md
CHANGED

@@ -127,6 +127,122 @@ If critical issues found, ask user whether to continue or fix first via AskUserQ
 - "Fix and retry" — attempt auto-fix with `--fix` flag
 - "Abort" — stop the evolution loop

+### 0.6. Dataset Health Check
+
+Run the dataset health diagnostic:
+
+```bash
+$EVOLVER_PY $TOOLS/dataset_health.py \
+  --config .evolver.json \
+  --production-seed production_seed.json \
+  --output health_report.json 2>/dev/null
+```
+
+Read `health_report.json`. Print summary:
+```bash
+python3 -c "
+import json, os
+if os.path.exists('health_report.json'):
+    r = json.load(open('health_report.json'))
+    print(f'Dataset Health: {r[\"health_score\"]}/10 ({r[\"example_count\"]} examples)')
+    for issue in r.get('issues', []):
+        print(f'  [{issue[\"severity\"]}] {issue[\"message\"]}')
+"
+```
+
+### 0.7. Auto-Correct Dataset Issues
+
+If `health_report.json` has corrections, apply them automatically:
+
+```bash
+CORRECTIONS=$(python3 -c "
+import json, os
+if os.path.exists('health_report.json'):
+    r = json.load(open('health_report.json'))
+    for c in r.get('corrections', []):
+        print(c['action'])
+" 2>/dev/null)
+```
+
+For each correction:
+
+**If `create_splits`**: Run inline Python to assign 70/30 splits:
+```bash
+$EVOLVER_PY -c "
+from langsmith import Client
+import json, random
+client = Client()
+config = json.load(open('.evolver.json'))
+examples = list(client.list_examples(dataset_name=config['dataset']))
+random.shuffle(examples)
+sp = int(len(examples) * 0.7)
+for ex in examples[:sp]:
+    client.update_example(ex.id, split='train')
+for ex in examples[sp:]:
+    client.update_example(ex.id, split='held_out')
+print(f'Assigned splits: {sp} train, {len(examples)-sp} held_out')
+"
+```
+
+**If `generate_hard`**: Spawn testgen agent with hard-mode instruction:
+```
+Agent(
+  subagent_type: "evolver-testgen",
+  description: "Generate hard examples to rebalance dataset",
+  prompt: |
+    <objective>
+    The dataset is skewed toward easy examples. Generate {count} HARD examples
+    that the current agent is likely to fail on.
+    Focus on: edge cases, adversarial inputs, complex multi-step queries,
+    ambiguous questions, and inputs that require deep reasoning.
+    </objective>
+    <files_to_read>
+    - .evolver.json
+    - strategy.md (if exists)
+    - production_seed.json (if exists)
+    </files_to_read>
+)
+```
+
+**If `fill_coverage`**: Spawn testgen agent with coverage-fill instruction:
+```
+Agent(
+  subagent_type: "evolver-testgen",
+  description: "Generate examples for missing categories",
+  prompt: |
+    <objective>
+    The dataset is missing these production categories: {categories}.
+    Generate 5 examples per missing category.
+    Use production_seed.json for real-world patterns in these categories.
+    </objective>
+    <files_to_read>
+    - .evolver.json
+    - production_seed.json (if exists)
+    </files_to_read>
+)
+```
+
+**If `retire_dead`**: Move dead examples to retired split:
+```bash
+$EVOLVER_PY -c "
+from langsmith import Client
+import json
+client = Client()
+report = json.load(open('health_report.json'))
+dead_ids = report.get('dead_examples', {}).get('ids', [])
+config = json.load(open('.evolver.json'))
+examples = {str(e.id): e for e in client.list_examples(dataset_name=config['dataset'])}
+retired = 0
+for eid in dead_ids:
+    if eid in examples:
+        client.update_example(examples[eid].id, split='retired')
+        retired += 1
+print(f'Retired {retired} dead examples')
+"
+```
+
+After corrections, log what was done. Do NOT re-run health check (corrections may need an experiment cycle to show effect).
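The `create_splits` correction reduces to a shuffled 70/30 partition. A minimal pure-Python sketch of that logic, with a hypothetical `assign_splits` helper standing in for the LangSmith `update_example` calls:

```python
import random

def assign_splits(example_ids, train_fraction=0.7, seed=None):
    """Shuffle example ids and partition them into train / held_out splits."""
    rng = random.Random(seed)
    ids = list(example_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * train_fraction)
    # In the real correction each id is sent to client.update_example(id, split=...)
    return {"train": ids[:cut], "held_out": ids[cut:]}

splits = assign_splits(range(10), seed=42)
print(len(splits["train"]), len(splits["held_out"]))  # 7 3
```

Seeding the shuffle is optional; the skill's inline version shuffles non-deterministically, which is fine since splits are written back once.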
+
 For each iteration:

 ### 1. Get Next Version

@@ -170,12 +286,13 @@ if [ -n "$BEST" ]; then
 $EVOLVER_PY $TOOLS/read_results.py \
   --experiment "$BEST" \
   --config .evolver.json \
+  --split train \
   --output best_results.json 2>/dev/null
 fi
 ```

 If `best_results.json` exists, parse it to find failing examples (score < 0.7). Group by metadata or error pattern.
-
+This failure data feeds into `synthesize_strategy.py` which generates targeted lenses for proposers.
 If no best_results.json (first iteration without baseline), all proposers work from code analysis only — no failure data available.

 ### 1.8a. Synthesize Strategy

@@ -189,17 +306,18 @@ $EVOLVER_PY $TOOLS/synthesize_strategy.py \
   --best-results best_results.json \
   --evolution-memory evolution_memory.json \
   --production-seed production_seed.json \
-  --output strategy.md
+  --output strategy.md \
+  --lenses lenses.json 2>/dev/null
 ```

-The `strategy.md` file is included in the proposer `<files_to_read>` block via the shared context (Step 1.9).
+The `strategy.md` file is included in the proposer `<files_to_read>` block via the shared context (Step 1.9). The `lenses.json` file contains dynamically generated investigation questions — one per proposer. Each lens directs a proposer's attention to a different aspect of the problem (failure cluster, architecture, production data, evolution memory, or open investigation).

 ### 1.9. Prepare Shared Proposer Context

-Build the shared context that ALL proposers will receive as an identical prefix. This enables KV cache sharing — spawning
+Build the shared context that ALL proposers will receive as an identical prefix. This enables KV cache sharing — spawning N proposers costs barely more than 1.

 ```bash
-# Build shared context block (identical for all
+# Build shared context block (identical for all proposers)
 SHARED_FILES_BLOCK="<files_to_read>
 - .evolver.json
 - strategy.md (if exists)

@@ -223,13 +341,19 @@ You are working in an isolated git worktree — modify any file freely.
 </objective>"
 ```

-**CRITICAL for cache sharing**: The `<objective>`, `<files_to_read>`, and `<context>` blocks MUST be byte-identical across all
+**CRITICAL for cache sharing**: The `<objective>`, `<files_to_read>`, and `<context>` blocks MUST be byte-identical across all proposer prompts. Only the `<lens>` block differs. Place the lens block LAST in the prompt so the shared prefix is maximized.
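The cache-sharing rule amounts to plain string assembly: shared blocks first, the per-proposer lens last, so the common prefix (and thus the reusable KV cache) is as long as possible. A sketch with an illustrative `build_prompt` helper and made-up block contents; only `lens.question` and `lens.source` come from the skill:

```python
SHARED_PREFIX = (
    "<objective>improve the agent</objective>\n"
    "<files_to_read>.evolver.json, strategy.md</files_to_read>\n"
    "<context>current best score and failing examples go here</context>\n"
)

def build_prompt(shared_prefix, lens):
    # Byte-identical prefix for every proposer; only the trailing lens differs.
    return (shared_prefix
            + "<lens>\n"
            + f"Investigation question: {lens['question']}\n"
            + f"Source: {lens['source']}\n"
            + "</lens>")

lenses = [
    {"question": "Why do multi-step queries fail?", "source": "failure_cluster"},
    {"question": "Is routing too coarse-grained?", "source": "architecture"},
]
prompts = [build_prompt(SHARED_PREFIX, lens) for lens in lenses]
print(all(p.startswith(SHARED_PREFIX) for p in prompts))  # True
```

If the shared blocks were interleaved with per-lens text, every prompt would diverge at the first differing byte and the cache win would be lost.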

+### 2. Spawn Proposers in Parallel (Dynamic Lenses)
+
+Read `lenses.json` to get the list of investigation lenses:

-
+```bash
+LENS_COUNT=$(python3 -c "import json; print(json.load(open('lenses.json'))['lens_count'])")
+```

-Each proposer receives the IDENTICAL prefix (objective + files + context) followed by its unique
+Each proposer receives the IDENTICAL prefix (objective + files + context) followed by its unique lens.

-**
+**For each lens** — `run_in_background: true, isolation: "worktree"`:

 The prompt for EACH proposer follows this structure:
 ```

@@ -239,66 +363,54 @@ The prompt for EACH proposer follows this structure:

 {SHARED_CONTEXT_BLOCK}

-<
-
-</strategy>
+<lens>
+Investigation question: {lens.question}

-
-
-
-
-</output>
-```
+This is your STARTING POINT, not your mandate. Investigate, form your
+own hypothesis, and implement whatever you conclude will help most.
+You may solve something entirely different — that's fine.
+If you cannot add meaningful value, ABSTAIN.

-
-
-APPROACH: exploitation
-Make targeted improvements to the current best version.
-Focus on the specific failures identified in the results.
-```
-
-**Candidate B strategy block:**
-```
-APPROACH: exploration
-Try a fundamentally different approach. Change algorithms, prompts, routing, architecture.
-Don't be afraid to make big changes — this worktree is disposable.
-```
+Source: {lens.source}
+</lens>

-
-
-
-
-
+<output>
+1. Investigate the lens question
+2. Decide your approach (or abstain)
+3. If proceeding: modify code, commit, write proposal.md
+4. proposal.md must include: what you chose to do, why, how it relates to the lens
+</output>
 ```

-
-```
-APPROACH: {failure_targeted_or_creative}
-{adaptive_briefing_d}
-```
+For each lens in `lenses.json`, spawn one proposer agent:

-**Candidate E strategy block:**
 ```
-
-
+Agent(
+  subagent_type: "evolver-proposer",
+  description: "Proposer {lens.id}: {lens.source} lens",
+  isolation: "worktree",
+  run_in_background: true,
+  prompt: {SHARED_PREFIX + LENS_BLOCK above, with lens fields filled in}
+)
 ```

-Wait for all
+Wait for all proposers to complete.

 **Stuck proposer detection**: If any proposer hasn't completed after 10 minutes, it may be stuck in a loop. The Claude Code runtime handles this via the agent's turn limit. If a proposer returns without committing changes, skip it — don't retry.

-After all proposers complete, check which ones
+After all proposers complete, check which ones committed and which abstained:

 ```bash
 for WORKTREE in {worktree_paths}; do
-
-
+  if [ -f "$WORKTREE/proposal.md" ] && grep -q "## ABSTAIN" "$WORKTREE/proposal.md" 2>/dev/null; then
+    echo "Proposer in $WORKTREE abstained — skipping evaluation"
+  elif [ $(cd "$WORKTREE" && git log --oneline -1 --since="10 minutes ago" 2>/dev/null | wc -l) -eq 0 ]; then
     echo "Proposer in $WORKTREE made no commits — skipping"
   fi
 done
 ```
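The worktree check has three outcomes per proposer. A hypothetical Python equivalent makes the classification explicit; `proposal_text` and `has_commits` stand in for the proposal.md read and the `git log` probe:

```python
def classify_proposer(proposal_text, has_commits):
    """Mirror the worktree check: abstained, stuck, or ready to evaluate."""
    if proposal_text is not None and "## ABSTAIN" in proposal_text:
        return "abstained"   # valid signal, but skip evaluation
    if not has_commits:
        return "stuck"       # no commits: skip, don't retry
    return "evaluate"        # run this candidate through Step 3

print(classify_proposer("## ABSTAIN\n- **Lens**: ...", has_commits=False))  # abstained
print(classify_proposer(None, has_commits=True))                            # evaluate
```

Note that abstention is checked before commits: an abstaining proposer legitimately has no commits, so the order of the two tests matters.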

-Only run evaluation (Step 3) for proposers that committed changes.
+Only run evaluation (Step 3) for proposers that committed changes (not abstained, not stuck).

 ### 3. Run Target for Each Candidate

@@ -308,7 +420,7 @@ For each worktree that has changes (proposer committed something):
 $EVOLVER_PY $TOOLS/run_eval.py \
   --config .evolver.json \
   --worktree-path {worktree_path} \
-  --experiment-prefix v{NNN}{
+  --experiment-prefix v{NNN}-{lens_id} \
   --timeout 120
 ```

@@ -339,11 +451,7 @@ Agent(
   prompt: |
     <experiment>
     Evaluate the following experiments (one per candidate):
-
-    - {experiment_name_b}
-    - {experiment_name_c}
-    - {experiment_name_d}
-    - {experiment_name_e}
+    {list all experiment names from proposers that committed changes — skip abstained}
     </experiment>

     <evaluators>

@@ -370,14 +478,14 @@ Wait for the evaluator agent to complete before proceeding.

 ```bash
 $EVOLVER_PY $TOOLS/read_results.py \
-  --experiments "
+  --experiments "{comma-separated list of experiment names from non-abstained proposers}" \
   --config .evolver.json \
   --output comparison.json
 ```

 Parse `comparison.json`:
 - `comparison.winner` — highest combined score
-- `comparison.champion` — per-task champion (for next
+- `comparison.champion` — per-task champion (for next iteration's context)
 - `comparison.all_candidates` — all scores for reporting
490
|
|
|
383
491
|
### 5. Merge Winner
|
|
@@ -389,7 +497,7 @@ If the winner scored higher than the current best:
|
|
|
389
497
|
WINNER_BRANCH={winning_worktree_branch}
|
|
390
498
|
|
|
391
499
|
# Merge into main
|
|
392
|
-
git merge $WINNER_BRANCH --no-edit -m "evolve: merge v{NNN}{
|
|
500
|
+
git merge $WINNER_BRANCH --no-edit -m "evolve: merge v{NNN}-{lens_id} (score: {score})"
|
|
393
501
|
```
|
|
394
502
|
|
|
395
503
|
Update `.evolver.json`:
|
|
@@ -409,14 +517,14 @@ json.dump(c, open('.evolver.json', 'w'), indent=2)
|
|
|
409
517
|
|
|
410
518
|
Report ALL candidates:
|
|
411
519
|
```
|
|
412
|
-
Iteration {i}/{N} —
|
|
413
|
-
|
|
414
|
-
v{NNN}
|
|
415
|
-
v{NNN}
|
|
416
|
-
v{NNN}
|
|
417
|
-
|
|
520
|
+
Iteration {i}/{N} — {lens_count} lenses, {evaluated_count} candidates evaluated ({abstained_count} abstained):
|
|
521
|
+
{For each proposer, read proposal.md and extract the Approach field}
|
|
522
|
+
v{NNN}-1 ({approach from proposal.md}): {score} — {summary}
|
|
523
|
+
v{NNN}-2 ({approach from proposal.md}): {score} — {summary}
|
|
524
|
+
v{NNN}-3 (ABSTAINED): -- — {reason from proposal.md}
|
|
525
|
+
...
|
|
418
526
|
|
|
419
|
-
Winner: v{NNN}{
|
|
527
|
+
Winner: v{NNN}-{id} ({score}) — merged into main
|
|
420
528
|
Per-task champion: {champion} (beats winner on {N} tasks)
|
|
421
529
|
```
|
|
422
530
|
|
|
Binary file
|
|
Binary file
|
|
Binary file
|
|
package/tools/adversarial_inject.py
CHANGED

@@ -145,15 +145,20 @@ def generate_adversarial_inputs(client, dataset_name, num_inputs=5):
     return adversarial


-def inject_adversarial(client, dataset_id, adversarial_inputs):
+def inject_adversarial(client, dataset_id, adversarial_inputs, config=None):
     """Add adversarial examples to dataset."""
+    config = config or {}
     added = 0
     for adv in adversarial_inputs:
         try:
+            split = "train" if random.random() < 0.7 else "held_out"
+            metadata = dict(adv["metadata"])
+            metadata["added_at_iteration"] = config.get("iterations", 0)
             client.create_example(
                 inputs=adv["inputs"],
                 dataset_id=dataset_id,
-                metadata=
+                metadata=metadata,
+                split=split,
             )
             added += 1
         except Exception as e:

@@ -182,7 +187,7 @@ def main():

     injected = 0
     if args.inject and adversarial:
-        injected = inject_adversarial(client, config["dataset_id"], adversarial)
+        injected = inject_adversarial(client, config["dataset_id"], adversarial, config=config)

     result = {
         "memorization_suspects": len(suspicious),