harness-evolver 4.2.9 → 4.3.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +1 -1
- package/README.md +15 -9
- package/agents/evolver-proposer.md +7 -46
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +74 -271
- package/skills/health/SKILL.md +120 -0
- package/skills/setup/SKILL.md +66 -64
- package/tools/run_eval.py +6 -1
- package/tools/seed_from_traces.py +24 -105
package/.claude-plugin/plugin.json
CHANGED
@@ -1,7 +1,7 @@
 {
   "name": "harness-evolver",
   "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
-  "version": "4.2.9",
+  "version": "4.3.1",
   "author": {
     "name": "Raphael Valdetaro"
   },
package/README.md
CHANGED
@@ -91,8 +91,12 @@ claude
   <td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which approaches win, which failures recur, and promotes insights after 2+ occurrences.</td>
 </tr>
 <tr>
+  <td><b>Dataset Health</b></td>
+  <td>Pre-flight dataset quality check: size adequacy, difficulty distribution, dead example detection, production coverage analysis, train/held-out splits. Auto-corrects issues before evolution starts.</td>
+</tr>
+<tr>
   <td><b>Smart Gating</b></td>
-  <td>
+  <td>Claude assesses gate conditions directly — score plateau, target reached, diminishing returns. No hardcoded thresholds. State validation ensures config hasn't diverged from LangSmith.</td>
 </tr>
 <tr>
   <td><b>Background Mode</b></td>

@@ -107,6 +111,7 @@ claude
 | Command | What it does |
 |---|---|
 | `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |
+| `/evolver:health` | Check dataset quality (size, difficulty, coverage, splits), auto-correct issues |
 | `/evolver:evolve` | Run the optimization loop (dynamic self-organizing proposers in worktrees) |
 | `/evolver:status` | Show progress, scores, history |
 | `/evolver:deploy` | Tag, push, clean up temporary files |

@@ -132,10 +137,11 @@ claude
 /evolver:evolve
 
 +- 0.5 Validate state (skeptical memory — check .evolver.json vs LangSmith)
++- 0.6 /evolver:health — dataset quality check + auto-correct
 +- 1. Read state (.evolver.json + LangSmith experiments)
 +- 1.5 Gather trace insights (cluster errors, tokens, latency)
-+- 1.8 Analyze per-task failures
-+- 1.8a
++- 1.8 Analyze per-task failures (train split only — proposers don't see held-out)
++- 1.8a Claude generates strategy.md + lenses.json from analysis data
 +- 1.9 Prepare shared proposer context (KV cache-optimized prefix)
 +- 2. Spawn N self-organizing proposers in parallel (each in a git worktree)
 +- 3. Run target for each candidate (code-based evaluators)

@@ -144,10 +150,10 @@ claude
 +- 5. Merge winning worktree into main branch
 +- 5.5 Regression tracking (auto-add guard examples to dataset)
 +- 6. Report results
-+- 6.2
++- 6.2 Consolidator agent updates evolution memory (runs in background)
 +- 6.5 Auto-trigger Active Critic (detect + fix evaluator gaming)
 +- 7. Auto-trigger ULTRAPLAN Architect (opus model, deep analysis)
-+- 8.
++- 8. Claude assesses gate conditions (plateau, target, diminishing returns)
 ```
 
 ---

@@ -159,7 +165,8 @@ Plugin hook (SessionStart)
 └→ Creates venv, installs langsmith + langsmith-cli, exports env vars
 
 Skills (markdown)
-├── /evolver:setup → explores project, runs setup.py
+├── /evolver:setup → explores project, smart defaults, runs setup.py
+├── /evolver:health → dataset quality check + auto-correct
 ├── /evolver:evolve → orchestrates the evolution loop
 ├── /evolver:status → reads .evolver.json + LangSmith
 └── /evolver:deploy → tags and pushes

@@ -179,10 +186,8 @@ Tools (Python + langsmith SDK)
 ├── trace_insights.py → clusters errors from traces
 ├── seed_from_traces.py → imports production traces
 ├── validate_state.py → validates config vs LangSmith state
-├──
+├── dataset_health.py → dataset quality diagnostic (size, difficulty, coverage, splits)
 ├── regression_tracker.py → tracks regressions, adds guard examples
-├── consolidate.py → cross-iteration memory consolidation
-├── synthesize_strategy.py→ generates strategy document + investigation lenses
 ├── add_evaluator.py → programmatically adds evaluators
 └── adversarial_inject.py → detects memorization, injects adversarial tests
 ```

@@ -221,6 +226,7 @@ LangSmith traces **any** AI framework. The evolver works with all of them:
 - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI
 - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
 - [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation) — LangChain
+- [Harnessing Claude's Intelligence](https://claude.com/blog/harnessing-claudes-intelligence) — Martin, Anthropic, 2026
 - [Traces Start the Agent Improvement Loop](https://www.langchain.com/conceptual-guides/traces-start-agent-improvement-loop) — LangChain
 
 ---
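The loop above is driven by the `.evolver.json` state file. A minimal sketch of reading the fields that this diff references (illustrative only; the real file may carry additional keys, and the `best_score`/`target_score` names are taken from the gate-check wording in the skills below):

```python
import json

# Illustrative read of .evolver.json; only fields referenced elsewhere in this diff are shown.
config = json.load(open(".evolver.json"))

next_version = f"v{config['iterations'] + 1:03d}"          # e.g. v004, as computed in the evolve skill
best_experiment = config.get("best_experiment")             # None before a baseline run
production_project = config.get("production_project", "")   # optional, feeds seed_from_traces.py
concurrency = config.get("eval_concurrency", 1)             # new in 4.3.1, used by run_eval.py

if config.get("best_score", 0.0) >= config.get("target_score", 1.0):
    print(f"{next_version}: target reached for dataset {config['dataset']}")
```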
package/agents/evolver-proposer.md
CHANGED
@@ -22,14 +22,7 @@ Your prompt contains `<files_to_read>`, `<context>`, and `<lens>` blocks. You MU
 
 ## Turn Budget
 
-
-- Spend early turns reading context and investigating your lens question
-- Spend middle turns implementing changes and consulting documentation
-- Reserve final turns for committing and writing proposal.md
-
-**If you're past turn 12 and haven't started implementing**, simplify your approach. A small, focused change that works is better than an ambitious change that's incomplete.
-
-**Context management**: After turn 8, avoid re-reading files you've already read. Reference your earlier analysis instead of re-running Glob/Grep searches.
+Most proposals need **10-15 turns**. Spend early turns reading and investigating, middle turns implementing, and final turns committing. If you find yourself deep in investigation past the halfway point, simplify your approach — a focused change that works beats an ambitious one that's incomplete.
 
 ## Lens Protocol
 

@@ -44,19 +37,7 @@ You are NOT constrained to the lens topic. The lens gives you a starting perspec
 
 ## Your Workflow
 
-
-
-**Orient** — Read .evolver.json, strategy.md, evolution_memory.md. Understand the framework, entry point, evaluators, current score, and what has been tried before.
-
-**Investigate** — Read trace_insights.json and best_results.json. Understand which examples fail and why. If production_seed.json exists, understand real-world usage patterns. Focus on data relevant to your lens question.
-
-**Decide** — Based on investigation, decide what to change. Consider:
-- **Prompts**: system prompts, few-shot examples, output format instructions
-- **Routing**: how queries are dispatched to different handlers
-- **Tools**: tool definitions, tool selection logic
-- **Architecture**: agent topology, chain structure, graph edges
-- **Error handling**: retry logic, fallback strategies, timeout handling
-- **Model selection**: which model for which task
+Read the available context files (.evolver.json, strategy.md, evolution_memory.md, trace_insights.json, best_results.json, production_seed.json). Investigate your lens question. Decide what to change and implement it.
 
 ## Self-Abstention
 

@@ -74,34 +55,14 @@ To abstain, skip implementation and write only a `proposal.md`:
 
 Then end with the return protocol using `ABSTAIN` as your approach.
 
-
-
-**Before writing ANY code**, you MUST consult Context7 for every library you'll be modifying or using. This is NOT optional.
-
-**Step 1 — Identify libraries from the code you read:**
-Read the imports in the files you're about to modify. For each framework/library (LangGraph, OpenAI, Anthropic, CrewAI, etc.):
-
-**Step 2 — Resolve library ID:**
-```
-resolve-library-id(libraryName: "langgraph", query: "what you're trying to do")
-```
-This returns up to 10 matches. Pick the one with the highest relevance.
-
-**Step 3 — Query docs for your specific task:**
-```
-get-library-docs(libraryId: "/langchain-ai/langgraph", query: "conditional edges StateGraph", topic: "routing")
-```
-Ask about the SPECIFIC API you're going to use or change.
+## Consult Documentation
 
-
-- About to modify a StateGraph? → `query: "StateGraph add_conditional_edges"`
-- Changing prompt template? → `query: "ChatPromptTemplate from_messages"` for langchain
-- Adding a tool? → `query: "StructuredTool create tool definition"` for langchain
-- Changing model? → `query: "ChatOpenAI model parameters temperature"` for openai
+Before modifying library APIs (LangGraph, OpenAI, Anthropic, etc.), consult Context7 to verify you're using current patterns:
 
-
+1. `resolve-library-id(libraryName: "langgraph")`
+2. `get-library-docs(libraryId: "/langchain-ai/langgraph", query: "your specific API question")`
 
-
+If Context7 MCP is not available, note in proposal.md that API patterns were not verified.
 
 ### Commit and Document
 
package/package.json
CHANGED
package/skills/evolve/SKILL.md
CHANGED
@@ -131,119 +131,7 @@ If critical issues found, ask user whether to continue or fix first via AskUserQ
 
 ### 0.6. Dataset Health Check
 
-
-
-```bash
-$EVOLVER_PY $TOOLS/dataset_health.py \
-  --config .evolver.json \
-  --production-seed production_seed.json \
-  --output health_report.json 2>/dev/null
-```
-
-Read `health_report.json`. Print summary:
-```bash
-python3 -c "
-import json, os
-if os.path.exists('health_report.json'):
-    r = json.load(open('health_report.json'))
-    print(f'Dataset Health: {r[\"health_score\"]}/10 ({r[\"example_count\"]} examples)')
-    for issue in r.get('issues', []):
-        print(f' [{issue[\"severity\"]}] {issue[\"message\"]}')
-"
-```
-
-### 0.7. Auto-Correct Dataset Issues
-
-If `health_report.json` has corrections, apply them automatically:
-
-```bash
-CORRECTIONS=$(python3 -c "
-import json, os
-if os.path.exists('health_report.json'):
-    r = json.load(open('health_report.json'))
-    for c in r.get('corrections', []):
-        print(c['action'])
-" 2>/dev/null)
-```
-
-For each correction:
-
-**If `create_splits`**: Run inline Python to assign 70/30 splits:
-```bash
-$EVOLVER_PY -c "
-from langsmith import Client
-import json, random
-client = Client()
-config = json.load(open('.evolver.json'))
-examples = list(client.list_examples(dataset_name=config['dataset']))
-random.shuffle(examples)
-sp = int(len(examples) * 0.7)
-for ex in examples[:sp]:
-    client.update_example(ex.id, split='train')
-for ex in examples[sp:]:
-    client.update_example(ex.id, split='held_out')
-print(f'Assigned splits: {sp} train, {len(examples)-sp} held_out')
-"
-```
-
-**If `generate_hard`**: Spawn testgen agent with hard-mode instruction:
-```
-Agent(
-  subagent_type: "evolver-testgen",
-  description: "Generate hard examples to rebalance dataset",
-  prompt: |
-    <objective>
-    The dataset is skewed toward easy examples. Generate {count} HARD examples
-    that the current agent is likely to fail on.
-    Focus on: edge cases, adversarial inputs, complex multi-step queries,
-    ambiguous questions, and inputs that require deep reasoning.
-    </objective>
-    <files_to_read>
-    - .evolver.json
-    - strategy.md (if exists)
-    - production_seed.json (if exists)
-    </files_to_read>
-)
-```
-
-**If `fill_coverage`**: Spawn testgen agent with coverage-fill instruction:
-```
-Agent(
-  subagent_type: "evolver-testgen",
-  description: "Generate examples for missing categories",
-  prompt: |
-    <objective>
-    The dataset is missing these production categories: {categories}.
-    Generate 5 examples per missing category.
-    Use production_seed.json for real-world patterns in these categories.
-    </objective>
-    <files_to_read>
-    - .evolver.json
-    - production_seed.json (if exists)
-    </files_to_read>
-)
-```
-
-**If `retire_dead`**: Move dead examples to retired split:
-```bash
-$EVOLVER_PY -c "
-from langsmith import Client
-import json
-client = Client()
-report = json.load(open('health_report.json'))
-dead_ids = report.get('dead_examples', {}).get('ids', [])
-config = json.load(open('.evolver.json'))
-examples = {str(e.id): e for e in client.list_examples(dataset_name=config['dataset'])}
-retired = 0
-for eid in dead_ids:
-    if eid in examples:
-        client.update_example(examples[eid].id, split='retired')
-        retired += 1
-print(f'Retired {retired} dead examples')
-"
-```
-
-After corrections, log what was done. Do NOT re-run health check (corrections may need an experiment cycle to show effect).
+Invoke `/evolver:health` to check and auto-correct dataset issues. If health_report.json shows critical issues that couldn't be auto-corrected, ask user whether to proceed via AskUserQuestion.
 
 ### 0.8. Resolve Project Directory
 

@@ -268,66 +156,75 @@ For each iteration:
 python3 -c "import json; c=json.load(open('.evolver.json')); print(f'v{c[\"iterations\"]+1:03d}')"
 ```
 
-### 1.5. Gather
+### 1.5. Gather Analysis Data (Parallel)
 
-Read the best experiment from config. If null (no baseline was run), skip
+Read the best experiment from config. If null (no baseline was run), skip data gathering — proposers will work from code analysis only:
 
 ```bash
 BEST=$(python3 -c "import json; b=json.load(open('.evolver.json')).get('best_experiment'); print(b if b else '')")
+PROD=$(python3 -c "import json; c=json.load(open('.evolver.json')); print(c.get('production_project',''))")
+
 if [ -n "$BEST" ]; then
+  # Run all data gathering in parallel — these are independent API calls
   $EVOLVER_PY $TOOLS/trace_insights.py \
     --from-experiment "$BEST" \
-    --output trace_insights.json 2>/dev/null
-fi
-```
+    --output trace_insights.json 2>/dev/null &
 
-
+  $EVOLVER_PY $TOOLS/read_results.py \
+    --experiment "$BEST" \
+    --config .evolver.json \
+    --split train \
+    --output best_results.json 2>/dev/null &
+fi
 
-```bash
-PROD=$(python3 -c "import json; c=json.load(open('.evolver.json')); print(c.get('production_project',''))")
 if [ -n "$PROD" ] && [ ! -f "production_seed.json" ]; then
   $EVOLVER_PY $TOOLS/seed_from_traces.py \
-    --project "$PROD"
+    --project "$PROD" \
     --output-md production_seed.md \
     --output-json production_seed.json \
-    --limit 100 2>/dev/null
+    --limit 100 2>/dev/null &
 fi
-```
 
-
-
-If `$BEST` is set (not the first iteration without baseline), read results and cluster failures:
-
-```bash
-if [ -n "$BEST" ]; then
-  $EVOLVER_PY $TOOLS/read_results.py \
-    --experiment "$BEST" \
-    --config .evolver.json \
-    --split train \
-    --output best_results.json 2>/dev/null
-fi
+wait  # Wait for all data gathering to complete
 ```
 
 If `best_results.json` exists, parse it to find failing examples (score < 0.7). Group by metadata or error pattern.
-This failure data feeds into
+This failure data feeds into the strategy and lens generation step (1.8a).
 If no best_results.json (first iteration without baseline), all proposers work from code analysis only — no failure data available.
 
-### 1.8a.
+### 1.8a. Generate Strategy and Lenses
 
-
+Read the available analysis files:
+- `trace_insights.json` (error clusters, token analysis)
+- `best_results.json` (per-task scores and failures)
+- `evolution_memory.json` / `evolution_memory.md` (cross-iteration insights)
+- `production_seed.json` (real-world traffic patterns, if exists)
 
-
-
-
-
-
-
-
-
-
+Based on this data, generate two files:
+
+**`strategy.md`** — A concise strategy document with: target files, failure clusters (prioritized), recommended approaches (from evolution memory), approaches to avoid, top failing examples, and production insights.
+
+**`lenses.json`** — Investigation questions for proposers, format:
+```json
+{
+  "generated_at": "ISO timestamp",
+  "lens_count": N,
+  "lenses": [
+    {"id": 1, "question": "...", "source": "failure_cluster|architecture|production|evolution_memory|uniform_failure|open", "severity": "critical|high|medium", "context": {}},
+    ...
+  ]
+}
 ```
 
-
+Lens generation rules:
+- One lens per distinct failure cluster (max 3)
+- One architecture lens if high-severity structural issues exist
+- One production lens if production data shows problems
+- One evolution memory lens if a pattern won 2+ times
+- One persistent failure lens if a pattern recurred 3+ iterations
+- If all examples fail with same error, one "uniform_failure" lens
+- Always include one "open" lens
+- Sort by severity (critical > high > medium), cap at max_proposers from config (default 5)
 
 ### 1.9. Prepare Shared Proposer Context
 
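A minimal sketch of how the lens rules above could be applied mechanically (illustrative only: the skill has Claude write lenses.json directly, and the failure-cluster field names used here are assumptions, not part of the package):

```python
import json
from datetime import datetime, timezone

def build_lenses(failure_clusters, max_proposers=5):
    """Illustrative only: turn failure clusters into a lenses.json document."""
    severity_rank = {"critical": 0, "high": 1, "medium": 2}
    lenses = []
    # One lens per distinct failure cluster, capped at 3
    for cluster in failure_clusters[:3]:
        lenses.append({
            "question": f"Why do examples in cluster '{cluster['name']}' fail?",
            "source": "failure_cluster",
            "severity": cluster.get("severity", "medium"),
            "context": {"examples": cluster.get("example_ids", [])},
        })
    # Always include one open-ended lens
    lenses.append({"question": "What single change would most improve the score?",
                   "source": "open", "severity": "medium", "context": {}})
    # Sort by severity and cap at max_proposers from config
    lenses.sort(key=lambda l: severity_rank.get(l["severity"], 3))
    lenses = lenses[:max_proposers]
    for i, lens in enumerate(lenses, 1):
        lens["id"] = i
    return {"generated_at": datetime.now(timezone.utc).isoformat(),
            "lens_count": len(lenses), "lenses": lenses}

print(json.dumps(build_lenses([{"name": "timeout", "severity": "critical"}]), indent=2))
```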
@@ -433,20 +330,23 @@ done
 
 Only run evaluation (Step 3) for proposers that committed changes (not abstained, not stuck).
 
-### 3. Run Target for Each Candidate
+### 3. Run Target for Each Candidate (Parallel)
 
-
+Run evaluations for ALL candidates simultaneously — they're independent:
 
 ```bash
-#
-
-
-
-
-
-
-
-
+# Launch all evaluations in parallel
+for WORKTREE in {worktree_paths_with_commits}; do
+  WORKTREE_PROJECT="$WORKTREE"
+  [ -n "$PROJECT_DIR" ] && WORKTREE_PROJECT="$WORKTREE/$PROJECT_DIR"
+
+  $EVOLVER_PY $TOOLS/run_eval.py \
+    --config "$WORKTREE_PROJECT/.evolver.json" \
+    --worktree-path "$WORKTREE_PROJECT" \
+    --experiment-prefix v{NNN}-{lens_id} \
+    --timeout 120 &
+done
+wait  # Wait for all evaluations to complete
 ```
 
 Each candidate becomes a separate LangSmith experiment. This step runs the agent and applies code-based evaluators (has_output, token_efficiency) only.

@@ -473,27 +373,7 @@ Then spawn ONE evaluator agent that scores ALL candidates in a single pass. This
 Agent(
   subagent_type: "evolver-evaluator",
   description: "Evaluate all candidates for iteration v{NNN}",
-  prompt:
-    <experiment>
-    Evaluate the following experiments (one per candidate):
-    {list all experiment names from proposers that committed changes — skip abstained}
-    </experiment>
-
-    <evaluators>
-    Apply these evaluators to each run in each experiment:
-    - {llm_evaluator_list, e.g. "correctness", "conciseness"}
-    </evaluators>
-
-    <context>
-    Agent type: {framework} agent
-    Domain: {description from .evolver.json or entry point context}
-    Entry point: {entry_point}
-
-    For each experiment:
-    1. Read all runs via: langsmith-cli --json runs list --project "{experiment_name}" --fields id,inputs,outputs,error --is-root true --limit 200
-    2. Judge each run's output against the input
-    3. Write scores via: langsmith-cli --json feedback create {run_id} --key {evaluator} --score {0.0|1.0} --comment "{reason}" --source model
-    </context>
+  prompt: "Experiments to evaluate: {comma-separated experiment names from non-abstained proposers}. Evaluators: {llm_evaluator_list}. Framework: {framework}. Entry point: {entry_point}."
 )
 ```
 

@@ -592,45 +472,18 @@ Print: `Iteration {i}/{N}: v{NNN} scored {score} (best: {best} at {best_score})`
 
 ### 6.2. Consolidate Evolution Memory
 
-Spawn the consolidator agent
+Spawn the consolidator agent (runs in background — doesn't block the next iteration):
 
 ```
 Agent(
   subagent_type: "evolver-consolidator",
   description: "Consolidate evolution memory after iteration v{NNN}",
   run_in_background: true,
-  prompt:
-    <objective>
-    Consolidate learnings from iteration v{NNN}.
-    Run the consolidation tool and review its output.
-    </objective>
-
-    <tools_path>
-    TOOLS={tools_path}
-    EVOLVER_PY={evolver_py_path}
-    </tools_path>
-
-    <instructions>
-    Run: $EVOLVER_PY $TOOLS/consolidate.py \
-      --config .evolver.json \
-      --comparison-files comparison.json \
-      --output evolution_memory.md \
-      --output-json evolution_memory.json
-
-    Then read the output and verify insights are accurate.
-    </instructions>
-
-    <files_to_read>
-    - .evolver.json
-    - comparison.json
-    - trace_insights.json (if exists)
-    - regression_report.json (if exists)
-    - evolution_memory.md (if exists)
-    </files_to_read>
+  prompt: "Update evolution_memory.md with learnings from this iteration. Read .evolver.json, comparison.json, trace_insights.json, regression_report.json (if exists), and current evolution_memory.md (if exists). Track what worked, what failed, and promote insights that recur across iterations."
 )
 ```
 
-The `evolution_memory.md` file will be
+The `evolution_memory.md` file will be available for proposer briefings in subsequent iterations.
 
 ### 6.5. Auto-trigger Active Critic
 

@@ -639,25 +492,8 @@ If score jumped >0.3 from previous iteration OR reached target in <3 iterations:
 ```
 Agent(
   subagent_type: "evolver-critic",
-  description: "
-  prompt:
-    <objective>
-    EVAL GAMING CHECK: Score jumped from {prev_score} to {score}.
-    Check if the LangSmith evaluators are being gamed.
-    If gaming detected, add stricter evaluators using $TOOLS/add_evaluator.py.
-    </objective>
-
-    <tools_path>
-    TOOLS={tools_path}
-    EVOLVER_PY={evolver_py_path}
-    </tools_path>
-
-    <files_to_read>
-    - .evolver.json
-    - comparison.json
-    - trace_insights.json
-    - evolution_memory.md (if exists)
-    </files_to_read>
+  description: "Check evaluator gaming after score jump",
+  prompt: "Score jumped from {prev_score} to {score}. Check if LangSmith evaluators are being gamed. Read .evolver.json, comparison.json, trace_insights.json, evolution_memory.md. If gaming detected, add stricter evaluators using $EVOLVER_PY $TOOLS/add_evaluator.py."
 )
 ```
 

@@ -674,55 +510,22 @@ If 3 consecutive iterations within 1% OR score dropped:
 Agent(
   subagent_type: "evolver-architect",
   model: "opus",
-  description: "
-  prompt:
-    <objective>
-    The evolution loop has stagnated after {iterations} iterations.
-    Scores: {last_3_scores}.
-    Perform deep architectural analysis and recommend structural changes.
-    Use extended thinking — you have more compute budget than normal agents.
-    </objective>
-
-    <tools_path>
-    TOOLS={tools_path}
-    EVOLVER_PY={evolver_py_path}
-    </tools_path>
-
-    <files_to_read>
-    - .evolver.json
-    - trace_insights.json
-    - evolution_memory.md (if exists)
-    - evolution_memory.json (if exists)
-    - strategy.md (if exists)
-    - {entry point and all related source files}
-    </files_to_read>
+  description: "Deep topology analysis after stagnation",
+  prompt: "Evolution stagnated after {iterations} iterations. Scores: {last_3_scores}. Analyze architecture and recommend structural changes. Read .evolver.json, trace_insights.json, evolution_memory.md, strategy.md, and the entry point source files. Use $EVOLVER_PY $TOOLS/analyze_architecture.py for AST analysis if helpful."
 )
 ```
 
 After architect completes, include `architecture.md` in proposer `<files_to_read>` for next iteration.
 
-### 8. Gate Check
-
-Before starting the next iteration, run the gate check:
+### 8. Gate Check
 
-
-GATE_RESULT=$($EVOLVER_PY $TOOLS/iteration_gate.py --config .evolver.json 2>/dev/null)
-PROCEED=$(echo "$GATE_RESULT" | python3 -c "import sys,json; print(json.load(sys.stdin).get('proceed', True))")
-```
-
-If `PROCEED` is `False`, check suggestions:
-
-```bash
-SUGGEST=$(echo "$GATE_RESULT" | python3 -c "import sys,json; s=json.load(sys.stdin).get('suggestions',[]); print(s[0] if s else '')")
-```
+Read `.evolver.json` history and assess whether to continue:
 
-- If
--
--
+- **Score plateau**: If last 3 scores are within 2% of each other, evolution may have converged. Consider triggering architect (Step 7) or stopping.
+- **Target reached**: If `best_score >= target_score`, stop and report success.
+- **Diminishing returns**: If average improvement over last 5 iterations is less than 0.5%, consider stopping.
 
-
-- **Target**: `score >= target_score` → stop
-- **N reached**: all requested iterations done → stop
+If stopping, skip to the final report. If continuing, proceed to next iteration.
 
 ## When Loop Ends — Final Report
 
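A minimal sketch of the gate conditions from step 8 as code (illustrative only; the skill has Claude assess these directly from `.evolver.json`, and the `score_history` key used here is an assumption rather than a documented field):

```python
import json

def assess_gate(config_path=".evolver.json"):
    """Illustrative gate check: plateau, target reached, diminishing returns."""
    config = json.load(open(config_path))
    scores = config.get("score_history", [])          # assumed key, not guaranteed
    best = max(scores) if scores else 0.0
    target = config.get("target_score", 1.0)

    if best >= target:
        return "stop: target reached"
    if len(scores) >= 3 and max(scores[-3:]) - min(scores[-3:]) <= 0.02:
        return "plateau: consider architect or stopping"
    if len(scores) >= 6 and (scores[-1] - scores[-6]) / 5 < 0.005:
        return "stop: diminishing returns"
    return "continue"
```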
package/skills/health/SKILL.md
ADDED
@@ -0,0 +1,120 @@
+---
+name: evolver:health
+description: "Use when the user wants to check dataset quality, diagnose eval issues, or before running evolve. Checks size, difficulty distribution, dead examples, coverage, and splits. Auto-corrects issues found."
+allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion]
+---
+
+# /evolver:health
+
+Check eval dataset quality and auto-correct issues. Can be run independently or is invoked by `/evolver:evolve` before the iteration loop.
+
+## Prerequisites
+
+`.evolver.json` must exist. If not, tell user to run `/evolver:setup`.
+
+## Resolve Tool Path and Python
+
+```bash
+TOOLS="${EVOLVER_TOOLS:-$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver/tools")}"
+EVOLVER_PY="${EVOLVER_PY:-$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")}"
+```
+
+## 1. Run Health Diagnostic
+
+```bash
+$EVOLVER_PY $TOOLS/dataset_health.py \
+  --config .evolver.json \
+  --production-seed production_seed.json \
+  --output health_report.json 2>/dev/null
+```
+
+Print summary:
+```bash
+python3 -c "
+import json, os
+if os.path.exists('health_report.json'):
+    r = json.load(open('health_report.json'))
+    print(f'Dataset Health: {r[\"health_score\"]}/10 ({r[\"example_count\"]} examples)')
+    for issue in r.get('issues', []):
+        print(f' [{issue[\"severity\"]}] {issue[\"message\"]}')
+    if not r.get('issues'):
+        print(' No issues found.')
+"
+```
+
+## 2. Auto-Correct Issues
+
+If `health_report.json` has corrections, apply them automatically:
+
+```bash
+CORRECTIONS=$(python3 -c "
+import json, os
+if os.path.exists('health_report.json'):
+    r = json.load(open('health_report.json'))
+    for c in r.get('corrections', []):
+        print(c['action'])
+" 2>/dev/null)
+```
+
+For each correction:
+
+**If `create_splits`**: Assign 70/30 train/held_out splits:
+```bash
+$EVOLVER_PY -c "
+from langsmith import Client
+import json, random
+client = Client()
+config = json.load(open('.evolver.json'))
+examples = list(client.list_examples(dataset_name=config['dataset']))
+random.shuffle(examples)
+sp = int(len(examples) * 0.7)
+for ex in examples[:sp]:
+    client.update_example(ex.id, split='train')
+for ex in examples[sp:]:
+    client.update_example(ex.id, split='held_out')
+print(f'Assigned splits: {sp} train, {len(examples)-sp} held_out')
+"
+```
+
+**If `generate_hard`**: Spawn testgen agent to generate hard examples:
+```
+Agent(
+  subagent_type: "evolver-testgen",
+  description: "Generate hard examples to rebalance dataset",
+  prompt: "The dataset is skewed toward easy examples. Generate {count} HARD examples that the current agent is likely to fail on. Focus on edge cases, adversarial inputs, and complex multi-step queries. Read .evolver.json and production_seed.json for context."
+)
+```
+
+**If `fill_coverage`**: Spawn testgen agent for missing categories:
+```
+Agent(
+  subagent_type: "evolver-testgen",
+  description: "Generate examples for missing categories",
+  prompt: "The dataset is missing these production categories: {categories}. Generate 5 examples per missing category. Read .evolver.json and production_seed.json for context."
+)
+```
+
+**If `retire_dead`**: Move dead examples to retired split:
+```bash
+$EVOLVER_PY -c "
+from langsmith import Client
+import json
+client = Client()
+report = json.load(open('health_report.json'))
+dead_ids = report.get('dead_examples', {}).get('ids', [])
+config = json.load(open('.evolver.json'))
+examples = {str(e.id): e for e in client.list_examples(dataset_name=config['dataset'])}
+retired = 0
+for eid in dead_ids:
+    if eid in examples:
+        client.update_example(examples[eid].id, split='retired')
+        retired += 1
+print(f'Retired {retired} dead examples')
+"
+```
+
+After corrections, log what was done.
+
+## 3. Report
+
+Print final health status. If critical issues remain that couldn't be auto-corrected, warn the user.
package/skills/setup/SKILL.md
CHANGED
@@ -86,82 +86,84 @@ The runner writes `{"input": "user question..."}` to a temp `.json` file and rep
 
 If no placeholder and no `--input` flag detected, the runner appends `--input <path> --output <path>`.
 
-## Phase 2: Confirm
+## Phase 2: Confirm Configuration (interactive)
 
-
-
-```json
-{
-  "questions": [{
-    "question": "Here's what I detected. Does this look right?\n\nEntry point: {path}\nFramework: {framework}\nRun command: {command}\nLangSmith: {status}",
-    "header": "Confirm",
-    "multiSelect": false,
-    "options": [
-      {"label": "Looks good, proceed", "description": "Continue with detected configuration"},
-      {"label": "Let me adjust", "description": "I'll provide correct paths and commands"},
-      {"label": "Wrong directory", "description": "I need to cd somewhere else first"}
-    ]
-  }]
-}
-```
-
-## Phase 3: What to Optimize (interactive)
+Present all detected configuration in one view with smart defaults and ask for confirmation.
 
 Use AskUserQuestion:
 
 ```json
 {
   "questions": [{
-    "question": "
-    "header": "
-    "multiSelect": true,
-    "options": [
-      {"label": "Accuracy", "description": "Correctness of outputs — LLM-as-judge evaluator"},
-      {"label": "Latency", "description": "Response time — track and minimize"},
-      {"label": "Token efficiency", "description": "Fewer tokens for same quality"},
-      {"label": "Error handling", "description": "Reduce failures, timeouts, crashes"}
-    ]
-  }]
-}
-```
-
-Map selections to evaluator configuration for setup.py.
-
-## Phase 4: Test Data Source (interactive)
-
-Use AskUserQuestion with **preview**:
-
-```json
-{
-  "questions": [{
-    "question": "Where should test inputs come from?",
-    "header": "Test data",
+    "question": "Here's the configuration for your project:\n\n**Entry point**: {command}\n**Framework**: {framework}\n**Python**: {venv_path or 'system python3'}\n**Optimization goals**: accuracy (correctness evaluator)\n**Test data**: generate 30 examples with AI\n\nDoes this look good?",
+    "header": "Setup Configuration",
     "multiSelect": false,
     "options": [
-      {
-
-
-
-      },
-      {
-        "label": "Generate from code",
-        "description": "AI generates test inputs by analyzing your code",
-        "preview": "## Generate from Code\n\nThe testgen agent reads your source code and generates\n30 diverse test inputs:\n- 40% standard cases\n- 20% edge cases\n- 20% cross-domain\n- 20% adversarial\n\nOutputs are scored by LLM-as-judge."
-      },
-      {
-        "label": "I have test data",
-        "description": "Point to an existing file with test inputs",
-        "preview": "## Provide Test Data\n\nSupported formats:\n- JSON array of inputs\n- JSON with {\"inputs\": {...}} objects\n- CSV with input columns\n\nExample:\n```json\n[\n  {\"input\": \"What is Python?\"},\n  {\"input\": \"Explain quantum computing\"}\n]\n```"
-      }
+      {"label": "Looks good, proceed", "description": "Use these settings and start setup"},
+      {"label": "Customize goals", "description": "Choose different optimization goals"},
+      {"label": "I have test data", "description": "Use existing JSON file or LangSmith project"},
+      {"label": "Let me adjust everything", "description": "Change entry point, framework, goals, and data source"}
     ]
   }]
 }
 ```
 
-If "
-
-
-
+**If "Looks good, proceed"**: Use defaults — goals=accuracy, data=generate 30 with testgen. Skip straight to Phase 3.
+
+**If "Customize goals"**: Ask the goals question, then proceed to Phase 3 with testgen as default data source.
+
+Use AskUserQuestion:
+
+```json
+{
+  "questions": [{
+    "question": "What do you want to optimize?",
+    "header": "Goals",
+    "multiSelect": true,
+    "options": [
+      {"label": "Accuracy", "description": "Correctness of outputs — LLM-as-judge evaluator"},
+      {"label": "Latency", "description": "Response time — track and minimize"},
+      {"label": "Token efficiency", "description": "Fewer tokens for same quality"},
+      {"label": "Error handling", "description": "Reduce failures, timeouts, crashes"}
+    ]
+  }]
+}
+```
+
+Map selections to evaluator configuration for setup.py.
+
+**If "I have test data"**: Ask the data source question, then proceed to Phase 3 with accuracy as default goal.
+
+Use AskUserQuestion with **preview**:
+
+```json
+{
+  "questions": [{
+    "question": "Where should test inputs come from?",
+    "header": "Test data",
+    "multiSelect": false,
+    "options": [
+      {
+        "label": "Import from LangSmith",
+        "description": "Use real production traces as test inputs",
+        "preview": "## Import from LangSmith\n\nFetches up to 100 recent traces from your production project.\nPrioritizes traces with negative feedback.\nCreates a LangSmith Dataset with real user inputs.\n\nRequires: an existing LangSmith project with traces."
+      },
+      {
+        "label": "I have a file",
+        "description": "Point to an existing file with test inputs",
+        "preview": "## Provide Test Data\n\nSupported formats:\n- JSON array of inputs\n- JSON with {\"inputs\": {...}} objects\n- CSV with input columns\n\nExample:\n```json\n[\n  {\"input\": \"What is Python?\"},\n  {\"input\": \"Explain quantum computing\"}\n]\n```"
+      }
+    ]
+  }]
+}
+```
+
+If "Import from LangSmith": discover projects and ask which one (same as v2 Phase 1.9).
+If "I have a file": ask for file path.
+
+**If "Let me adjust everything"**: Ask all three original questions in sequence — confirm detection (entry point, framework, run command), then goals, then data source — using the question formats above.
+
+## Phase 3: Run Setup
 
 Build the setup.py command based on all gathered information:
 

@@ -178,7 +180,7 @@ $EVOLVER_PY $TOOLS/setup.py \
 
 If "Generate from code" was selected AND no test data file exists, first spawn the testgen agent to generate inputs, then pass the generated file to setup.py.
 
-## Phase
+## Phase 4: Generate Test Data (if needed)
 
 If testgen is needed, spawn it:
 

@@ -205,7 +207,7 @@ Agent(
 
 Then pass `--dataset-from-file test_inputs.json` to setup.py.
 
-## Phase
+## Phase 5: Report
 
 ```
 Setup complete!
package/tools/run_eval.py
CHANGED
@@ -166,11 +166,14 @@ def main():
     parser.add_argument("--worktree-path", required=True, help="Path to the candidate's worktree")
     parser.add_argument("--experiment-prefix", required=True, help="Experiment name prefix (e.g. v001a)")
     parser.add_argument("--timeout", type=int, default=120, help="Per-task timeout in seconds")
+    parser.add_argument("--concurrency", type=int, default=None, help="Max concurrent evaluations (default: from config or 1)")
     args = parser.parse_args()
 
     with open(args.config) as f:
         config = json.load(f)
 
+    concurrency = args.concurrency or config.get("eval_concurrency", 1)
+
     os.environ["EVAL_TASK_TIMEOUT"] = str(args.timeout)
     ensure_langsmith_api_key()
 

@@ -188,6 +191,8 @@ def main():
     print(f" Dataset: {config['dataset']}")
     print(f" Worktree: {args.worktree_path}")
    print(f" Code evaluators: {['has_output'] + code_evaluators}")
+    if concurrency > 1:
+        print(f" Concurrency: {concurrency} parallel evaluations")
     if llm_evaluators:
         print(f" Pending LLM evaluators (agent): {llm_evaluators}")
 

@@ -197,7 +202,7 @@
         data=config["dataset"],
         evaluators=evaluators,
         experiment_prefix=args.experiment_prefix,
-        max_concurrency=
+        max_concurrency=concurrency,
     )
 
     experiment_name = results.experiment_name
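For context, the call receiving `max_concurrency` appears to be LangSmith's `evaluate()` entry point (its name is not visible in the hunk above). A minimal, self-contained sketch of how the new concurrency value feeds it, with placeholder target and dataset names (the real script builds these from the worktree and `.evolver.json`):

```python
from langsmith import evaluate

def run_candidate(inputs: dict) -> dict:
    # Placeholder target; run_eval.py invokes the worktree's entry point here.
    return {"output": "..."}

concurrency = 4  # from --concurrency, or config["eval_concurrency"], defaulting to 1

results = evaluate(
    run_candidate,
    data="my-dataset",                 # config["dataset"] in run_eval.py
    evaluators=[],                     # code-based evaluators (has_output, token_efficiency)
    experiment_prefix="v001-1",
    max_concurrency=concurrency,       # the value added in 4.3.1
)
print(results.experiment_name)
```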
package/tools/seed_from_traces.py
CHANGED
@@ -1,8 +1,7 @@
 #!/usr/bin/env python3
 """Fetch and summarize production LangSmith traces for Harness Evolver.
 
-
-production traces and produce:
+Uses the LangSmith Python SDK to fetch production traces and produce:
 1. A markdown seed file for the testgen agent (production_seed.md)
 2. A JSON summary for programmatic use (production_seed.json)
 

@@ -11,85 +10,18 @@ Usage:
     --project ceppem-langgraph \
     --output-md production_seed.md \
     --output-json production_seed.json \
-    [--api-key-env LANGSMITH_API_KEY] \
     [--limit 100]
 
-
+Requires: pip install langsmith
 """
 
 import argparse
 import json
 import os
 import sys
-import urllib.parse
-import urllib.request
 from collections import Counter
 from datetime import datetime, timezone
 
-LANGSMITH_API_BASE = "https://api.smith.langchain.com/api/v1"
-
-
-def langsmith_request(endpoint, api_key, method="GET", body=None, params=None):
-    """Make a request to the LangSmith REST API."""
-    url = f"{LANGSMITH_API_BASE}/{endpoint}"
-    if params:
-        url += "?" + urllib.parse.urlencode(params)
-
-    headers = {
-        "x-api-key": api_key,
-        "Accept": "application/json",
-    }
-
-    data = None
-    if body is not None:
-        headers["Content-Type"] = "application/json"
-        data = json.dumps(body).encode("utf-8")
-
-    req = urllib.request.Request(url, data=data, headers=headers, method=method)
-    try:
-        with urllib.request.urlopen(req, timeout=30) as resp:
-            return json.loads(resp.read())
-    except urllib.error.HTTPError as e:
-        body_text = ""
-        try:
-            body_text = e.read().decode("utf-8", errors="replace")[:500]
-        except Exception:
-            pass
-        print(f"LangSmith API error {e.code}: {body_text}", file=sys.stderr)
-        return None
-    except Exception as e:
-        print(f"LangSmith API request failed: {e}", file=sys.stderr)
-        return None
-
-
-def fetch_runs(project_name, api_key, limit=100):
-    """Fetch recent root runs from a LangSmith project."""
-    # Try POST /runs/query first (newer API)
-    body = {
-        "project_name": project_name,
-        "is_root": True,
-        "limit": limit,
-    }
-    result = langsmith_request("runs/query", api_key, method="POST", body=body)
-    if result and isinstance(result, dict):
-        return result.get("runs", result.get("results", []))
-    if result and isinstance(result, list):
-        return result
-
-    # Fallback: GET /runs with query params
-    params = {
-        "project_name": project_name,
-        "is_root": "true",
-        "limit": str(limit),
-    }
-    result = langsmith_request("runs", api_key, params=params)
-    if result and isinstance(result, list):
-        return result
-    if result and isinstance(result, dict):
-        return result.get("runs", result.get("results", []))
-
-    return []
-
 
 def extract_input(run):
     """Extract user input from a run's inputs field."""

@@ -396,48 +328,35 @@ def generate_json_summary(analysis, project_name):
 def main():
     parser = argparse.ArgumentParser(description="Fetch and summarize production LangSmith traces")
     parser.add_argument("--project", required=True, help="LangSmith project name")
-    parser.add_argument("--api-key-env", default="LANGSMITH_API_KEY",
-                        help="Env var containing API key (default: LANGSMITH_API_KEY)")
     parser.add_argument("--limit", type=int, default=100, help="Max traces to fetch (default: 100)")
     parser.add_argument("--output-md", required=True, help="Output path for markdown seed")
     parser.add_argument("--output-json", required=True, help="Output path for JSON summary")
-
-
+    # Kept for backwards compatibility — silently ignored (SDK is now the only mode)
+    parser.add_argument("--use-sdk", action="store_true", help=argparse.SUPPRESS)
     args = parser.parse_args()
 
     print(f"Fetching up to {args.limit} traces from LangSmith project '{args.project}'...")
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-            }
-            runs.append(run_dict)
-    except ImportError:
-        print("langsmith package not installed. Use --use-sdk with pip install langsmith", file=sys.stderr)
-        sys.exit(1)
-    else:
-        api_key = os.environ.get(args.api_key_env, "")
-        if not api_key:
-            print(f"No API key found in ${args.api_key_env} — cannot fetch production traces", file=sys.stderr)
-            sys.exit(1)
-        runs = fetch_runs(args.project, api_key, args.limit)
+    from langsmith import Client
+    client = Client()
+    raw_runs = list(client.list_runs(
+        project_name=args.project, is_root=True, limit=args.limit,
+    ))
+    # Convert SDK run objects to dicts matching our analysis format
+    runs = []
+    for r in raw_runs:
+        run_dict = {
+            "id": str(r.id),
+            "name": r.name,
+            "inputs": r.inputs,
+            "outputs": r.outputs,
+            "error": r.error,
+            "total_tokens": r.total_tokens,
+            "feedback_stats": None,
+            "start_time": r.start_time.isoformat() if r.start_time else None,
+            "end_time": r.end_time.isoformat() if r.end_time else None,
+        }
+        runs.append(run_dict)
 
     if not runs:
         print("No traces found. The project may be empty or the name may be wrong.")