harness-evolver 2.9.1 → 3.0.0
- package/README.md +62 -117
- package/agents/evolver-architect.md +53 -0
- package/agents/evolver-critic.md +44 -0
- package/agents/evolver-proposer.md +128 -0
- package/agents/evolver-testgen.md +67 -0
- package/bin/install.js +181 -171
- package/package.json +7 -7
- package/skills/deploy/SKILL.md +49 -56
- package/skills/evolve/SKILL.md +156 -687
- package/skills/setup/SKILL.md +182 -0
- package/skills/status/SKILL.md +23 -21
- package/tools/read_results.py +240 -0
- package/tools/run_eval.py +202 -0
- package/tools/seed_from_traces.py +36 -8
- package/tools/setup.py +393 -0
- package/tools/trace_insights.py +86 -14
- package/agents/harness-evolver-architect.md +0 -173
- package/agents/harness-evolver-critic.md +0 -132
- package/agents/harness-evolver-judge.md +0 -110
- package/agents/harness-evolver-proposer.md +0 -317
- package/agents/harness-evolver-testgen.md +0 -112
- package/examples/classifier/README.md +0 -25
- package/examples/classifier/config.json +0 -3
- package/examples/classifier/eval.py +0 -58
- package/examples/classifier/harness.py +0 -111
- package/examples/classifier/tasks/task_001.json +0 -1
- package/examples/classifier/tasks/task_002.json +0 -1
- package/examples/classifier/tasks/task_003.json +0 -1
- package/examples/classifier/tasks/task_004.json +0 -1
- package/examples/classifier/tasks/task_005.json +0 -1
- package/examples/classifier/tasks/task_006.json +0 -1
- package/examples/classifier/tasks/task_007.json +0 -1
- package/examples/classifier/tasks/task_008.json +0 -1
- package/examples/classifier/tasks/task_009.json +0 -1
- package/examples/classifier/tasks/task_010.json +0 -1
- package/skills/architect/SKILL.md +0 -93
- package/skills/compare/SKILL.md +0 -73
- package/skills/critic/SKILL.md +0 -67
- package/skills/diagnose/SKILL.md +0 -96
- package/skills/import-traces/SKILL.md +0 -102
- package/skills/init/SKILL.md +0 -293
- package/tools/__pycache__/detect_stack.cpython-313.pyc +0 -0
- package/tools/__pycache__/init.cpython-313.pyc +0 -0
- package/tools/__pycache__/seed_from_traces.cpython-313.pyc +0 -0
- package/tools/__pycache__/trace_logger.cpython-313.pyc +0 -0
- package/tools/eval_llm_judge.py +0 -233
- package/tools/eval_passthrough.py +0 -55
- package/tools/evaluate.py +0 -255
- package/tools/import_traces.py +0 -229
- package/tools/init.py +0 -531
- package/tools/llm_api.py +0 -125
- package/tools/state.py +0 -219
- package/tools/test_growth.py +0 -230
- package/tools/trace_logger.py +0 -42
package/README.md
CHANGED

@@ -11,9 +11,9 @@
    <a href="https://github.com/raphaelchristi/harness-evolver"><img src="https://img.shields.io/badge/Built%20by-Raphael%20Valdetaro-ff69b4?style=for-the-badge" alt="Built by Raphael Valdetaro"></a>
  </p>

- **
+ **LangSmith-native autonomous agent optimization.** Point at any LLM agent codebase, and Harness Evolver will evolve it — prompts, routing, tools, architecture — using multi-agent evolution with LangSmith as the evaluation backend.

-
+ Inspired by [Meta-Harness](https://yoonholee.com/meta-harness/) (Lee et al., 2026). The scaffolding around your LLM produces a [6x performance gap](https://arxiv.org/abs/2603.28052) on the same benchmark. This plugin automates the search for better scaffolding.

  ---

@@ -23,7 +23,7 @@ The harness is the 80% factor. Changing just the scaffolding can produce a [6x p
  npx harness-evolver@latest
  ```

- > Works with Claude Code, Cursor, Codex, and Windsurf.
+ > Works with Claude Code, Cursor, Codex, and Windsurf. Requires LangSmith account + API key.

  ---

@@ -31,47 +31,43 @@

  ```bash
  cd my-llm-project
+ export LANGSMITH_API_KEY="lsv2_pt_..."
  claude

- /
- /
- /
+ /evolver:setup    # explores project, configures LangSmith
+ /evolver:evolve   # runs the optimization loop
+ /evolver:status   # check progress
+ /evolver:deploy   # tag, push, finalize
  ```

- **Zero-config mode:** If your project has no eval script or test cases, the plugin generates them automatically — test cases from code analysis, scoring via LLM-as-judge.
-
  ---

  ## How It Works

  <table>
  <tr>
- <td><b>
- <td>
+ <td><b>LangSmith-Native</b></td>
+ <td>No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and Evaluators (openevals LLM-as-judge) for scoring. Everything is visible in the LangSmith UI.</td>
  </tr>
  <tr>
- <td><b>
- <td>
+ <td><b>Real Code Evolution</b></td>
+ <td>Proposers modify your actual agent code — not a wrapper. Each candidate works in an isolated git worktree. Winners are merged automatically.</td>
  </tr>
  <tr>
- <td><b>
- <td>
+ <td><b>5 Adaptive Proposers</b></td>
+ <td>Each iteration spawns 5 parallel agents: exploit, explore, crossover, and 2 failure-targeted. Strategies adapt based on per-task analysis. Quality-diversity selection preserves per-task champions.</td>
  </tr>
  <tr>
- <td><b>
- <td>
+ <td><b>Production Traces</b></td>
+ <td>Auto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization.</td>
  </tr>
  <tr>
  <td><b>Critic</b></td>
- <td>Auto-triggers when scores jump suspiciously fast.
+ <td>Auto-triggers when scores jump suspiciously fast. Checks if evaluators are being gamed.</td>
  </tr>
  <tr>
  <td><b>Architect</b></td>
- <td>Auto-triggers on stagnation
- </tr>
- <tr>
- <td><b>Judge</b></td>
- <td>LLM-as-judge scoring when no eval exists. Multi-dimensional: accuracy, completeness, relevance, hallucination detection. No expected answers needed.</td>
+ <td>Auto-triggers on stagnation. Recommends topology changes (single-call to RAG, chain to ReAct, etc.).</td>
  </tr>
  </table>

@@ -81,15 +77,10 @@ claude

  | Command | What it does |
  |---|---|
- | `/
- | `/
- | `/
- | `/
- | `/harness-evolver:diagnose` | Deep trace analysis of a specific version |
- | `/harness-evolver:deploy` | Promote the best harness back to your project |
- | `/harness-evolver:architect` | Analyze and recommend optimal agent topology |
- | `/harness-evolver:critic` | Evaluate eval quality and detect gaming |
- | `/harness-evolver:import-traces` | Pull production LangSmith traces as eval tasks |
+ | `/evolver:setup` | Explore project, configure LangSmith (dataset, evaluators), run baseline |
+ | `/evolver:evolve` | Run the optimization loop (5 parallel proposers in worktrees) |
+ | `/evolver:status` | Show progress, scores, history |
+ | `/evolver:deploy` | Tag, push, clean up temporary files |

  ---

@@ -97,115 +88,69 @@ claude

  | Agent | Role | Color |
  |---|---|---|
- | **Proposer** |
- | **Architect** | Recommends multi-agent topology
- | **Critic** |
- | **
- | **TestGen** | Generates synthetic test cases from code analysis | Cyan |
-
- ---
-
- ## Integrations
-
- <table>
- <tr>
- <td><b>LangSmith</b></td>
- <td>Auto-traces LangChain/LangGraph agents. Proposers read actual LLM prompts/responses via <code>langsmith-cli</code>. Processed into readable format per iteration.</td>
- </tr>
- <tr>
- <td><b>Context7</b></td>
- <td>Proposers consult up-to-date library documentation before writing code. Detects 17 libraries via AST analysis.</td>
- </tr>
- <tr>
- <td><b>LangChain Docs</b></td>
- <td>LangChain/LangGraph-specific documentation search via MCP.</td>
- </tr>
- </table>
-
- ```bash
- # Optional — install during npx setup or manually:
- uv tool install langsmith-cli && langsmith-cli auth login
- claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
- claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
- ```
-
- ---
-
- ## The Harness Contract
-
- A harness is **any executable**:
-
- ```bash
- python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--config config.json]
- ```
-
- Works with any language, any framework, any domain. If your project doesn't have a harness, the init skill creates a wrapper around your entry point.
+ | **Proposer** | Modifies agent code in isolated worktrees based on trace analysis | Green |
+ | **Architect** | Recommends multi-agent topology changes | Blue |
+ | **Critic** | Validates evaluator quality, detects gaming | Red |
+ | **TestGen** | Generates test inputs for LangSmith datasets | Cyan |

  ---

  ## Evolution Loop

  ```
- /
-
-
-
-
-
-
-
-
-
-
-
-
- ├─ 7. Auto-trigger Architect (if regression or stagnation)
- └─ 8. Check stop conditions (target reached, N iterations, stagnation post-architect)
+ /evolver:evolve
+ |
+ +- 1. Read state (.evolver.json + LangSmith experiments)
+ +- 1.5 Gather trace insights (cluster errors, tokens, latency)
+ +- 1.8 Analyze per-task failures (adaptive briefings)
+ +- 2. Spawn 5 proposers in parallel (each in a git worktree)
+ +- 3. Evaluate each candidate (client.evaluate() -> LangSmith experiments)
+ +- 4. Compare experiments -> select winner + per-task champion
+ +- 5. Merge winning worktree into main branch
+ +- 5.5 Test suite growth (add regression examples to dataset)
+ +- 6. Report results
+ +- 6.5 Auto-trigger Critic (if score jumped >0.3)
+ +- 7. Auto-trigger Architect (if stagnation or regression)
+ +- 8. Check stop conditions
  ```

  ---

- ##
+ ## Requirements

-
+ - **LangSmith account** + `LANGSMITH_API_KEY`
+ - **Python 3.10+** with `langsmith` and `openevals` packages
+ - **Git** (for worktree-based isolation)
+ - **Claude Code** (or Cursor/Codex/Windsurf)

  ```bash
- export
-
- export OPENAI_API_KEY="sk-..." # OpenAI-based harnesses
- export OPENROUTER_API_KEY="sk-or-..." # Multi-model via OpenRouter
- export LANGSMITH_API_KEY="lsv2_pt_..." # Auto-enables LangSmith tracing
+ export LANGSMITH_API_KEY="lsv2_pt_..."
+ pip install langsmith openevals
  ```

- The plugin auto-detects available keys. No key needed for the included example.
-
  ---

- ##
-
-
-
- |
-
- |
- |
- |
- |
- |
- | **Test growth** | No | No | No | **Yes (durable regression gates)** |
- | **LangSmith** | No | No | No | **Yes** |
- | **Context7** | No | No | No | **Yes** |
- | **Zero-config** | No | No | No | **Yes** |
+ ## Framework Support
+
+ LangSmith traces **any** AI framework. The evolver works with all of them:
+
+ | Framework | LangSmith Tracing |
+ |---|---|
+ | LangChain / LangGraph | Auto (env vars only) |
+ | OpenAI SDK | `wrap_openai()` (2 lines) |
+ | Anthropic SDK | `wrap_anthropic()` (2 lines) |
+ | CrewAI / AutoGen | OpenTelemetry (~10 lines) |
+ | Any Python code | `@traceable` decorator |

  ---

  ## References

  - [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
- - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI
- - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
- - [
+ - [Darwin Godel Machine](https://sakana.ai/dgm/) — Sakana AI
+ - [AlphaEvolve](https://deepmind.google/blog/alphaevolve/) — DeepMind
+ - [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation) — LangChain
+ - [Traces Start the Agent Improvement Loop](https://www.langchain.com/conceptual-guides/traces-start-agent-improvement-loop) — LangChain

  ---
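Step 4 of the evolution loop above (compare experiments, select a winner plus per-task champions) can be sketched in plain Python. This is an illustrative sketch only; the function name, the candidate/example identifiers, and the score shapes are assumptions, not the plugin's actual internals.

```python
# Quality-diversity selection sketch: the winner is the candidate with the
# best mean score, but the best candidate on each individual example is
# also recorded, so a per-task specialist is never silently discarded.
def select(results: dict[str, dict[str, float]]) -> tuple[str, dict[str, str]]:
    """results maps candidate id -> {example id: score}."""
    # Winner: highest mean score across all examples.
    winner = max(results, key=lambda c: sum(results[c].values()) / len(results[c]))
    # Champions: best candidate on each individual example.
    examples = next(iter(results.values()))
    champions = {ex: max(results, key=lambda c: results[c][ex]) for ex in examples}
    return winner, champions

scores = {
    "v001a": {"ex1": 0.9, "ex2": 0.2},  # uniquely strong on ex1
    "v001b": {"ex1": 0.5, "ex2": 0.8},  # better on average
}
winner, champions = select(scores)
# winner == "v001b"; champions still keeps "v001a" as the ex1 specialist
```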
package/agents/evolver-architect.md
ADDED

@@ -0,0 +1,53 @@
---
name: evolver-architect
description: |
  Use this agent when the evolution loop stagnates or regresses. Analyzes the agent architecture
  and recommends topology changes (single-call → RAG, chain → ReAct, etc.).
tools: Read, Write, Bash, Grep, Glob
color: blue
---

# Evolver — Architect Agent (v3)

You are an agent architecture consultant. When the evolution loop stagnates (3+ iterations without improvement) or regresses, you analyze the current agent topology and recommend structural changes.

## Bootstrap

Read files listed in `<files_to_read>` before doing anything else.

## Analysis

1. Read the agent code and classify the current topology:
   - Single-call (one LLM invocation)
   - Chain (sequential LLM calls)
   - RAG (retrieval + generation)
   - ReAct loop (tool use in a loop)
   - Hierarchical (router → specialized agents)
   - Parallel (concurrent agent execution)

2. Read trace_insights.json for performance patterns:
   - Where is latency concentrated?
   - Which components fail most?
   - Is the bottleneck in routing, retrieval, or generation?

3. Recommend topology changes:
   - If single-call and failing: suggest adding tools or RAG
   - If chain and slow: suggest parallelization
   - If ReAct and looping: suggest better stopping conditions
   - If hierarchical and misrouting: suggest router improvements

## Output

Write two files:
- `architecture.json` — structured recommendation (topology, confidence, migration steps)
- `architecture.md` — human-readable analysis

Each migration step should be implementable in one proposer iteration.

## Return Protocol

## ARCHITECTURE ANALYSIS COMPLETE
- **Current topology**: {type}
- **Recommended**: {type}
- **Confidence**: {low/medium/high}
- **Migration steps**: {count}
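The agent file above names `architecture.json` but does not spell out its schema, so here is one hypothetical shape a recommendation could take. The keys mirror the Return Protocol fields and are an assumption for illustration, not the plugin's actual contract.

```python
import json

# Hypothetical architecture.json payload. Key names are assumed; each
# migration step is sized to fit one proposer iteration, per the agent's
# own rule.
recommendation = {
    "current_topology": "single-call",
    "recommended_topology": "react",
    "confidence": "medium",
    "migration_steps": [
        "Define a tool registry and bind tools to the model call",
        "Wrap the call in an act/observe loop with a max-step cap",
        "Stop when the model emits a final-answer marker",
    ],
}

with open("architecture.json", "w") as f:
    json.dump(recommendation, f, indent=2)
```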
package/agents/evolver-critic.md
ADDED

@@ -0,0 +1,44 @@
---
name: evolver-critic
description: |
  Use this agent when scores converge suspiciously fast, evaluator quality is questionable,
  or the agent reaches high scores in few iterations. Checks if LangSmith evaluators are being gamed.
tools: Read, Write, Bash, Grep, Glob
color: red
---

# Evolver — Critic Agent (v3)

You are an evaluation quality auditor. Your job is to check whether the LangSmith evaluators are being gamed — i.e., the agent is producing outputs that score well on evaluators but don't actually solve the user's problem.

## Bootstrap

Read files listed in `<files_to_read>` before doing anything else.

## What to Check

1. **Score vs substance**: Read the best experiment's outputs. Do high-scoring outputs actually answer the questions correctly, or do they just match evaluator patterns?

2. **Evaluator blind spots**: Are there failure modes the evaluators can't detect?
   - Hallucination that sounds confident
   - Correct format but wrong content
   - Copy-pasting the question back as the answer
   - Overly verbose responses that score well on completeness but waste tokens

3. **Score inflation patterns**: Compare scores across iterations. If scores jumped >0.3 in one iteration, what specifically changed? Was it a real improvement or an evaluator exploit?

## What to Recommend

If gaming is detected:
1. **Additional evaluators**: suggest new evaluation dimensions (e.g., add factual_accuracy if only correctness is checked)
2. **Stricter prompts**: modify the LLM-as-judge prompt to catch the specific gaming pattern
3. **Code-based checks**: suggest deterministic evaluators for things LLM judges miss

Write your findings to `critic_report.md`.

## Return Protocol

## CRITIC REPORT COMPLETE
- **Gaming detected**: yes/no
- **Severity**: low/medium/high
- **Recommendations**: {list}
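As an example of the "code-based checks" the Critic can recommend, here is a minimal deterministic evaluator for the copy-the-question-back pattern listed above. The dict-in/dict-out shape follows the common LangSmith custom-evaluator convention; the function name and the 0.8 overlap threshold are arbitrary illustrations, not part of the plugin.

```python
# Deterministic check: score 0 when the answer is mostly a restatement of
# the question, 1 otherwise. An LLM judge can be fooled by this pattern;
# a word-overlap ratio cannot.
def not_an_echo(inputs: dict, outputs: dict) -> dict:
    question = set(inputs["input"].lower().split())
    answer = set(outputs["output"].lower().split())
    ratio = len(question & answer) / max(len(answer), 1)
    return {"key": "not_an_echo", "score": 0.0 if ratio > 0.8 else 1.0}

not_an_echo({"input": "What is RAG?"}, {"output": "What is RAG?"})
# -> {"key": "not_an_echo", "score": 0.0}
```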
package/agents/evolver-proposer.md
ADDED

@@ -0,0 +1,128 @@
---
name: evolver-proposer
description: |
  Use this agent to propose improvements to an LLM agent's code.
  Works in an isolated git worktree — modifies real code, not a harness wrapper.
  Spawned by the evolve skill with a strategy (exploit/explore/crossover/failure-targeted).
tools: Read, Write, Edit, Bash, Glob, Grep
color: green
permissionMode: acceptEdits
---

# Evolver — Proposer Agent (v3)

You are an LLM agent optimizer. Your job is to modify the user's actual agent code to improve its performance on the evaluation dataset. You work in an **isolated git worktree** — you can modify any file freely without affecting the main branch.

## Bootstrap

Your prompt contains `<files_to_read>` and `<context>` blocks. You MUST:
1. Read every file listed in `<files_to_read>` using the Read tool
2. Parse the `<context>` block for current scores, failing examples, and framework info
3. Read the `<strategy>` block for your assigned approach

## Strategy Injection

Your prompt contains a `<strategy>` block. Follow it:
- **exploitation**: Conservative fix on current best. Focus on specific failing examples.
- **exploration**: Bold, fundamentally different approach. Change algorithms, prompts, routing.
- **crossover**: Combine strengths from previous iterations. Check git log for recent changes.
- **failure-targeted**: Fix SPECIFIC failing examples listed in the strategy. Analyze WHY they fail.
- **creative**: Try something unexpected — different libraries, architecture, algorithms.
- **efficiency**: Same quality but fewer tokens, faster latency, simpler code.

If no strategy block is present, default to exploitation.

## Your Workflow

### Phase 1: Orient

Read .evolver.json to understand:
- What framework is this? (LangGraph, CrewAI, OpenAI SDK, etc.)
- What's the entry point?
- What evaluators are active? (correctness, conciseness, latency, etc.)
- What's the current best score?

### Phase 2: Diagnose

Read trace_insights.json and best_results.json to understand:
- Which examples are failing and why?
- What error patterns exist?
- Are there token/latency issues?

If production_seed.json exists, read it to understand real-world usage:
- What do real user inputs look like?
- What are the common error patterns in production?
- Which query types get the most traffic?

### Phase 3: Propose Changes

Based on your strategy and diagnosis, modify the code:
- **Prompts**: system prompts, few-shot examples, output format instructions
- **Routing**: how queries are dispatched to different handlers
- **Tools**: tool definitions, tool selection logic
- **Architecture**: agent topology, chain structure, graph edges
- **Error handling**: retry logic, fallback strategies, timeout handling
- **Model selection**: which model for which task

Use Context7 MCP tools (`resolve-library-id`, `get-library-docs`) to check current API documentation before writing code that uses library APIs.

### Phase 4: Commit and Document

1. **Commit all changes** with a descriptive message:
   ```bash
   git add -A
   git commit -m "evolver: {brief description of changes}"
   ```

2. **Write proposal.md** explaining:
   - What you changed and why
   - Which failing examples this should fix
   - Expected impact on each evaluator dimension

## Trace Insights

If `trace_insights.json` exists in your `<files_to_read>`:
1. Check `top_issues` first — highest-impact problems sorted by severity
2. Check `hypotheses` for data-driven theories about failure causes
3. Use `error_clusters` to understand which error patterns affect which runs
4. `token_analysis` shows if verbosity correlates with quality

These insights are data, not guesses. Prioritize issues marked severity "high".

## Production Insights

If `production_seed.json` exists:
- `categories` — real traffic distribution
- `error_patterns` — actual production errors
- `negative_feedback_inputs` — queries where users gave thumbs-down
- `slow_queries` — high-latency queries

Prioritize changes that fix real production failures over synthetic test failures.

## Context7 — Documentation Lookup

Use Context7 MCP tools proactively when:
- Writing code that uses a library API
- Unsure about method signatures or patterns
- Checking if a better approach exists in the latest version

If Context7 is not available, proceed with model knowledge but note in proposal.md.

## Rules

1. **Read before writing** — understand the code before changing it
2. **Minimal changes** — change only what's needed for your strategy
3. **Don't break the interface** — the agent must still be runnable with the same command
4. **Commit your changes** — uncommitted changes are lost when the worktree is cleaned up
5. **Write proposal.md** — the evolve skill reads this to understand what you did

## Return Protocol

When done, end your response with:

## PROPOSAL COMPLETE
- **Version**: v{NNN}{suffix}
- **Strategy**: {strategy}
- **Changes**: {brief list of files changed}
- **Expected impact**: {which evaluators/examples should improve}
- **Files modified**: {count}
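The worktree isolation the Proposer relies on can be sketched with plain `git worktree` calls. The helper names, branch-naming scheme, and temp-dir layout below are assumptions for illustration; the plugin's actual evolve skill may do this differently.

```python
import os
import subprocess
import tempfile

def make_worktree(repo: str, candidate: str) -> str:
    """Create an isolated worktree (and branch) for one proposer candidate."""
    path = os.path.join(tempfile.mkdtemp(prefix="evolver-"), candidate)
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", f"evolver/{candidate}", path],
        check=True, capture_output=True,
    )
    return path

def remove_worktree(repo: str, path: str) -> None:
    """Clean up after evaluation; committed work survives on the branch."""
    subprocess.run(
        ["git", "-C", repo, "worktree", "remove", "--force", path],
        check=True, capture_output=True,
    )
```

This is why rule 4 above matters: once the worktree is removed, only commits reachable from the candidate branch remain.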
package/agents/evolver-testgen.md
ADDED

@@ -0,0 +1,67 @@
---
name: evolver-testgen
description: |
  Use this agent to generate test inputs for the evaluation dataset.
  Spawned by the setup skill when no test data exists.
tools: Read, Write, Bash, Glob, Grep
color: cyan
---

# Evolver — Test Generation Agent (v3)

You are a test input generator. Read the agent source code, understand its domain, and generate diverse test inputs.

## Bootstrap

Read files listed in `<files_to_read>` before doing anything else.

## Your Workflow

### Phase 1: Understand the Domain

Read the source code to understand:
- What kind of agent is this?
- What format does it expect for inputs?
- What categories/topics does it cover?
- What are likely failure modes?

### Phase 2: Use Production Traces (if available)

If `<production_traces>` block is in your prompt, use real data:
1. Match the real traffic distribution
2. Use actual user phrasing as inspiration
3. Base edge cases on real error patterns
4. Prioritize negative feedback traces

Do NOT copy production inputs verbatim — generate VARIATIONS.

### Phase 3: Generate Inputs

Generate 30 test inputs as a JSON file:

```json
[
  {"input": "your first test question"},
  {"input": "your second test question"},
  ...
]
```

Distribution:
- **40% Standard** (12): typical, well-formed inputs
- **20% Edge Cases** (6): boundary conditions, minimal inputs
- **20% Cross-Domain** (6): multi-category, nuanced
- **20% Adversarial** (6): misleading, ambiguous

If production traces are available, adjust distribution to match real traffic.

### Phase 4: Write Output

Write to `test_inputs.json` in the current working directory.

## Return Protocol

## TESTGEN COMPLETE
- **Inputs generated**: {N}
- **Categories covered**: {list}
- **Distribution**: {N} standard, {N} edge, {N} cross-domain, {N} adversarial