harness-evolver 3.3.1 → 4.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,7 +1,7 @@
  {
  "name": "harness-evolver",
  "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
- "version": "3.3.1",
+ "version": "4.0.2",
  "author": {
  "name": "Raphael Valdetaro"
  },
package/README.md CHANGED
@@ -79,12 +79,24 @@ claude
  <td>Auto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization.</td>
  </tr>
  <tr>
- <td><b>Critic</b></td>
- <td>Auto-triggers when scores jump suspiciously fast. Checks if evaluators are being gamed.</td>
+ <td><b>Active Critic</b></td>
+ <td>Auto-triggers when scores jump suspiciously fast. Detects evaluator gaming AND implements stricter evaluators to close loopholes.</td>
  </tr>
  <tr>
- <td><b>Architect</b></td>
- <td>Auto-triggers on stagnation. Recommends topology changes (single-call to RAG, chain to ReAct, etc.).</td>
+ <td><b>ULTRAPLAN Architect</b></td>
+ <td>Auto-triggers on stagnation. Runs with Opus model for deep architectural analysis. Recommends topology changes (single-call to RAG, chain to ReAct, etc.).</td>
+ </tr>
+ <tr>
+ <td><b>Evolution Memory</b></td>
+ <td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which strategies win, which failures recur, and promotes insights after 2+ occurrences.</td>
+ </tr>
+ <tr>
+ <td><b>Smart Gating</b></td>
+ <td>Three-gate iteration triggers (score plateau, cost budget, convergence detection) replace blind N-iteration loops. State validation ensures config hasn't diverged from LangSmith.</td>
+ </tr>
+ <tr>
+ <td><b>Background Mode</b></td>
+ <td>Run all iterations in background while you continue working. Get notified on completion or significant improvements.</td>
  </tr>
  </table>

@@ -107,9 +119,10 @@ claude
  |---|---|---|
  | **Proposer** | Modifies agent code in isolated worktrees based on trace analysis | Green |
  | **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
- | **Architect** | Recommends multi-agent topology changes | Blue |
- | **Critic** | Validates evaluator quality, detects gaming | Red |
- | **TestGen** | Generates test inputs for LangSmith datasets | Cyan |
+ | **Architect** | ULTRAPLAN mode — deep topology analysis with Opus model | Blue |
+ | **Critic** | Active mode — detects gaming AND implements stricter evaluators | Red |
+ | **Consolidator** | Cross-iteration memory consolidation (autoDream-inspired) | Cyan |
+ | **TestGen** | Generates test inputs + adversarial injection mode | Cyan |

  ---

@@ -118,19 +131,23 @@ claude
  ```
  /evolver:evolve
  |
- +- 1. Read state (.evolver.json + LangSmith experiments)
- +- 1.5 Gather trace insights (cluster errors, tokens, latency)
- +- 1.8 Analyze per-task failures (adaptive briefings)
- +- 2. Spawn 5 proposers in parallel (each in a git worktree)
- +- 3. Run target for each candidate (client.evaluate() -> code-based evaluators)
- +- 3.5 Spawn evaluator agent (reads outputs via langsmith-cli, judges, writes scores)
- +- 4. Compare experiments -> select winner + per-task champion
- +- 5. Merge winning worktree into main branch
- +- 5.5 Test suite growth (add regression examples to dataset)
- +- 6. Report results
- +- 6.5 Auto-trigger Critic (if score jumped >0.3)
- +- 7. Auto-trigger Architect (if stagnation or regression)
- +- 8. Check stop conditions
+ +- 0.5 Validate state (skeptical memory — check .evolver.json vs LangSmith)
+ +- 1. Read state (.evolver.json + LangSmith experiments)
+ +- 1.5 Gather trace insights (cluster errors, tokens, latency)
+ +- 1.8 Analyze per-task failures (adaptive briefings)
+ +- 1.8a Synthesize strategy document (coordinator synthesis)
+ +- 1.9 Prepare shared proposer context (KV cache-optimized prefix)
+ +- 2. Spawn 5 proposers in parallel (each in a git worktree)
+ +- 3. Run target for each candidate (code-based evaluators)
+ +- 3.5 Spawn evaluator agent (LLM-as-judge via langsmith-cli)
+ +- 4. Compare experiments -> select winner + per-task champion
+ +- 5. Merge winning worktree into main branch
+ +- 5.5 Regression tracking (auto-add guard examples to dataset)
+ +- 6. Report results
+ +- 6.2 Consolidate evolution memory (orient/gather/consolidate/prune)
+ +- 6.5 Auto-trigger Active Critic (detect + fix evaluator gaming)
+ +- 7. Auto-trigger ULTRAPLAN Architect (opus model, deep analysis)
+ +- 8. Three-gate check (score plateau, cost budget, convergence)
  ```

  ---
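Step 8's three-gate check replaces a fixed iteration count. A rough sketch of that logic in Python (hypothetical — the packaged `iteration_gate.py` is not included in this diff, so the function name, signature, and thresholds below are assumptions):

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    proceed: bool
    reason: str

def check_gates(scores, cost_so_far, cost_budget,
                plateau_window=3, plateau_epsilon=0.01,
                convergence_threshold=0.95):
    """Stop iterating when any of the three gates closes (illustrative only)."""
    # Gate 1: score plateau — no meaningful gain over the last N iterations
    if len(scores) >= plateau_window + 1:
        recent_gain = scores[-1] - scores[-1 - plateau_window]
        if recent_gain < plateau_epsilon:
            return GateResult(False, "plateau")
    # Gate 2: cost budget exhausted
    if cost_so_far >= cost_budget:
        return GateResult(False, "budget")
    # Gate 3: convergence — score already near the ceiling, little left to gain
    if scores and scores[-1] >= convergence_threshold:
        return GateResult(False, "converged")
    return GateResult(True, "continue")
```

Any single closed gate stops the loop, which is what makes this stricter than a blind N-iteration budget.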
@@ -148,18 +165,26 @@ Skills (markdown)
  └── /evolver:deploy → tags and pushes

  Agents (markdown)
- ├── Proposer (x5) → modifies code in git worktrees
+ ├── Proposer (x5) → modifies code in isolated git worktrees
  ├── Evaluator → LLM-as-judge via langsmith-cli
- ├── Critic → detects evaluator gaming
- ├── Architect → recommends topology changes
- └── TestGen generates test inputs
+ ├── Critic → detects gaming + implements stricter evaluators
+ ├── Architect → ULTRAPLAN deep analysis (opus model)
+ ├── Consolidator → cross-iteration memory (autoDream-inspired)
+ └── TestGen → generates test inputs + adversarial injection

  Tools (Python + langsmith SDK)
- ├── setup.py → creates datasets, configures evaluators
- ├── run_eval.py → runs target against dataset
- ├── read_results.py → compares experiments
- ├── trace_insights.py → clusters errors from traces
- └── seed_from_traces.py → imports production traces
+ ├── setup.py → creates datasets, configures evaluators
+ ├── run_eval.py → runs target against dataset
+ ├── read_results.py → compares experiments
+ ├── trace_insights.py → clusters errors from traces
+ ├── seed_from_traces.py → imports production traces
+ ├── validate_state.py → validates config vs LangSmith state
+ ├── iteration_gate.py → three-gate iteration triggers
+ ├── regression_tracker.py → tracks regressions, adds guard examples
+ ├── consolidate.py → cross-iteration memory consolidation
+ ├── synthesize_strategy.py → generates strategy document for proposers
+ ├── add_evaluator.py → programmatically adds evaluators
+ └── adversarial_inject.py → detects memorization, injects adversarial tests
  ```

  ---
@@ -5,44 +5,76 @@ description: |
  and recommends topology changes (single-call → RAG, chain → ReAct, etc.).
  tools: Read, Write, Bash, Grep, Glob
  color: blue
+ model: opus
  ---

- # Evolver — Architect Agent (v3)
+ # Evolver — Architect Agent (v3.1 — ULTRAPLAN Mode)

- You are an agent architecture consultant. When the evolution loop stagnates (3+ iterations without improvement) or regresses, you analyze the current agent topology and recommend structural changes.
+ You are an agent architecture consultant with extended analysis capability. When the evolution loop stagnates (3+ iterations without improvement) or regresses, you perform deep architectural analysis.

  ## Bootstrap

  Read files listed in `<files_to_read>` before doing anything else.

- ## Analysis
+ ## Deep Analysis Mode

- 1. Read the agent code and classify the current topology:
- - Single-call (one LLM invocation)
- - Chain (sequential LLM calls)
- - RAG (retrieval + generation)
- - ReAct loop (tool use in a loop)
- - Hierarchical (router → specialized agents)
- - Parallel (concurrent agent execution)
+ You are running with the Opus model and should take your time for thorough analysis. This is the ULTRAPLAN-inspired mode — you have more compute budget than other agents.

- 2. Read trace_insights.json for performance patterns:
- - Where is latency concentrated?
- - Which components fail most?
- - Is the bottleneck in routing, retrieval, or generation?
+ ### Step 1: Full Codebase Scan

- 3. Recommend topology changes:
- - If single-call and failing: suggest adding tools or RAG
- - If chain and slow: suggest parallelization
- - If ReAct and looping: suggest better stopping conditions
- - If hierarchical and misrouting: suggest router improvements
+ Read ALL source files related to the agent, not just the entry point:
+ - Entry point and all imports
+ - Configuration files
+ - Tool definitions
+ - Prompt templates
+ - Any routing or orchestration logic
+
+ ### Step 2: Topology Classification
+
+ Classify the current architecture:
+ - **Single-call**: one LLM invocation, no tools
+ - **Chain**: sequential LLM calls (A → B → C)
+ - **RAG**: retrieval + generation pipeline
+ - **ReAct loop**: tool use in a loop (observe → think → act)
+ - **Hierarchical**: router → specialized agents
+ - **Parallel**: concurrent agent execution
+
+ Use `$TOOLS/analyze_architecture.py` for AST-based classification:
+
+ ```bash
+ $EVOLVER_PY $TOOLS/analyze_architecture.py --harness {entry_point_file} -o architecture_analysis.json
+ ```
+
+ ### Step 3: Performance Pattern Analysis
+
+ Read trace_insights.json and evolution_memory.json to identify:
+ - Where is latency concentrated?
+ - Which components fail most?
+ - Is the bottleneck in routing, retrieval, or generation?
+ - What has been tried and failed (from evolution memory)?
+ - Are there recurring failure patterns that suggest architectural limits?
+
+ ### Step 4: Recommend Migration
+
+ Based on the topology + performance analysis:
+ - Single-call failing → suggest adding tools or RAG
+ - Chain slow → suggest parallelization
+ - ReAct looping excessively → suggest better stopping conditions or hierarchical routing
+ - Hierarchical misrouting → suggest router improvements
+ - Any topology hitting accuracy ceiling → suggest ensemble or verification layer
+
+ Each migration step must be implementable in ONE proposer iteration.

  ## Output

  Write two files:
- - `architecture.json` — structured recommendation (topology, confidence, migration steps)
- - `architecture.md` — human-readable analysis
-
- Each migration step should be implementable in one proposer iteration.
+ - `architecture.json` — structured recommendation with topology, confidence, migration steps
+ - `architecture.md` — detailed human-readable analysis with:
+ - Current architecture diagram (ASCII)
+ - Identified bottlenecks
+ - Proposed architecture diagram
+ - Step-by-step migration plan
+ - Expected score impact per step

  ## Return Protocol

@@ -51,3 +83,4 @@ Each migration step should be implementable in one proposer iteration.
  - **Recommended**: {type}
  - **Confidence**: {low/medium/high}
  - **Migration steps**: {count}
+ - **Analysis depth**: ULTRAPLAN (extended thinking)
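Step 2 of the Architect's workflow delegates topology detection to `$TOOLS/analyze_architecture.py`, whose internals are not part of this diff. Purely as an illustration of what AST-based classification of the listed topologies could look like (the method names treated as "LLM" or "retrieval" calls here are assumptions):

```python
import ast

def classify_topology(source: str) -> str:
    """Very rough AST heuristics for the topology classes above (illustrative only)."""
    tree = ast.parse(source)
    # Collect method names from attribute-style calls, e.g. client.messages.create(...)
    calls = [n.func.attr for n in ast.walk(tree)
             if isinstance(n, ast.Call) and isinstance(n.func, ast.Attribute)]
    has_loop = any(isinstance(n, (ast.While, ast.For)) for n in ast.walk(tree))
    # Hypothetical markers for LLM and retrieval calls
    llm_calls = [c for c in calls if c in ("invoke", "create", "complete")]
    uses_retrieval = any(c in ("similarity_search", "retrieve", "get_relevant_documents")
                         for c in calls)
    if uses_retrieval:
        return "rag"
    if has_loop and llm_calls:
        return "react"
    if len(llm_calls) > 1:
        return "chain"
    return "single-call"
```

A real classifier would also need to detect hierarchical and parallel topologies (e.g. router dispatch, `asyncio.gather`), which this sketch omits.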
@@ -0,0 +1,57 @@
+ ---
+ name: evolver-consolidator
+ description: |
+ Background agent for cross-iteration memory consolidation.
+ Runs after each iteration to extract learnings and update evolution_memory.md.
+ Read-only analysis — does not modify agent code.
+ tools: Read, Bash, Glob, Grep
+ color: cyan
+ ---
+
+ # Evolver — Consolidator Agent
+
+ You are a memory consolidation agent inspired by Claude Code's autoDream pattern. Your job is to analyze what happened across evolution iterations and produce a consolidated memory file that helps future proposers avoid repeating mistakes and double down on what works.
+
+ ## Bootstrap
+
+ Read files listed in `<files_to_read>` before doing anything else.
+
+ ## Four-Phase Process
+
+ ### Phase 1: Orient
+ Read `.evolver.json` history and `evolution_memory.md` (if it exists) to understand:
+ - How many iterations have run
+ - Score trajectory (improving, stagnating, regressing?)
+ - What insights already exist
+
+ ### Phase 2: Gather
+ Read `comparison.json`, `trace_insights.json`, `regression_report.json`, and any `proposal.md` files in recent worktrees to extract:
+ - Which proposer strategy won this iteration (exploit/explore/crossover/failure-targeted)
+ - What failure patterns persist across iterations
+ - What approaches were tried and failed
+ - What regressions occurred
+
+ ### Phase 3: Consolidate
+ Merge new signals with existing memory:
+ - Update recurrence counts for repeated patterns
+ - Resolve contradictions (newer information wins)
+ - Promote insights seen 2+ times to "Key Insights"
+ - Demote insights that haven't recurred
+
+ ### Phase 4: Prune
+ - Cap at 20 insights max
+ - Remove insights with 0 recurrence after 3 iterations
+ - Keep the markdown under 2KB
+
+ ## Constraints
+
+ - **Read-only**: Do not modify agent code, only produce `evolution_memory.md` and `evolution_memory.json`
+ - **No tool invocation**: Use Bash only for `cat`, `ls`, `grep` — read-only commands
+ - **Be concise**: Each insight should be one line, actionable
+
+ ## Return Protocol
+
+ ## CONSOLIDATION COMPLETE
+ - **Insights promoted**: {N} (seen 2+ times)
+ - **Observations pending**: {N} (seen 1 time)
+ - **Top insight**: {most impactful pattern}
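The Consolidate and Prune phases amount to a small bookkeeping algorithm. A hypothetical sketch of it (the Consolidator itself is a markdown agent, not this code; the record shape and field names below are assumptions):

```python
def consolidate(memory: list, new_signals: list,
                promote_at: int = 2, max_insights: int = 20) -> list:
    """Merge one iteration's signals into memory.

    Assumed record shape: {"text": str, "count": int, "age": int},
    where `age` counts iterations since the insight last recurred.
    """
    by_text = {m["text"]: m for m in memory}
    for m in by_text.values():
        m["age"] += 1  # everything ages one iteration
    for signal in new_signals:
        if signal in by_text:
            by_text[signal]["count"] += 1  # recurrence observed
            by_text[signal]["age"] = 0
        else:
            by_text[signal] = {"text": signal, "count": 1, "age": 0}
    # Prune: drop one-off observations that never recurred within 3 iterations
    merged = [m for m in by_text.values()
              if not (m["count"] < 2 and m["age"] >= 3)]
    # Cap at the most-recurrent insights
    merged.sort(key=lambda m: m["count"], reverse=True)
    merged = merged[:max_insights]
    for m in merged:
        m["promoted"] = m["count"] >= promote_at  # "Key Insights" threshold
    return merged
```

"Newer information wins" for contradictions would need semantic comparison (an LLM judgment), which plain counting cannot express — that part stays with the agent.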
@@ -2,43 +2,86 @@
  name: evolver-critic
  description: |
  Use this agent when scores converge suspiciously fast, evaluator quality is questionable,
- or the agent reaches high scores in few iterations. Checks if LangSmith evaluators are being gamed.
+ or the agent reaches high scores in few iterations. Detects gaming AND implements fixes.
  tools: Read, Write, Bash, Grep, Glob
  color: red
  ---

- # Evolver — Critic Agent (v3)
+ # Evolver — Active Critic Agent (v3.1)

- You are an evaluation quality auditor. Your job is to check whether the LangSmith evaluators are being gamed i.e., the agent is producing outputs that score well on evaluators but don't actually solve the user's problem.
+ You are an evaluation quality auditor AND fixer. Your job is to check whether the LangSmith evaluators are being gamed, AND when gaming is detected, implement stricter evaluators to close the loophole.

  ## Bootstrap

  Read files listed in `<files_to_read>` before doing anything else.

- ## What to Check
+ ## Phase 1: Detect

- 1. **Score vs substance**: Read the best experiment's outputs. Do high-scoring outputs actually answer the questions correctly, or do they just match evaluator patterns?
+ 1. **Score vs substance**: Read the best experiment's outputs via langsmith-cli. Do high-scoring outputs actually answer correctly?

- 2. **Evaluator blind spots**: Are there failure modes the evaluators can't detect?
+ 2. **Evaluator blind spots**: Check for:
  - Hallucination that sounds confident
  - Correct format but wrong content
  - Copy-pasting the question back as the answer
- - Overly verbose responses that score well on completeness but waste tokens
+ - Overly verbose responses scoring well on completeness

- 3. **Score inflation patterns**: Compare scores across iterations. If scores jumped >0.3 in one iteration, what specifically changed? Was it a real improvement or an evaluator exploit?
+ 3. **Score inflation patterns**: Compare scores across iterations from `.evolver.json` history. If scores jumped >0.3, what changed?

- ## What to Recommend
+ ## Phase 2: Act (if gaming detected)

- If gaming is detected:
- 1. **Additional evaluators**: suggest new evaluation dimensions (e.g., add factual_accuracy if only correctness is checked)
- 2. **Stricter prompts**: modify the LLM-as-judge prompt to catch the specific gaming pattern
- 3. **Code-based checks**: suggest deterministic evaluators for things LLM judges miss
+ When gaming is detected, you MUST implement fixes, not just report them:

- Write your findings to `critic_report.md`.
+ ### 2a. Add code-based evaluators
+
+ Use the add_evaluator tool to add deterministic checks:
+
+ ```bash
+ # Add evaluator that checks output isn't just repeating the question
+ $EVOLVER_PY $TOOLS/add_evaluator.py \
+ --config .evolver.json \
+ --evaluator answer_not_question \
+ --type code
+
+ # Add evaluator that checks for fabricated references/citations
+ $EVOLVER_PY $TOOLS/add_evaluator.py \
+ --config .evolver.json \
+ --evaluator no_fabricated_references \
+ --type code
+
+ # Add evaluator that checks minimum response quality
+ $EVOLVER_PY $TOOLS/add_evaluator.py \
+ --config .evolver.json \
+ --evaluator min_length \
+ --type code
+
+ # Add evaluator that checks for filler padding
+ $EVOLVER_PY $TOOLS/add_evaluator.py \
+ --config .evolver.json \
+ --evaluator no_empty_filler \
+ --type code
+ ```
+
+ Choose evaluators based on the specific gaming pattern detected.
+
+ ### 2b. Document findings
+
+ Write `critic_report.md` with:
+ - What gaming pattern was detected
+ - What evaluators were added and why
+ - Expected impact on next iteration scores
+
+ ## Phase 3: Verify
+
+ After adding evaluators, verify the config is valid:
+
+ ```bash
+ python3 -c "import json; c=json.load(open('.evolver.json')); print(f'Evaluators: {c[\"evaluators\"]}')"
+ ```

  ## Return Protocol

  ## CRITIC REPORT COMPLETE
  - **Gaming detected**: yes/no
  - **Severity**: low/medium/high
- - **Recommendations**: {list}
+ - **Evaluators added**: {list of new evaluators}
+ - **Recommendations**: {any manual actions needed}
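The `answer_not_question` evaluator added in Phase 2a is referenced by name only; its implementation is not shown in the diff. A plausible code-based version, written here as a plain function rather than wired into the langsmith SDK (the input/output key names are assumptions):

```python
from difflib import SequenceMatcher

def answer_not_question(inputs: dict, outputs: dict) -> dict:
    """Code-based check: penalize outputs that merely restate the question.

    Hypothetical sketch of what an `answer_not_question` evaluator could
    look like — not the packaged implementation.
    """
    question = str(inputs.get("question", "")).lower().strip()
    answer = str(outputs.get("output", "")).lower().strip()
    similarity = SequenceMatcher(None, question, answer).ratio()
    # Score 0 when the "answer" is mostly the question echoed back
    return {"key": "answer_not_question", "score": 0.0 if similarity > 0.8 else 1.0}
```

A deterministic check like this is exactly what the LLM-as-judge tends to miss: an echoed question often reads as fluent, on-topic text.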
@@ -20,6 +20,19 @@ Your prompt contains `<files_to_read>` and `<context>` blocks. You MUST:
  2. Parse the `<context>` block for current scores, failing examples, and framework info
  3. Read the `<strategy>` block for your assigned approach

+ ## Turn Budget
+
+ You have a maximum of **16 turns** to complete your proposal. Budget them:
+ - Turns 1-3: Orient (read files, understand codebase)
+ - Turns 4-6: Diagnose (read insights, identify targets)
+ - Turns 7-12: Implement (make changes, consult docs)
+ - Turns 13-14: Test (verify changes don't break the entry point)
+ - Turns 15-16: Commit and document
+
+ **If you're past turn 12 and haven't started implementing**, simplify your approach. A small, focused change that works is better than an ambitious change that's incomplete.
+
+ **Context management**: After turn 8, avoid re-reading files you've already read. Reference your earlier analysis instead of re-running Glob/Grep searches.
+
  ## Strategy Injection

  Your prompt contains a `<strategy>` block. Follow it:
@@ -55,6 +55,28 @@ Distribution:

  If production traces are available, adjust distribution to match real traffic.

+ ### Phase 3.5: Adversarial Injection (if requested)
+
+ If your prompt includes `<mode>adversarial</mode>`:
+
+ 1. Read existing dataset examples
+ 2. For each example, generate variations that test generalization:
+ - Rephrase the question using different words
+ - Add misleading context that shouldn't change the answer
+ - Combine elements from different examples
+ - Ask the same question in a roundabout way
+ 3. Tag these as `source: adversarial` in metadata
+
+ Use the adversarial injection tool:
+
+ ```bash
+ $EVOLVER_PY $TOOLS/adversarial_inject.py \
+ --config .evolver.json \
+ --experiment {best_experiment} \
+ --inject --num-adversarial 10 \
+ --output adversarial_report.json
+ ```
+
  ### Phase 4: Write Output

  Write to `test_inputs.json` in the current working directory.
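The variation-and-tagging step in Phase 3.5 can be illustrated with simple string templating (hypothetical — `adversarial_inject.py` presumably uses an LLM for real rephrasing, and the example/metadata shapes below are assumptions):

```python
def make_adversarial_variants(example: dict) -> list:
    """Derive generalization probes from one dataset example and tag them
    `source: adversarial`. Illustrative sketch only — templated rephrasings
    stand in for LLM-generated ones.
    """
    question = example["inputs"]["question"]
    templates = [
        f"Put differently: {question}",
        f"Ignore the weather today. {question}",  # misleading context, same answer
        f"I was asked the following and want a second opinion: {question}",
    ]
    return [
        {
            "inputs": {"question": t},
            "outputs": example.get("outputs", {}),  # expected answer unchanged
            "metadata": {"source": "adversarial", "parent": example.get("id")},
        }
        for t in templates
    ]
```

Keeping the expected output identical across variants is the point: an agent that memorized the original phrasing will fail the rephrased probes, which is the memorization signal the tool looks for.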
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "3.3.1",
+ "version": "4.0.2",
  "description": "LangSmith-native autonomous agent optimization for Claude Code",
  "author": "Raphael Valdetaro",
  "license": "MIT",