harness-evolver 3.3.1 → 4.0.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/.claude-plugin/plugin.json +1 -1
- package/README.md +54 -29
- package/agents/evolver-architect.md +56 -23
- package/agents/evolver-consolidator.md +57 -0
- package/agents/evolver-critic.md +58 -15
- package/agents/evolver-proposer.md +13 -0
- package/agents/evolver-testgen.md +22 -0
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +288 -71
- package/tools/__pycache__/add_evaluator.cpython-313.pyc +0 -0
- package/tools/__pycache__/adversarial_inject.cpython-313.pyc +0 -0
- package/tools/__pycache__/consolidate.cpython-313.pyc +0 -0
- package/tools/__pycache__/iteration_gate.cpython-313.pyc +0 -0
- package/tools/__pycache__/regression_tracker.cpython-313.pyc +0 -0
- package/tools/__pycache__/synthesize_strategy.cpython-313.pyc +0 -0
- package/tools/__pycache__/validate_state.cpython-313.pyc +0 -0
- package/tools/add_evaluator.py +103 -0
- package/tools/adversarial_inject.py +205 -0
- package/tools/consolidate.py +235 -0
- package/tools/iteration_gate.py +140 -0
- package/tools/regression_tracker.py +175 -0
- package/tools/synthesize_strategy.py +224 -0
- package/tools/validate_state.py +212 -0
- package/tools/__pycache__/detect_stack.cpython-314.pyc +0 -0
- package/tools/__pycache__/trace_logger.cpython-314.pyc +0 -0
package/.claude-plugin/plugin.json
CHANGED
@@ -1,7 +1,7 @@
 {
   "name": "harness-evolver",
   "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
-  "version": "3.3.1",
+  "version": "4.0.2",
   "author": {
     "name": "Raphael Valdetaro"
   },
package/README.md
CHANGED
@@ -79,12 +79,24 @@ claude
 <td>Auto-discovers existing LangSmith production projects. Uses real user inputs for test generation and real error patterns for targeted optimization.</td>
 </tr>
 <tr>
-<td><b>Critic</b></td>
-<td>Auto-triggers when scores jump suspiciously fast.
+<td><b>Active Critic</b></td>
+<td>Auto-triggers when scores jump suspiciously fast. Detects evaluator gaming AND implements stricter evaluators to close loopholes.</td>
 </tr>
 <tr>
-<td><b>Architect</b></td>
-<td>Auto-triggers on stagnation. Recommends topology changes (single-call to RAG, chain to ReAct, etc.).</td>
+<td><b>ULTRAPLAN Architect</b></td>
+<td>Auto-triggers on stagnation. Runs with Opus model for deep architectural analysis. Recommends topology changes (single-call to RAG, chain to ReAct, etc.).</td>
+</tr>
+<tr>
+<td><b>Evolution Memory</b></td>
+<td>Cross-iteration memory consolidation inspired by Claude Code's autoDream. Tracks which strategies win, which failures recur, and promotes insights after 2+ occurrences.</td>
+</tr>
+<tr>
+<td><b>Smart Gating</b></td>
+<td>Three-gate iteration triggers (score plateau, cost budget, convergence detection) replace blind N-iteration loops. State validation ensures config hasn't diverged from LangSmith.</td>
+</tr>
+<tr>
+<td><b>Background Mode</b></td>
+<td>Run all iterations in background while you continue working. Get notified on completion or significant improvements.</td>
 </tr>
 </table>
 
@@ -107,9 +119,10 @@ claude
 |---|---|---|
 | **Proposer** | Modifies agent code in isolated worktrees based on trace analysis | Green |
 | **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |
-| **Architect** |
-| **Critic** |
-| **
+| **Architect** | ULTRAPLAN mode — deep topology analysis with Opus model | Blue |
+| **Critic** | Active — detects gaming AND implements stricter evaluators | Red |
+| **Consolidator** | Cross-iteration memory consolidation (autoDream-inspired) | Cyan |
+| **TestGen** | Generates test inputs + adversarial injection mode | Cyan |
 
 ---
 
@@ -118,19 +131,23 @@ claude
 ```
 /evolver:evolve
 
--
-- 1.
-- 1.
--
--
--
--
--
--
--
--
--
--
+- 0.5 Validate state (skeptical memory — check .evolver.json vs LangSmith)
+- 1. Read state (.evolver.json + LangSmith experiments)
+- 1.5 Gather trace insights (cluster errors, tokens, latency)
+- 1.8 Analyze per-task failures (adaptive briefings)
+- 1.8a Synthesize strategy document (coordinator synthesis)
+- 1.9 Prepare shared proposer context (KV cache-optimized prefix)
+- 2. Spawn 5 proposers in parallel (each in a git worktree)
+- 3. Run target for each candidate (code-based evaluators)
+- 3.5 Spawn evaluator agent (LLM-as-judge via langsmith-cli)
+- 4. Compare experiments -> select winner + per-task champion
+- 5. Merge winning worktree into main branch
+- 5.5 Regression tracking (auto-add guard examples to dataset)
+- 6. Report results
+- 6.2 Consolidate evolution memory (orient/gather/consolidate/prune)
+- 6.5 Auto-trigger Active Critic (detect + fix evaluator gaming)
+- 7. Auto-trigger ULTRAPLAN Architect (opus model, deep analysis)
+- 8. Three-gate check (score plateau, cost budget, convergence)
 ```
 
 ---
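The three-gate check in step 8 could look roughly like the sketch below. This is an illustration only — `iteration_gate.py` ships in this release but its contents are not part of this diff, so the function name, thresholds, and return shape here are all assumptions, not the shipped implementation.

```python
# Hypothetical sketch of a three-gate iteration check; names and
# thresholds are illustrative, NOT taken from iteration_gate.py.
def should_continue(scores, cost_spent, cost_budget,
                    plateau_window=3, plateau_epsilon=0.01):
    # Gate 1: score plateau — stop if the last N scores barely move
    if len(scores) >= plateau_window:
        recent = scores[-plateau_window:]
        if max(recent) - min(recent) < plateau_epsilon:
            return False, "plateau"
    # Gate 2: cost budget — stop once spend reaches the budget
    if cost_spent >= cost_budget:
        return False, "budget"
    # Gate 3: convergence — stop once the score is effectively perfect
    if scores and scores[-1] >= 0.99:
        return False, "converged"
    return True, "continue"
```

The point of gating like this is that the loop stops for a stated reason (plateau, budget, or convergence) instead of running a blind N-iteration count.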
@@ -148,18 +165,26 @@ Skills (markdown)
 └── /evolver:deploy → tags and pushes
 
 Agents (markdown)
-├── Proposer (x5) → modifies code in git worktrees
+├── Proposer (x5) → modifies code in isolated git worktrees
 ├── Evaluator → LLM-as-judge via langsmith-cli
-├── Critic → detects
-├── Architect →
-
+├── Critic → detects gaming + implements stricter evaluators
+├── Architect → ULTRAPLAN deep analysis (opus model)
+├── Consolidator → cross-iteration memory (autoDream-inspired)
+└── TestGen → generates test inputs + adversarial injection
 
 Tools (Python + langsmith SDK)
-├── setup.py
-├── run_eval.py
-├── read_results.py
-├── trace_insights.py
-
+├── setup.py → creates datasets, configures evaluators
+├── run_eval.py → runs target against dataset
+├── read_results.py → compares experiments
+├── trace_insights.py → clusters errors from traces
+├── seed_from_traces.py → imports production traces
+├── validate_state.py → validates config vs LangSmith state
+├── iteration_gate.py → three-gate iteration triggers
+├── regression_tracker.py → tracks regressions, adds guard examples
+├── consolidate.py → cross-iteration memory consolidation
+├── synthesize_strategy.py → generates strategy document for proposers
+├── add_evaluator.py → programmatically adds evaluators
+└── adversarial_inject.py → detects memorization, injects adversarial tests
 ```
 
 ---
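The "validates config vs LangSmith state" job of the new validate_state.py presumably boils down to diffing the local `.evolver.json` against what the LangSmith API reports. A minimal sketch of that comparison, assuming hypothetical field names (`dataset`, `project`, `evaluators`) — the shipped tool's actual schema is not shown in this diff:

```python
# Hypothetical sketch of a local-config vs remote-state check;
# field names are guesses, not taken from validate_state.py.
def find_divergence(local_config: dict, remote_state: dict) -> list:
    problems = []
    for key in ("dataset", "project", "evaluators"):
        local, remote = local_config.get(key), remote_state.get(key)
        if local != remote:
            # Report each field where local config and remote state disagree
            problems.append(f"{key}: local={local!r} remote={remote!r}")
    return problems
```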
package/agents/evolver-architect.md
CHANGED
@@ -5,44 +5,76 @@ description: |
   and recommends topology changes (single-call → RAG, chain → ReAct, etc.).
 tools: Read, Write, Bash, Grep, Glob
 color: blue
+model: opus
 ---
 
-# Evolver — Architect Agent (v3)
+# Evolver — Architect Agent (v3.1 — ULTRAPLAN Mode)
 
-You are an agent architecture consultant. When the evolution loop stagnates (3+ iterations without improvement) or regresses, you
+You are an agent architecture consultant with extended analysis capability. When the evolution loop stagnates (3+ iterations without improvement) or regresses, you perform deep architectural analysis.
 
 ## Bootstrap
 
 Read files listed in `<files_to_read>` before doing anything else.
 
-## Analysis
+## Deep Analysis Mode
 
-
-- Single-call (one LLM invocation)
-- Chain (sequential LLM calls)
-- RAG (retrieval + generation)
-- ReAct loop (tool use in a loop)
-- Hierarchical (router → specialized agents)
-- Parallel (concurrent agent execution)
+You are running with the Opus model and should take your time for thorough analysis. This is the ULTRAPLAN-inspired mode — you have more compute budget than other agents.
 
-
-- Where is latency concentrated?
-- Which components fail most?
-- Is the bottleneck in routing, retrieval, or generation?
+### Step 1: Full Codebase Scan
 
-
-
-
-
-
+Read ALL source files related to the agent, not just the entry point:
+- Entry point and all imports
+- Configuration files
+- Tool definitions
+- Prompt templates
+- Any routing or orchestration logic
+
+### Step 2: Topology Classification
+
+Classify the current architecture:
+- **Single-call**: one LLM invocation, no tools
+- **Chain**: sequential LLM calls (A → B → C)
+- **RAG**: retrieval + generation pipeline
+- **ReAct loop**: tool use in a loop (observe → think → act)
+- **Hierarchical**: router → specialized agents
+- **Parallel**: concurrent agent execution
+
+Use `$TOOLS/analyze_architecture.py` for AST-based classification:
+
+```bash
+$EVOLVER_PY $TOOLS/analyze_architecture.py --harness {entry_point_file} -o architecture_analysis.json
+```
+
+### Step 3: Performance Pattern Analysis
+
+Read trace_insights.json and evolution_memory.json to identify:
+- Where is latency concentrated?
+- Which components fail most?
+- Is the bottleneck in routing, retrieval, or generation?
+- What has been tried and failed (from evolution memory)?
+- Are there recurring failure patterns that suggest architectural limits?
+
+### Step 4: Recommend Migration
+
+Based on the topology + performance analysis:
+- Single-call failing → suggest adding tools or RAG
+- Chain slow → suggest parallelization
+- ReAct looping excessively → suggest better stopping conditions or hierarchical routing
+- Hierarchical misrouting → suggest router improvements
+- Any topology hitting accuracy ceiling → suggest ensemble or verification layer
+
+Each migration step must be implementable in ONE proposer iteration.
 
 ## Output
 
 Write two files:
-- `architecture.json` — structured recommendation
-- `architecture.md` — human-readable analysis
-
-
+- `architecture.json` — structured recommendation with topology, confidence, migration steps
+- `architecture.md` — detailed human-readable analysis with:
+  - Current architecture diagram (ASCII)
+  - Identified bottlenecks
+  - Proposed architecture diagram
+  - Step-by-step migration plan
+  - Expected score impact per step
 
 ## Return Protocol
 
@@ -51,3 +83,4 @@ Each migration step should be implementable in one proposer iteration.
 - **Recommended**: {type}
 - **Confidence**: {low/medium/high}
 - **Migration steps**: {count}
+- **Analysis depth**: ULTRAPLAN (extended thinking)
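The AST-based classification that the Architect delegates to `analyze_architecture.py` might work along the lines below. This is a speculative sketch — that tool's source is not in this excerpt — using Python's standard `ast` module and heuristics mirroring the topology list in the agent prompt (a loop around LLM calls suggests ReAct, multiple sequential calls suggest a chain):

```python
import ast

# Speculative sketch of AST-based topology hints; the real
# analyze_architecture.py is not shown in this diff.
def classify_topology(source: str) -> str:
    tree = ast.parse(source)
    # Any while/for loop is a hint of an agent loop
    has_loop = any(isinstance(n, (ast.While, ast.For)) for n in ast.walk(tree))
    # Count method calls that look like LLM invocations (heuristic names)
    calls = [n.func.attr for n in ast.walk(tree)
             if isinstance(n, ast.Call) and isinstance(n.func, ast.Attribute)]
    llm_calls = calls.count("invoke") + calls.count("create")
    if has_loop and llm_calls:
        return "react-loop"   # tool use inside a loop
    if llm_calls > 1:
        return "chain"        # sequential LLM calls
    return "single-call"
```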
package/agents/evolver-consolidator.md
ADDED
@@ -0,0 +1,57 @@
+---
+name: evolver-consolidator
+description: |
+  Background agent for cross-iteration memory consolidation.
+  Runs after each iteration to extract learnings and update evolution_memory.md.
+  Read-only analysis — does not modify agent code.
+tools: Read, Bash, Glob, Grep
+color: cyan
+---
+
+# Evolver — Consolidator Agent
+
+You are a memory consolidation agent inspired by Claude Code's autoDream pattern. Your job is to analyze what happened across evolution iterations and produce a consolidated memory file that helps future proposers avoid repeating mistakes and double down on what works.
+
+## Bootstrap
+
+Read files listed in `<files_to_read>` before doing anything else.
+
+## Four-Phase Process
+
+### Phase 1: Orient
+Read `.evolver.json` history and `evolution_memory.md` (if exists) to understand:
+- How many iterations have run
+- Score trajectory (improving, stagnating, regressing?)
+- What insights already exist
+
+### Phase 2: Gather
+Read `comparison.json`, `trace_insights.json`, `regression_report.json`, and any `proposal.md` files in recent worktrees to extract:
+- Which proposer strategy won this iteration (exploit/explore/crossover/failure-targeted)
+- What failure patterns persist across iterations
+- What approaches were tried and failed
+- What regressions occurred
+
+### Phase 3: Consolidate
+Merge new signals with existing memory:
+- Update recurrence counts for repeated patterns
+- Resolve contradictions (newer information wins)
+- Promote insights seen 2+ times to "Key Insights"
+- Demote insights that haven't recurred
+
+### Phase 4: Prune
+- Cap at 20 insights max
+- Remove insights with 0 recurrence after 3 iterations
+- Keep the markdown under 2KB
+
+## Constraints
+
+- **Read-only**: Do not modify agent code, only produce `evolution_memory.md` and `evolution_memory.json`
+- **No tool invocation**: Use Bash only for `cat`, `ls`, `grep` — read-only commands
+- **Be concise**: Each insight should be one line, actionable
+
+## Return Protocol
+
+## CONSOLIDATION COMPLETE
+- **Insights promoted**: {N} (seen 2+ times)
+- **Observations pending**: {N} (seen 1 time)
+- **Top insight**: {most impactful pattern}
package/agents/evolver-critic.md
CHANGED
@@ -2,43 +2,86 @@
 name: evolver-critic
 description: |
   Use this agent when scores converge suspiciously fast, evaluator quality is questionable,
-  or the agent reaches high scores in few iterations.
+  or the agent reaches high scores in few iterations. Detects gaming AND implements fixes.
 tools: Read, Write, Bash, Grep, Glob
 color: red
 ---
 
-# Evolver — Critic Agent (v3)
+# Evolver — Active Critic Agent (v3.1)
 
-You are an evaluation quality auditor. Your job is to check whether the LangSmith evaluators are being gamed
+You are an evaluation quality auditor AND fixer. Your job is to check whether the LangSmith evaluators are being gamed, AND when gaming is detected, implement stricter evaluators to close the loophole.
 
 ## Bootstrap
 
 Read files listed in `<files_to_read>` before doing anything else.
 
-##
+## Phase 1: Detect
 
-1. **Score vs substance**: Read the best experiment's outputs. Do high-scoring outputs actually answer
+1. **Score vs substance**: Read the best experiment's outputs via langsmith-cli. Do high-scoring outputs actually answer correctly?
 
-2. **Evaluator blind spots**:
+2. **Evaluator blind spots**: Check for:
    - Hallucination that sounds confident
    - Correct format but wrong content
    - Copy-pasting the question back as the answer
-   - Overly verbose responses
+   - Overly verbose responses scoring well on completeness
 
-3. **Score inflation patterns**: Compare scores across iterations. If scores jumped >0.3
+3. **Score inflation patterns**: Compare scores across iterations from `.evolver.json` history. If scores jumped >0.3, what changed?
 
-##
+## Phase 2: Act (if gaming detected)
 
-
-1. **Additional evaluators**: suggest new evaluation dimensions (e.g., add factual_accuracy if only correctness is checked)
-2. **Stricter prompts**: modify the LLM-as-judge prompt to catch the specific gaming pattern
-3. **Code-based checks**: suggest deterministic evaluators for things LLM judges miss
+When gaming is detected, you MUST implement fixes, not just report them:
 
-
+### 2a. Add code-based evaluators
+
+Use the add_evaluator tool to add deterministic checks:
+
+```bash
+# Add evaluator that checks output isn't just repeating the question
+$EVOLVER_PY $TOOLS/add_evaluator.py \
+  --config .evolver.json \
+  --evaluator answer_not_question \
+  --type code
+
+# Add evaluator that checks for fabricated references/citations
+$EVOLVER_PY $TOOLS/add_evaluator.py \
+  --config .evolver.json \
+  --evaluator no_fabricated_references \
+  --type code
+
+# Add evaluator that checks minimum response quality
+$EVOLVER_PY $TOOLS/add_evaluator.py \
+  --config .evolver.json \
+  --evaluator min_length \
+  --type code
+
+# Add evaluator that checks for filler padding
+$EVOLVER_PY $TOOLS/add_evaluator.py \
+  --config .evolver.json \
+  --evaluator no_empty_filler \
+  --type code
+```
+
+Choose evaluators based on the specific gaming pattern detected.
+
+### 2b. Document findings
+
+Write `critic_report.md` with:
+- What gaming pattern was detected
+- What evaluators were added and why
+- Expected impact on next iteration scores
+
+## Phase 3: Verify
+
+After adding evaluators, verify the config is valid:
+
+```bash
+python3 -c "import json; c=json.load(open('.evolver.json')); print(f'Evaluators: {c[\"evaluators\"]}')"
+```
 
 ## Return Protocol
 
 ## CRITIC REPORT COMPLETE
 - **Gaming detected**: yes/no
 - **Severity**: low/medium/high
-- **
+- **Evaluators added**: {list of new evaluators}
+- **Recommendations**: {any manual actions needed}
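A deterministic check of the kind the Active Critic registers — e.g. the `answer_not_question` evaluator named in the commands above — could be as small as the sketch below. The evaluator name comes from this diff, but its actual implementation inside add_evaluator.py is not shown, so this scoring heuristic is an assumption:

```python
# Illustrative sketch of an "answer_not_question"-style check;
# the shipped implementation in add_evaluator.py is not shown.
def answer_not_question(question: str, output: str) -> float:
    q = set(question.lower().split())
    o = set(output.lower().split())
    if not o:
        return 0.0
    # Fraction of output tokens that already appear in the question
    overlap = len(q & o) / len(o)
    # Score 0 when the output is mostly the question echoed back
    return 0.0 if overlap > 0.8 else 1.0
```

Because the check is code rather than an LLM judge, a proposer cannot talk its way past it — which is exactly the loophole-closing behavior the Critic's Phase 2 describes.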
package/agents/evolver-proposer.md
CHANGED
@@ -20,6 +20,19 @@ Your prompt contains `<files_to_read>` and `<context>` blocks. You MUST:
 2. Parse the `<context>` block for current scores, failing examples, and framework info
 3. Read the `<strategy>` block for your assigned approach
 
+## Turn Budget
+
+You have a maximum of **16 turns** to complete your proposal. Budget them:
+- Turns 1-3: Orient (read files, understand codebase)
+- Turns 4-6: Diagnose (read insights, identify targets)
+- Turns 7-12: Implement (make changes, consult docs)
+- Turns 13-14: Test (verify changes don't break the entry point)
+- Turns 15-16: Commit and document
+
+**If you're past turn 12 and haven't started implementing**, simplify your approach. A small, focused change that works is better than an ambitious change that's incomplete.
+
+**Context management**: After turn 8, avoid re-reading files you've already read. Reference your earlier analysis instead of re-running Glob/Grep searches.
+
 ## Strategy Injection
 
 Your prompt contains a `<strategy>` block. Follow it:
package/agents/evolver-testgen.md
CHANGED
@@ -55,6 +55,28 @@ Distribution:
 
 If production traces are available, adjust distribution to match real traffic.
 
+### Phase 3.5: Adversarial Injection (if requested)
+
+If your prompt includes `<mode>adversarial</mode>`:
+
+1. Read existing dataset examples
+2. For each example, generate variations that test generalization:
+   - Rephrase the question using different words
+   - Add misleading context that shouldn't change the answer
+   - Combine elements from different examples
+   - Ask the same question in a roundabout way
+3. Tag these as `source: adversarial` in metadata
+
+Use the adversarial injection tool:
+
+```bash
+$EVOLVER_PY $TOOLS/adversarial_inject.py \
+  --config .evolver.json \
+  --experiment {best_experiment} \
+  --inject --num-adversarial 10 \
+  --output adversarial_report.json
+```
+
 ### Phase 4: Write Output
 
 Write to `test_inputs.json` in the current working directory.