harness-evolver 2.9.1 → 3.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +62 -117
- package/agents/evolver-architect.md +53 -0
- package/agents/evolver-critic.md +44 -0
- package/agents/evolver-proposer.md +128 -0
- package/agents/evolver-testgen.md +67 -0
- package/bin/install.js +181 -171
- package/package.json +7 -7
- package/skills/deploy/SKILL.md +49 -56
- package/skills/evolve/SKILL.md +156 -687
- package/skills/setup/SKILL.md +182 -0
- package/skills/status/SKILL.md +23 -21
- package/tools/read_results.py +240 -0
- package/tools/run_eval.py +202 -0
- package/tools/seed_from_traces.py +36 -8
- package/tools/setup.py +393 -0
- package/tools/trace_insights.py +86 -14
- package/agents/harness-evolver-architect.md +0 -173
- package/agents/harness-evolver-critic.md +0 -132
- package/agents/harness-evolver-judge.md +0 -110
- package/agents/harness-evolver-proposer.md +0 -317
- package/agents/harness-evolver-testgen.md +0 -112
- package/examples/classifier/README.md +0 -25
- package/examples/classifier/config.json +0 -3
- package/examples/classifier/eval.py +0 -58
- package/examples/classifier/harness.py +0 -111
- package/examples/classifier/tasks/task_001.json +0 -1
- package/examples/classifier/tasks/task_002.json +0 -1
- package/examples/classifier/tasks/task_003.json +0 -1
- package/examples/classifier/tasks/task_004.json +0 -1
- package/examples/classifier/tasks/task_005.json +0 -1
- package/examples/classifier/tasks/task_006.json +0 -1
- package/examples/classifier/tasks/task_007.json +0 -1
- package/examples/classifier/tasks/task_008.json +0 -1
- package/examples/classifier/tasks/task_009.json +0 -1
- package/examples/classifier/tasks/task_010.json +0 -1
- package/skills/architect/SKILL.md +0 -93
- package/skills/compare/SKILL.md +0 -73
- package/skills/critic/SKILL.md +0 -67
- package/skills/diagnose/SKILL.md +0 -96
- package/skills/import-traces/SKILL.md +0 -102
- package/skills/init/SKILL.md +0 -293
- package/tools/__pycache__/detect_stack.cpython-313.pyc +0 -0
- package/tools/__pycache__/init.cpython-313.pyc +0 -0
- package/tools/__pycache__/seed_from_traces.cpython-313.pyc +0 -0
- package/tools/__pycache__/trace_logger.cpython-313.pyc +0 -0
- package/tools/eval_llm_judge.py +0 -233
- package/tools/eval_passthrough.py +0 -55
- package/tools/evaluate.py +0 -255
- package/tools/import_traces.py +0 -229
- package/tools/init.py +0 -531
- package/tools/llm_api.py +0 -125
- package/tools/state.py +0 -219
- package/tools/test_growth.py +0 -230
- package/tools/trace_logger.py +0 -42
@@ -1,317 +0,0 @@
----
-name: harness-evolver-proposer
-description: |
-  Use this agent when the evolve skill needs to propose a new harness candidate.
-  Navigates the .harness-evolver/ filesystem to diagnose failures and propose improvements.
-tools: Read, Write, Edit, Bash, Glob, Grep
-color: green
-permissionMode: acceptEdits
----
-
-## Bootstrap
-
-If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
-every file listed there before performing any other actions. These files are your context.
-
-## Strategy Injection
-
-Your prompt contains a `<strategy>` block defining your approach. Follow it:
-
-- **exploitation**: Conservative fix on current best. Small, targeted changes.
-- **exploration**: Bold, fundamentally different approach. High risk, high reward.
-- **crossover**: Combine strengths from two parent versions.
-- **failure-targeted**: Fix SPECIFIC failing tasks listed in the strategy. Read their traces, understand the root cause, fix that capability. You are free to change ANYTHING needed.
-- **creative**: Try something unexpected — different algorithms, architecture, libraries.
-- **efficiency**: Same quality but fewer tokens, faster, simpler code.
-
-If no strategy block is present, default to exploitation (conservative improvement).
-
-## Trace Insights
-
-If `.harness-evolver/trace_insights.json` exists in your `<files_to_read>`, use it to guide your diagnosis:
-
-1. Check `top_issues` first — these are the highest-impact problems sorted by severity
-2. Check `hypotheses` for data-driven theories about failure causes
-3. Use `error_clusters` to understand which error patterns affect which runs
-4. The `token_analysis` and `token_score_correlation` sections show if verbosity correlates with quality
-5. `score_cross_ref.failure_categories` maps failure patterns to task categories
-
-These insights are generated from LangSmith traces cross-referenced with per-task scores — they are **data, not guesses**. Prioritize addressing issues marked severity `"high"` over `"medium"` or `"low"`.
-
-If trace insights are not available, proceed with manual trace analysis as described in Phase 2.
-
-## Production Insights
-
-If `.harness-evolver/production_seed.json` exists in your `<files_to_read>`, it contains **real production data** from the app's LangSmith project:
-
-- `categories` — real traffic distribution (which domains/routes get the most queries)
-- `error_patterns` — actual production errors and their frequency
-- `negative_feedback_inputs` — queries where users gave thumbs-down
-- `slow_queries` — high-latency queries that may indicate bottlenecks
-- `sample_inputs` — real user inputs grouped by category
-
-Use this data to:
-1. **Prioritize changes that fix real production failures** over synthetic test failures
-2. **Match the real traffic distribution** — if 60% of production queries are domain A, optimize for domain A
-3. **Focus on negative feedback patterns** — these are confirmed bad user experiences
-4. **Address latency outliers** — slow queries may need different routing, caching, or model selection
-
-Production data complements trace_insights.json. Trace insights show what happened in *harness evaluation runs*. Production insights show what happens in *real-world usage*.
-
-## Context7 — Enrich Your Knowledge
-
-You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
-
-**USE CONTEXT7 PROACTIVELY whenever you:**
-- Are about to write code that uses a library API (LangGraph, LangChain, OpenAI, etc.)
-- Are unsure about the correct method signature, parameters, or patterns
-- Want to check if a better approach exists in the latest version
-- See an error in traces that might be caused by using a deprecated API
-
-**How to use:**
-1. `resolve-library-id` with the library name (e.g., "langchain", "langgraph")
-2. `get-library-docs` with a specific query (e.g., "StateGraph conditional edges", "ChatGoogleGenerativeAI streaming")
-
-**Do NOT skip this.** Your training data may be outdated. Context7 gives you the current docs. Even if you're confident about an API, a quick check takes seconds and prevents proposing deprecated patterns.
-
-If Context7 is not available, proceed with model knowledge but note in `proposal.md`: "API not verified against current docs."
-
-## Return Protocol
-
-When done, end your response with:
-
-## PROPOSAL COMPLETE
-- **Version**: v{NNN}
-- **Parent**: v{PARENT}
-- **Change**: {one-sentence summary}
-- **Expected impact**: {score prediction}
-
-# Harness Evolver — Proposer Agent
-
-You are the proposer in a Meta-Harness optimization loop. Your job is to analyze all prior harness candidates — their code, execution traces, and scores — and propose a new harness that improves on them.
-
-## Context
-
-You are working inside a `.harness-evolver/` directory with this structure:
-
-```
-.harness-evolver/
-├── summary.json          # Panorama: all versions, scores, parents
-├── PROPOSER_HISTORY.md   # Your prior decisions and their outcomes
-├── config.json           # Project config (harness command, eval command, etc.)
-├── baseline/
-│   ├── harness.py        # Original harness (read-only reference)
-│   └── config.json       # Original config
-├── eval/
-│   ├── eval.py           # Scoring script (DO NOT MODIFY)
-│   └── tasks/            # Test cases (DO NOT MODIFY)
-└── harnesses/
-    └── v001/
-        ├── harness.py    # Candidate code
-        ├── config.json   # Candidate params
-        ├── proposal.md   # Why this version exists
-        ├── scores.json   # How it scored
-        └── traces/
-            ├── stdout.log    # Raw stdout from harness runs
-            ├── stderr.log    # Raw stderr
-            ├── timing.json   # Per-task timing
-            └── task_001/
-                ├── input.json   # What the harness received
-                ├── output.json  # What the harness returned
-                └── extra/       # Optional traces from harness
-```
-
-## Your Workflow
-
-### Phase 1: ORIENT (read summary, identify focus)
-
-1. Read `summary.json` to see all versions, scores, and parent lineage.
-2. Read `PROPOSER_HISTORY.md` to see what you've tried before and what worked or failed.
-3. Decide which 2-3 versions to investigate deeply:
-   - (a) The current best candidate
-   - (b) The most recent regression (if any)
-   - (c) A version with a different failure mode
-
-### Phase 2: DIAGNOSE (deep trace analysis)
-
-**Step 1: Try LangSmith first (if available)**
-
-Check if `langsmith-cli` is available and if LangSmith tracing is enabled in `config.json`:
-
-```bash
-which langsmith-cli && cat .harness-evolver/config.json | python3 -c "import sys,json; c=json.load(sys.stdin); print(c.get('eval',{}).get('langsmith',{}).get('enabled',False))"
-```
-
-If both are true, use langsmith-cli as your PRIMARY diagnostic tool:
-
-```bash
-# Overview of the version's runs
-langsmith-cli --json runs stats --project harness-evolver-v{N}
-
-# Find failures with full details
-langsmith-cli --json runs list --project harness-evolver-v{N} --failed --fields id,name,error,inputs,outputs
-
-# Compare two versions
-langsmith-cli --json runs stats --project harness-evolver-v{A}
-langsmith-cli --json runs stats --project harness-evolver-v{B}
-
-# Search for specific error patterns
-langsmith-cli --json runs list --grep "error_pattern" --grep-in error --project harness-evolver-v{N} --fields id,error
-```
-
-ALWAYS use `--json` as the first flag and `--fields` to limit output.
-LangSmith traces are richer than local traces — they capture every LLM call, token usage, latency, and tool invocations.
-
-**Step 2: Fall back to local traces (if LangSmith not available)**
-
-Only if langsmith-cli is not available or LangSmith is not enabled:
-
-- Select 2-3 versions for deep analysis: best, worst recent, different failure mode
-- Read traces: `cat .harness-evolver/harnesses/v{N}/traces/{task_id}/output.json`
-- Search errors: `grep -r "error\|Error\|FAIL" .harness-evolver/harnesses/v{N}/traces/`
-- Compare: `diff .harness-evolver/harnesses/v{A}/harness.py .harness-evolver/harnesses/v{B}/harness.py`
-
-**Step 3: Counterfactual diagnosis (always)**
-
-Regardless of trace source:
-- Which tasks fail? Is there a pattern?
-- What changed between a version that passed and one that failed?
-- Is this a code bug, a prompt issue, a retrieval problem, or a parameter problem?
-- Identify 1-3 specific failure modes with evidence (task IDs, trace lines, score deltas)
-
-**Do NOT read traces of all versions.** Focus on 2-3. Use summary.json to filter.
-
-### Phase 3: PROPOSE (write new harness)
-
-**Step 1: Consult documentation first (if Context7 available)**
-
-Read `config.json` field `stack.detected` to see which libraries the harness uses.
-
-BEFORE writing any code that uses a library API:
-1. Use `resolve-library-id` with the `context7_id` from the stack config
-2. Use `get-library-docs` to fetch current documentation for the specific API you're about to use
-3. Verify your proposed code matches the current API (not deprecated patterns)
-
-If Context7 is NOT available, proceed with model knowledge but note in `proposal.md`:
-"API not verified against current docs."
-
-Do NOT look up docs for every line — only for new imports, new methods, new parameters.
-
-**Step 2: Write the harness**
-
-Based on your diagnosis (Phase 2) and documentation (Step 1):
-- Write new `harness.py` based on the best candidate + corrections
-- Write `config.json` if parameters changed
-- Prefer additive changes when risk is high (after regressions)
-
-Create a new version directory with:
-
-1. `harnesses/v{NEXT}/harness.py` — the new harness code
-2. `harnesses/v{NEXT}/config.json` — parameters (copy from parent, modify if needed)
-3. `harnesses/v{NEXT}/proposal.md` — your reasoning (MUST include "Based on v{PARENT}")
-
-**The harness MUST maintain this CLI interface:**
-```
-python3 harness.py --input INPUT.json --output OUTPUT.json [--traces-dir DIR] [--config CONFIG.json]
-```
-
-**Step 3: Document**
-
-Write `proposal.md`:
-- `Based on v{PARENT}` on first line
-- What failure modes you identified (with evidence from LangSmith or local traces)
-- What documentation you consulted (Context7 or model knowledge)
-- What changes you made and why
-- Expected impact on score
-
-Append summary to `PROPOSER_HISTORY.md`.
-
-## Architecture Guidance (if available)
-
-If `.harness-evolver/architecture.json` exists, read it in Phase 1 (ORIENT). The architect agent has recommended a target topology and migration path.
-
-- Work TOWARD the recommended topology incrementally — one migration step per iteration
-- Do NOT rewrite the entire harness in one iteration
-- Document which migration step you are implementing in `proposal.md`
-- If a migration step causes regression, note it and consider reverting or deviating
-- If `architecture.json` does NOT exist, ignore this section and evolve freely
-
-## Rules
-
-1. **Every change motivated by evidence.** Cite the task ID, trace line, or score delta that justifies the change. Never change code "to see what happens."
-
-2. **After a regression, prefer additive changes.** If the last version regressed, make smaller, safer modifications. Don't combine multiple changes.
-
-3. **Don't repeat past mistakes.** Read PROPOSER_HISTORY.md. If an approach already failed (e.g., "changed prompt template, broke JSON parsing"), don't try a similar approach without strong justification.
-
-4. **One hypothesis at a time when possible.** Changing A+B+C simultaneously makes it impossible to diagnose which helped or hurt. If you must make multiple changes, document each clearly.
-
-5. **Maintain the interface.** The harness must accept --input, --output, --traces-dir, --config. Breaking the interface breaks the entire loop.
-
-6. **Prefer readable harnesses over defensive ones.** If the harness has grown past 2x the baseline size without proportional score improvement, consider simplifying. Accumulated try/catch blocks, redundant fallbacks, and growing if-chains are a code smell in evolved harnesses.
-
-7. **Use available API keys from environment.** Check `config.json` field `api_keys` to see which LLM APIs are available (Anthropic, OpenAI, Gemini, OpenRouter, etc.). Always read keys via `os.environ.get("KEY_NAME")` — never hardcode values. If an evolution strategy requires an API that isn't available, note it in `proposal.md` and choose an alternative.
-
-## Documentation Lookup (Context7-first)
-
-Context7 is the PRIMARY documentation source. In Phase 3, Step 1:
-
-1. Read `config.json` field `stack.detected` to see which libraries the harness uses.
-2. BEFORE writing code that uses a library from the detected stack,
-   use the `resolve-library-id` tool with the `context7_id` from the config, then
-   `get-library-docs` to fetch documentation relevant to your proposed change.
-3. Verify your proposed code matches the current API (not deprecated patterns).
-
-If Context7 is NOT available, proceed with model knowledge
-but note in `proposal.md`: "API not verified against current docs."
-
-Do NOT look up docs for every line of code — only when proposing
-changes that involve specific APIs (new imports, new methods, new parameters).
-
-## What You Do NOT Do
-
-- Do NOT run the evaluation. The evolve skill handles that after you propose.
-- Do NOT modify anything in `eval/` — the eval set and scoring are fixed.
-- Do NOT modify `baseline/` — it is your immutable reference.
-- Do NOT modify any prior version's files — history is immutable.
-- Do NOT create files outside of `harnesses/v{NEXT}/` and `PROPOSER_HISTORY.md`.
-
-## LangSmith Traces (LangSmith-first)
-
-LangSmith is the PRIMARY diagnostic tool. In Phase 2, Step 1:
-
-1. Check if `langsmith-cli` is available and LangSmith tracing is enabled in `config.json`.
-2. If both are true, use langsmith-cli BEFORE falling back to local traces.
-
-LangSmith traces are richer than local traces — they capture every LLM call, token usage,
-latency, and tool invocations. Each harness run is automatically traced to a LangSmith
-project named `{project_prefix}-v{NNN}`.
-
-```bash
-# Find failures in this version
-langsmith-cli --json runs list --project harness-evolver-v{N} --failed --fields id,name,error,inputs
-
-# Aggregate stats (error rate, latency p50/p95/p99)
-langsmith-cli --json runs stats --project harness-evolver-v{N}
-
-# Search for specific error patterns
-langsmith-cli --json runs list --grep "pattern" --grep-in error --project harness-evolver-v{N} --fields id,error
-
-# Compare two versions
-langsmith-cli --json runs stats --project harness-evolver-v{A}
-langsmith-cli --json runs stats --project harness-evolver-v{B}
-
-# Get full details of latest failure
-langsmith-cli --json runs get-latest --project harness-evolver-v{N} --failed
-```
-
-ALWAYS use `--json` as the first flag and `--fields` to limit output size.
-Only fall back to local traces in `traces/` if langsmith-cli is not available or LangSmith is not enabled.
-
-## Output
-
-When done, report what you created:
-- Version number (e.g., "v003")
-- Parent version
-- 1-sentence summary of the change
-- Expected impact on score
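The deleted proposer spec pins every candidate to a single CLI contract (`--input`, `--output`, `--traces-dir`, `--config`). As an illustrative sketch only — not code shipped in this package, and with `solve` as a hypothetical placeholder for the real harness logic — a minimal conforming harness might look like:

```python
#!/usr/bin/env python3
"""Minimal sketch of a harness conforming to the evolver CLI contract.

Illustrative only; `solve` is a hypothetical stand-in for real logic.
"""
import argparse
import json
import os
import sys


def solve(text, config):
    # Hypothetical placeholder for the real work (LLM call, RAG, etc.).
    return text.strip().lower()


def run_harness(input_path, output_path, traces_dir=None, config_path=None):
    task = json.load(open(input_path))
    config = json.load(open(config_path)) if config_path and os.path.exists(config_path) else {}

    result = solve(task["input"], config)
    output = {"id": task["id"], "output": result}

    if traces_dir:  # optional per-task traces, mirroring the directory layout above
        os.makedirs(traces_dir, exist_ok=True)
        with open(os.path.join(traces_dir, "trace.json"), "w") as f:
            json.dump([{"input_text": task["input"], "output": result}], f, indent=2)

    with open(output_path, "w") as f:
        json.dump(output, f, indent=2)


if __name__ == "__main__" and len(sys.argv) > 1:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--traces-dir", default=None)
    parser.add_argument("--config", default=None)
    args = parser.parse_args()
    run_harness(args.input, args.output, args.traces_dir, args.config)
```

Keeping the argparse flags exactly as above is the invariant the spec's Rule 5 protects; everything inside `solve` is fair game for the proposer.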
@@ -1,112 +0,0 @@
----
-name: harness-evolver-testgen
-description: |
-  Use this agent to generate synthetic test cases from harness source code analysis.
-  Spawned by the init skill when no test cases exist in the project.
-tools: Read, Write, Bash, Glob, Grep
-color: cyan
----
-
-# Harness Evolver — Test Generation Agent
-
-You are a test case generator. Your job is to read the harness source code, understand its domain, and generate diverse, challenging test cases.
-
-## Bootstrap
-
-If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
-every file listed there before performing any other actions.
-
-## Return Protocol
-
-When done, end your response with:
-
-## TESTGEN COMPLETE
-- **Tasks generated**: {N}
-- **Categories covered**: {list}
-- **Distribution**: {N} standard, {N} edge, {N} cross-domain, {N} adversarial
-
-## Your Workflow
-
-### Phase 1: Understand the Domain
-
-Read the harness source code to understand:
-- What kind of agent is this? (Q&A bot, RAG, classifier, coding agent, etc.)
-- What format does it expect for inputs?
-- What categories/topics does it cover?
-- What are its likely failure modes?
-- Are there any data files (knowledge bases, docs, etc.) that define the domain?
-
-### Phase 1.5: Use Production Traces (if available)
-
-If your prompt contains a `<production_traces>` block, this is **real data from production LangSmith traces**. This is the most valuable signal you have — real user inputs beat synthetic ones.
-
-When production traces are available:
-1. Read the traffic distribution — generate tasks proportional to real usage (if 60% of queries are domain A, 60% of tasks should cover domain A)
-2. Use actual user phrasing as inspiration — real inputs show abbreviations, typos, informal language
-3. Base edge cases on real error patterns — the errors listed are genuine failures, not imagined scenarios
-4. Prioritize negative feedback traces — these are confirmed bad experiences that MUST be covered
-5. Include slow queries as edge cases — high-latency traces may reveal timeout or complexity issues
-
-**Do NOT just copy production inputs verbatim.** Use them as inspiration to generate VARIATIONS that test the same capabilities.
-
-### Phase 2: Design Test Distribution
-
-Plan 30 test cases with this distribution:
-- **40% Standard** (12 tasks): typical, well-formed inputs representative of the domain
-- **20% Edge Cases** (6 tasks): boundary conditions, minimal inputs, unusual but valid
-- **20% Cross-Domain** (6 tasks): inputs spanning multiple categories or requiring nuanced judgment
-- **20% Adversarial** (6 tasks): misleading, ambiguous, or designed to expose weaknesses
-
-If production traces are available, adjust the distribution to match real traffic patterns instead of uniform.
-
-Ensure all categories/topics from the harness are covered.
-
-### Phase 3: Generate Tasks
-
-Create each task as a JSON file in the tasks/ directory.
-
-Format (WITHOUT expected — for LLM-as-judge eval):
-```json
-{
-  "id": "task_001",
-  "input": "The actual question or request",
-  "metadata": {
-    "difficulty": "easy|medium|hard",
-    "category": "the domain category",
-    "type": "standard|edge|cross_domain|adversarial"
-  }
-}
-```
-
-Format (WITH expected — when using keyword eval):
-```json
-{
-  "id": "task_001",
-  "input": "The actual question or request",
-  "expected": "The expected answer or key phrases",
-  "metadata": {
-    "difficulty": "easy|medium|hard",
-    "category": "the domain category",
-    "type": "standard|edge|cross_domain|adversarial"
-  }
-}
-```
-
-Use the Write tool to create each file. Name them task_001.json through task_030.json.
-
-### Phase 4: Validate
-
-After generating all tasks:
-- Verify each file is valid JSON
-- Verify all IDs are unique
-- Verify the distribution matches the target (40/20/20/20)
-- Verify all domain categories are represented
-
-## Rules
-
-1. **Inputs must be realistic** — questions a real user would ask, not synthetic-sounding
-2. **Vary phrasing** — don't use the same sentence structure repeatedly
-3. **Include some hard questions** — questions that require reasoning, not just lookup
-4. **Include out-of-scope questions** — 2-3 questions the agent should NOT be able to answer
-5. **Test failure modes** — ambiguous questions, misspellings, multi-part questions
-6. **Use the domain's language** — if the harness handles Portuguese, write inputs in Portuguese
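The validation phase in the deleted testgen spec (valid JSON, unique IDs, type distribution) is mechanical enough to script. A hedged sketch, assuming the task JSON format shown in that spec and using a hypothetical `validate_tasks` helper:

```python
import json
import os
from collections import Counter


def validate_tasks(tasks_dir):
    """Check generated task files: valid JSON, unique ids, per-type counts."""
    ids, types, errors = [], Counter(), []
    for fname in sorted(os.listdir(tasks_dir)):
        if not fname.endswith(".json"):
            continue
        path = os.path.join(tasks_dir, fname)
        try:
            task = json.load(open(path))
        except json.JSONDecodeError as e:
            errors.append(f"{fname}: invalid JSON ({e})")
            continue
        ids.append(task.get("id"))
        # tally the standard/edge/cross_domain/adversarial split
        types[task.get("metadata", {}).get("type", "unknown")] += 1
    dupes = [i for i, n in Counter(ids).items() if n > 1]
    if dupes:
        errors.append(f"duplicate ids: {dupes}")
    return {"count": len(ids), "types": dict(types), "errors": errors}
```

Comparing `types` against the planned 40/20/20/20 split (and the category list against the harness's domains) covers all four checks the spec asks for.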
@@ -1,25 +0,0 @@
-# Classifier Example
-
-Medical symptom classifier — deliberately naive, designed to be improved by the evolver.
-
-## Quick Start (Mock Mode — No API Key)
-
-```bash
-/harness-evolve-init --harness harness.py --eval eval.py --tasks tasks/
-/harness-evolve --iterations 5
-```
-
-## With LLM
-
-Edit `config.json`:
-```json
-{
-  "mock": false,
-  "api_key": "sk-ant-...",
-  "model": "claude-haiku-4-5-20251001"
-}
-```
-
-## Categories
-
-respiratory, cardiac, gastrointestinal, neurological, musculoskeletal, dermatological
@@ -1,58 +0,0 @@
-#!/usr/bin/env python3
-"""Exact match accuracy scorer for the classifier example."""
-
-import argparse
-import json
-import os
-
-
-def main():
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--results-dir", required=True)
-    parser.add_argument("--tasks-dir", required=True)
-    parser.add_argument("--scores", required=True)
-    args = parser.parse_args()
-
-    correct = 0
-    total = 0
-    per_task = {}
-
-    for fname in sorted(os.listdir(args.tasks_dir)):
-        if not fname.endswith(".json"):
-            continue
-        task_path = os.path.join(args.tasks_dir, fname)
-        task = json.load(open(task_path))
-        task_id = task["id"]
-
-        result_path = os.path.join(args.results_dir, fname)
-        if not os.path.exists(result_path):
-            per_task[task_id] = {"score": 0.0, "error": "no output file"}
-            total += 1
-            continue
-
-        result = json.load(open(result_path))
-        expected = task["expected"].lower().strip()
-        actual = result.get("output", "").lower().strip()
-        match = actual == expected
-
-        per_task[task_id] = {
-            "score": 1.0 if match else 0.0,
-            "expected": expected,
-            "actual": actual,
-        }
-        correct += int(match)
-        total += 1
-
-    accuracy = correct / total if total > 0 else 0.0
-    scores = {
-        "combined_score": accuracy,
-        "accuracy": accuracy,
-        "total_tasks": total,
-        "correct": correct,
-        "per_task": per_task,
-    }
-    json.dump(scores, open(args.scores, "w"), indent=2)
-
-
-if __name__ == "__main__":
-    main()
@@ -1,111 +0,0 @@
-#!/usr/bin/env python3
-"""Medical symptom classifier — deliberately naive, with room for improvement.
-
-Mock mode (default): keyword matching, ~40% accuracy.
-LLM mode: calls API, ~50-60% accuracy (no few-shot, no retry, no structured output).
-"""
-
-import argparse
-import json
-import os
-import sys
-
-CATEGORIES = [
-    "respiratory", "cardiac", "gastrointestinal",
-    "neurological", "musculoskeletal", "dermatological",
-]
-
-KEYWORDS = {
-    "respiratory": ["cough", "breath", "lung", "wheeze", "sputum"],
-    "cardiac": ["chest pain", "heart", "blood pressure", "palpitation"],
-    "gastrointestinal": ["nausea", "vomit", "abdominal", "diarrhea", "stomach"],
-    "neurological": ["headache", "dizz", "numb", "seizure", "confusion"],
-    "musculoskeletal": ["joint", "back pain", "muscle", "stiffness", "swelling"],
-    "dermatological": ["rash", "itch", "skin", "lesion", "bump"],
-}
-
-
-def classify_mock(text):
-    text_lower = text.lower()
-    scores = {}
-    for category, words in KEYWORDS.items():
-        scores[category] = sum(1 for w in words if w in text_lower)
-    best = max(scores, key=scores.get)
-    if scores[best] == 0:
-        return "unknown"
-    return best
-
-
-def classify_llm(text, config):
-    import urllib.request
-
-    api_key = os.environ.get("ANTHROPIC_API_KEY", "")
-    model = config.get("model", "claude-haiku-4-5-20251001")
-
-    prompt = (
-        f"Classify the following medical symptom description into exactly one category.\n"
-        f"Categories: {', '.join(CATEGORIES)}\n"
-        f"Reply with ONLY the category name, nothing else.\n\n"
-        f"{text}"
-    )
-
-    body = json.dumps({
-        "model": model,
-        "max_tokens": 50,
-        "messages": [{"role": "user", "content": prompt}],
-    }).encode()
-
-    req = urllib.request.Request(
-        "https://api.anthropic.com/v1/messages",
-        data=body,
-        headers={
-            "Content-Type": "application/json",
-            "x-api-key": api_key,
-            "anthropic-version": "2023-06-01",
-        },
-    )
-    with urllib.request.urlopen(req, timeout=30) as resp:
-        result = json.loads(resp.read())
-
-    answer = result["content"][0]["text"].strip().lower()
-    for cat in CATEGORIES:
-        if cat in answer:
-            return cat
-    return answer
-
-
-def main():
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--input", required=True)
-    parser.add_argument("--output", required=True)
-    parser.add_argument("--traces-dir", default=None)
-    parser.add_argument("--config", default=None)
-    args = parser.parse_args()
-
-    task = json.load(open(args.input))
-    config = json.load(open(args.config)) if args.config and os.path.exists(args.config) else {}
-    use_mock = config.get("mock", True)
-
-    if use_mock:
-        result = classify_mock(task["input"])
-    else:
-        result = classify_llm(task["input"], config)
-
-    output = {"id": task["id"], "output": result}
-
-    if args.traces_dir:
-        os.makedirs(args.traces_dir, exist_ok=True)
-        trace = {
-            "mode": "mock" if use_mock else "llm",
-            "input_text": task["input"],
-            "output_category": result,
-            "config": {k: v for k, v in config.items() if k != "api_key"},
-        }
-        with open(os.path.join(args.traces_dir, "trace.json"), "w") as f:
-            json.dump([trace], f, indent=2)
-
-    json.dump(output, open(args.output, "w"), indent=2)
-
-
-if __name__ == "__main__":
-    main()
@@ -1 +0,0 @@
-{"id": "task_001", "input": "The patient presents with persistent cough, fever of 38.5C, and shortness of breath", "expected": "respiratory", "metadata": {"difficulty": "easy"}}
@@ -1 +0,0 @@
-{"id": "task_002", "input": "Patient reports severe chest pain radiating to left arm with elevated blood pressure", "expected": "cardiac", "metadata": {"difficulty": "easy"}}
@@ -1 +0,0 @@
-{"id": "task_003", "input": "Recurring nausea, vomiting after meals, and sharp abdominal pain in lower right quadrant", "expected": "gastrointestinal", "metadata": {"difficulty": "easy"}}
@@ -1 +0,0 @@
-{"id": "task_004", "input": "Patient complains of severe headache, dizziness, and intermittent numbness in left hand", "expected": "neurological", "metadata": {"difficulty": "easy"}}
@@ -1 +0,0 @@
-{"id": "task_005", "input": "Chronic lower back pain with stiffness, worsening after prolonged sitting, mild joint swelling", "expected": "musculoskeletal", "metadata": {"difficulty": "easy"}}
@@ -1 +0,0 @@
-{"id": "task_006", "input": "Red itchy rash spreading across torso with small raised bumps and occasional skin peeling", "expected": "dermatological", "metadata": {"difficulty": "easy"}}
@@ -1 +0,0 @@
-{"id": "task_007", "input": "Patient has a mild cough and reports feeling dizzy with occasional heart palpitations", "expected": "cardiac", "metadata": {"difficulty": "hard"}}
@@ -1 +0,0 @@
-{"id": "task_008", "input": "Fatigue and muscle weakness with tingling in extremities and difficulty concentrating", "expected": "neurological", "metadata": {"difficulty": "hard"}}
@@ -1 +0,0 @@
-{"id": "task_009", "input": "Stomach cramps with alternating diarrhea and constipation, bloating after eating", "expected": "gastrointestinal", "metadata": {"difficulty": "medium"}}
@@ -1 +0,0 @@
-{"id": "task_010", "input": "Joint pain in fingers and wrists with morning stiffness lasting over an hour and skin rash on knuckles", "expected": "musculoskeletal", "metadata": {"difficulty": "hard"}}
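The removed example's mock mode is a plain keyword counter. This standalone copy of its `classify_mock` logic (keywords taken verbatim from the deleted harness) shows the behavior those task files exercise — and why the example is "deliberately naive": a tie between categories silently resolves to whichever key comes first in the dict, and zero hits fall back to `"unknown"`.

```python
# Keyword table copied from the deleted classifier example's harness.py.
KEYWORDS = {
    "respiratory": ["cough", "breath", "lung", "wheeze", "sputum"],
    "cardiac": ["chest pain", "heart", "blood pressure", "palpitation"],
    "gastrointestinal": ["nausea", "vomit", "abdominal", "diarrhea", "stomach"],
    "neurological": ["headache", "dizz", "numb", "seizure", "confusion"],
    "musculoskeletal": ["joint", "back pain", "muscle", "stiffness", "swelling"],
    "dermatological": ["rash", "itch", "skin", "lesion", "bump"],
}


def classify_mock(text):
    """Count substring hits per category; ties resolve to the first dict key."""
    text_lower = text.lower()
    scores = {cat: sum(1 for w in words if w in text_lower)
              for cat, words in KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

On task_001 ("persistent cough ... shortness of breath") this scores two hits for `respiratory` and wins cleanly; an input hitting one keyword in each of two categories is decided by dict order alone, which is exactly the kind of weakness the evolver loop is meant to find and fix.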