harness-evolver 1.8.0 → 2.0.0
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +53 -8
- package/tools/evaluate.py +6 -1
package/package.json
CHANGED
package/skills/evolve/SKILL.md
CHANGED
@@ -36,15 +36,34 @@ python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); pri
 
 ### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
 
-**Run these commands unconditionally after EVERY evaluation** (including baseline).
+**Run these commands unconditionally after EVERY evaluation** (including baseline). Do NOT guess project names — discover them.
+
+**Step 1: Find the actual LangSmith project name**
 
 ```bash
-langsmith-cli --json
+langsmith-cli --json projects list --name-pattern "harness-evolver*" --limit 10 2>/dev/null
+```
 
-
+This returns all projects matching the prefix. Pick the most recently updated one, or the one matching the current version. Save the project name:
+
+```bash
+LS_PROJECT=$(langsmith-cli --json projects list --name-pattern "harness-evolver*" --limit 1 2>/dev/null | python3 -c "import sys,json; data=json.load(sys.stdin); print(data[0]['name'] if data else '')" 2>/dev/null || echo "")
 ```
 
-
+If `LS_PROJECT` is empty, langsmith-cli is not available or no projects exist — skip to step 2.
+
+**Step 2: Gather traces from the discovered project**
+
+```bash
+if [ -n "$LS_PROJECT" ]; then
+    langsmith-cli --json runs list --project "$LS_PROJECT" --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
+    langsmith-cli --json runs stats --project "$LS_PROJECT" > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
+    echo "$LS_PROJECT" > .harness-evolver/langsmith_project.txt
+else
+    echo "[]" > .harness-evolver/langsmith_diagnosis.json
+    echo "{}" > .harness-evolver/langsmith_stats.json
+fi
+```
 
 These files are included in the proposer's `<files_to_read>` so it has real trace data for diagnosis.
 
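The Step 1 one-liner takes the first element of the CLI's JSON output and falls back to an empty string. Below is a minimal sketch of that selection logic as a standalone function. It assumes, as the one-liner does, that `langsmith-cli --json projects list` prints a JSON array of objects with a `name` key. `pick_project_name` is a hypothetical helper for illustration and is not part of the package.

```python
import json


def pick_project_name(raw: str) -> str:
    """Return the first project's name from CLI JSON output, or '' if none.

    Assumption: the output is a JSON array of objects with a "name" key,
    which is what the shell one-liner in Step 1 relies on.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return ""  # CLI missing, or it printed nothing parseable
    if not isinstance(data, list) or not data:
        return ""  # no matching projects
    first = data[0]
    return first.get("name", "") if isinstance(first, dict) else ""


if __name__ == "__main__":
    # Hypothetical sample output, for illustration only.
    sample = '[{"name": "harness-evolver-v002"}, {"name": "harness-evolver-v001"}]'
    print(pick_project_name(sample))
    print(repr(pick_project_name("")))
```

With `--limit 1` the result is whichever project the CLI lists first. Whether that is the most recently updated project depends on the CLI's sort order, which the diff does not state.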
@@ -187,15 +206,41 @@ Agent(
 )
 ```
 
-
+**Also spawn these additional candidates:**
+
+**Candidate D (Prompt Specialist)** — `run_in_background: true`:
+Same as Exploiter but with a different focus:
+```
+<strategy>
+APPROACH: prompt-engineering
+You are the PROMPT SPECIALIST. Focus ONLY on improving the system prompt,
+few-shot examples, output format instructions, and prompt structure.
+Do NOT change the retrieval logic, pipeline structure, or code architecture.
+</strategy>
+```
+Output to: `.harness-evolver/harnesses/{version}d/`
+
+**Candidate E (Data/Retrieval Specialist)** — `run_in_background: true`:
+```
+<strategy>
+APPROACH: retrieval-optimization
+You are the RETRIEVAL SPECIALIST. Focus ONLY on improving how data is
+retrieved, filtered, ranked, and presented to the LLM.
+Do NOT change the system prompt text or output formatting.
+Improve: search logic, relevance scoring, cross-domain retrieval, chunking.
+</strategy>
+```
+Output to: `.harness-evolver/harnesses/{version}e/`
+
+Wait for all 5 to complete. The background agents will notify when done.
 
-**
+**Minimum 3 candidates ALWAYS, even on iteration 1.** On iteration 1, the crossover agent uses baseline as both parents but with instruction to "combine the best retrieval strategy with the best prompt strategy from your analysis of the baseline." On iteration 2+, crossover uses two genuinely different parents.
 
-**
+**On iteration 3+**: If scores are improving, keep all 5 strategies. If stagnating, replace Candidate D with a "Radical" strategy that rewrites the harness from scratch.
 
 ### 3. Validate All Candidates
 
-For each candidate (a, b, c):
+For each candidate (a, b, c, d, e):
 ```bash
 python3 $TOOLS/evaluate.py validate --harness .harness-evolver/harnesses/{version}{suffix}/harness.py --config .harness-evolver/harnesses/{version}{suffix}/config.json
 ```
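The validation step now covers five candidate suffixes instead of three. The sketch below shows one way to expand the `{version}{suffix}` template from the skill into one command per candidate. `build_validate_commands` is a hypothetical helper, and the version string `v003` and tools directory `tools` are illustrative assumptions; only the command shape comes from the diff.

```python
from typing import List


def build_validate_commands(version: str, tools_dir: str,
                            suffixes: str = "abcde") -> List[List[str]]:
    """Build one `evaluate.py validate` argv list per candidate suffix."""
    commands = []
    for suffix in suffixes:
        # Mirrors the path template in the skill text:
        # .harness-evolver/harnesses/{version}{suffix}/
        harness_dir = f".harness-evolver/harnesses/{version}{suffix}"
        commands.append([
            "python3", f"{tools_dir}/evaluate.py", "validate",
            "--harness", f"{harness_dir}/harness.py",
            "--config", f"{harness_dir}/config.json",
        ])
    return commands


if __name__ == "__main__":
    # "v003" and "tools" are made-up values for illustration.
    for argv in build_validate_commands("v003", "tools"):
        print(" ".join(argv))
```

Each argv list could be passed to `subprocess.run`. Building a list instead of a shell string avoids quoting problems if a path contains spaces.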
package/tools/evaluate.py
CHANGED
@@ -118,12 +118,17 @@ def cmd_run(args):
     api_key = os.environ.get(ls.get("api_key_env", "LANGSMITH_API_KEY"), "")
     if api_key:
         version = os.path.basename(os.path.dirname(traces_dir))
+        ls_project = f"{ls.get('project_prefix', 'harness-evolver')}-{version}"
         langsmith_env = {
             **os.environ,
             "LANGCHAIN_TRACING_V2": "true",
             "LANGCHAIN_API_KEY": api_key,
-            "LANGCHAIN_PROJECT":
+            "LANGCHAIN_PROJECT": ls_project,
         }
+        # Write the project name so the evolve skill knows where to find traces
+        ls_project_file = os.path.join(os.path.dirname(os.path.dirname(traces_dir)), "langsmith_project.txt")
+        with open(ls_project_file, "w") as f:
+            f.write(ls_project)
 
     for task_file in task_files:
         task_path = os.path.join(tasks_dir, task_file)
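The added lines derive both the project name and the location of `langsmith_project.txt` from `traces_dir` through nested `os.path.dirname` calls. The sketch below isolates that derivation so it can be checked against a concrete path. `langsmith_project_info` is a hypothetical wrapper, and the example directory layout is an assumption; the diff does not show how `traces_dir` is built.

```python
import os
from typing import Tuple


def langsmith_project_info(traces_dir: str,
                           project_prefix: str = "harness-evolver") -> Tuple[str, str]:
    """Reproduce the name and file-path derivation from the cmd_run hunk."""
    # The parent directory of traces_dir is treated as the version directory.
    version = os.path.basename(os.path.dirname(traces_dir))
    ls_project = f"{project_prefix}-{version}"
    # The file is written two levels above traces_dir.
    project_file = os.path.join(
        os.path.dirname(os.path.dirname(traces_dir)), "langsmith_project.txt"
    )
    return ls_project, project_file


if __name__ == "__main__":
    # Assumed layout, for illustration only.
    name, path = langsmith_project_info(".harness-evolver/harnesses/v003/traces")
    print(name)
    print(path)
```

Under this assumed layout the file lands in `.harness-evolver/harnesses/`, not in `.harness-evolver/` where Step 2 of the skill writes its copy. With a different `traces_dir` layout the two locations may coincide. A trailing slash on `traces_dir` also changes the result, because `os.path.dirname("x/v003/traces/")` returns `x/v003/traces`, which would make `version` equal to `traces`.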