harness-evolver 2.8.1 → 2.9.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/package.json +1 -1
- package/skills/deploy/SKILL.md +36 -7
- package/skills/evolve/SKILL.md +30 -1
- package/skills/init/SKILL.md +65 -3
package/package.json
CHANGED
package/skills/deploy/SKILL.md
CHANGED

@@ -2,7 +2,7 @@
 name: harness-evolver:deploy
 description: "Use when the user wants to use the best evolved harness in their project, promote a version to production, copy the winning harness back, or is done evolving and wants to apply the result."
 argument-hint: "[version]"
-allowed-tools: [Read, Write, Bash, Glob]
+allowed-tools: [Read, Write, Bash, Glob, AskUserQuestion]
 ---
 
 # /harness-evolver:deploy
@@ -32,19 +32,48 @@ cat .harness-evolver/harnesses/{version}/scores.json
 
 Report: version, score, improvement over baseline, what changed.
 
-### 3. Ask
+### 3. Ask Deploy Options (Interactive)
 
-
-
+Use AskUserQuestion with TWO questions:
+
+```
+Question 1: "Where should the evolved harness go?"
+Header: "Deploy to"
+Options:
+- "Overwrite original" — Replace {original_harness_path} with the evolved version
+- "Copy to new file" — Save as harness_evolved.py alongside the original
+- "Just show the diff" — Don't copy anything, just show what changed
+```
+
+```
+Question 2 (ONLY if user chose "Overwrite original"):
+"Back up the current harness before overwriting?"
+Header: "Backup"
+Options:
+- "Yes, backup first" — Save current as {harness}.bak before overwriting
+- "No, just overwrite" — Replace directly (git history has the original)
+```
 
 ### 4. Copy Files
 
+Based on the user's choices:
+
+**If "Overwrite original"**:
+- If backup: `cp {original_harness} {original_harness}.bak`
+- Then: `cp .harness-evolver/harnesses/{version}/harness.py {original_harness}`
+- Copy config.json if exists
+
+**If "Copy to new file"**:
 ```bash
-cp .harness-evolver/harnesses/{version}/harness.py ./
-cp .harness-evolver/harnesses/{version}/config.json ./
+cp .harness-evolver/harnesses/{version}/harness.py ./harness_evolved.py
+cp .harness-evolver/harnesses/{version}/config.json ./config_evolved.json # if exists
 ```
 
-If
+**If "Just show the diff"**:
+```bash
+diff {original_harness} .harness-evolver/harnesses/{version}/harness.py
+```
+Do not copy anything.
 
 ### 5. Report
 
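The new branching in "4. Copy Files" can be sketched as a small pure function that maps the user's answers to the copy operations the skill would run. This is a minimal illustration, not part of the package: the function name `deploy_plan`, its signature, and the `(src, dst)` return shape are all assumptions.

```python
from pathlib import Path

def deploy_plan(version: str, original: str, mode: str, backup: bool = False) -> list[tuple[str, str]]:
    """Return the (src, dst) copy operations implied by the user's deploy answers.

    Illustrative sketch only: name, signature, and return shape are not from
    the package; the paths mirror the SKILL.md text.
    """
    evolved = f".harness-evolver/harnesses/{version}/harness.py"
    if mode == "overwrite":
        ops = []
        if backup:
            ops.append((original, original + ".bak"))  # backup first
        ops.append((evolved, original))                # then overwrite
        return ops
    if mode == "copy":
        # save alongside the original as harness_evolved.py
        target = str(Path(original).with_name("harness_evolved.py"))
        return [(evolved, target)]
    return []  # "Just show the diff": nothing is copied
```

A caller would feed each returned pair to `cp` (or `shutil.copy`); the "diff" mode returns an empty plan because the skill only prints a diff in that case.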
package/skills/evolve/SKILL.md
CHANGED

@@ -2,7 +2,7 @@
 name: harness-evolver:evolve
 description: "Use when the user wants to run the optimization loop, improve harness performance, evolve the harness, or iterate on harness quality. Requires .harness-evolver/ to exist (run harness-evolver:init first)."
 argument-hint: "[--iterations N]"
-allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
+allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion]
 ---
 
 # /harness-evolver:evolve
@@ -24,6 +24,35 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo
 - `--iterations N` (default: 10)
 - Read `config.json` for `evolution.stagnation_limit` (default: 3) and `evolution.target_score`
 
+## Pre-Loop: Interactive Configuration
+
+If no `--iterations` argument was provided, ask the user interactively:
+
+Use AskUserQuestion with TWO questions:
+
+```
+Question 1: "How many evolution iterations?"
+Header: "Iterations"
+Options:
+- "3 (quick)" — Fast exploration, good for testing setup
+- "5 (balanced)" — Good trade-off between speed and quality
+- "10 (thorough)" — Deep optimization, takes longer
+
+Question 2: "Stop early if score reaches?"
+Header: "Target"
+Options:
+- "0.8 (good enough)" — Stop when the harness is reasonably good
+- "0.9 (high quality)" — Stop when quality is high
+- "0.95 (near perfect)" — Push for near-perfect scores
+- "No limit" — Run all iterations regardless of score
+```
+
+Apply the answers:
+- Set iterations from question 1 (3, 5, or 10)
+- Set target_score from question 2 (0.8, 0.9, 0.95, or None)
+
+If `--iterations` WAS provided as argument, skip these questions and use the argument value.
+
 ## The Loop
 
 For each iteration:
package/skills/init/SKILL.md
CHANGED

@@ -2,7 +2,7 @@
 name: harness-evolver:init
 description: "Use when the user wants to set up harness optimization in their project, optimize an LLM agent, improve a harness, or mentions harness-evolver for the first time in a project without .harness-evolver/ directory."
 argument-hint: "[directory]"
-allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
+allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion]
 ---
 
 # /harness-evolve-init
@@ -30,15 +30,77 @@ Look for:
 - Existing tasks: directories with JSON files containing `id` + `input` fields
 - Config: `config.json`, `config.yaml`, `.env`
 
+## Phase 1.5: Confirm Detection (Interactive)
+
+After scanning, present what was found and ask the user to confirm before proceeding.
+
+Use AskUserQuestion:
+
+```
+Question: "Here's what I detected. Does this look right?"
+Header: "Confirm"
+Options:
+- "Looks good, proceed" — Continue with detected paths
+- "Let me adjust paths" — User will provide correct paths
+- "Start over in different directory" — Abort and let user cd elsewhere
+
+Show in the question description:
+- Harness: {path or "not found"}
+- Eval: {path or "not found — will use LLM-as-judge"}
+- Tasks: {path with N files, or "not found — will generate"}
+- Stack: {detected frameworks or "none detected"}
+- Architecture: {topology or "unknown"}
+```
+
+If user chose "Let me adjust paths", ask which paths to change and update accordingly.
+
+## Phase 1.8: Eval Mode (Interactive — only if NO eval found)
+
+If no eval.py was detected, ask the user which evaluation mode to use:
+
+```
+Question: "No eval script found. How should outputs be scored?"
+Header: "Eval mode"
+Options:
+- "LLM-as-judge (zero-config)" — Claude Code scores outputs on accuracy, completeness, relevance, hallucination. No expected answers needed.
+- "Keyword matching" — Simple string matching against expected answers. Requires 'expected' field in tasks.
+- "I'll provide my own eval.py" — Pause and let user create their eval script.
+```
+
+If "LLM-as-judge": copy eval_passthrough.py as eval.py.
+If "Keyword matching": create a simple keyword eval (check if expected substrings appear in output).
+If "I'll provide my own": print instructions for the eval contract and wait.
+
+## Phase 1.9: LangSmith Project (Interactive — only if LANGSMITH_API_KEY detected)
+
+If a LangSmith API key is available, discover projects and ask which one has production traces:
+
+```bash
+langsmith-cli --json projects list --limit 10 2>/dev/null
+```
+
+Use AskUserQuestion:
+```
+Question: "LangSmith detected. Which project has your production traces?"
+Header: "LangSmith"
+Options: (build from discovered projects — pick top 3-4 by recent activity)
+- "{project_name_1}" — {run_count} runs, last active {date}
+- "{project_name_2}" — {run_count} runs, last active {date}
+- "{project_name_3}" — {run_count} runs, last active {date}
+- "Skip — don't use production traces" — Proceed without production data
+```
+
+If a project is selected, pass it as `--langsmith-project` to init.py.
+
 ## Phase 2: Create What's Missing
 
 Three artifacts needed. For each — use existing if found, create if not.
 
 **Harness** (`harness.py`): If user's entry point doesn't match our CLI interface (`--input`, `--output`, `--traces-dir`, `--config`), create a thin wrapper that imports their code. Read their entry point first to understand the I/O format. Ask if unsure.
 
-**Eval** (`eval.py`): If an eval script exists, use it.
+**Eval** (`eval.py`): If an eval script exists, use it. If the user already chose an eval mode in Phase 1.8, follow that choice.
 
-If NO eval exists:
+If NO eval exists and no mode was chosen yet:
 - Copy `eval_passthrough.py` from `$TOOLS/eval_passthrough.py` as the project's eval.py:
 ```bash
 cp $TOOLS/eval_passthrough.py eval.py