harness-evolver 2.8.1 → 2.9.1

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "2.8.1",
+ "version": "2.9.1",
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
  "author": "Raphael Valdetaro",
  "license": "MIT",
@@ -2,7 +2,7 @@
  name: harness-evolver:deploy
  description: "Use when the user wants to use the best evolved harness in their project, promote a version to production, copy the winning harness back, or is done evolving and wants to apply the result."
  argument-hint: "[version]"
- allowed-tools: [Read, Write, Bash, Glob]
+ allowed-tools: [Read, Write, Bash, Glob, AskUserQuestion]
  ---

  # /harness-evolver:deploy
@@ -32,19 +32,48 @@ cat .harness-evolver/harnesses/{version}/scores.json

  Report: version, score, improvement over baseline, what changed.

- ### 3. Ask for Confirmation
+ ### 3. Ask Deploy Options (Interactive)

- > Deploy `{version}` (score: {score}, +{delta} over baseline) to your project?
- > This will copy `harness.py` and `config.json` to the project root.
+ Use AskUserQuestion with TWO questions:
+
+ ```
+ Question 1: "Where should the evolved harness go?"
+ Header: "Deploy to"
+ Options:
+ - "Overwrite original" — Replace {original_harness_path} with the evolved version
+ - "Copy to new file" — Save as harness_evolved.py alongside the original
+ - "Just show the diff" — Don't copy anything, just show what changed
+ ```
+
+ ```
+ Question 2 (ONLY if user chose "Overwrite original"):
+ "Back up the current harness before overwriting?"
+ Header: "Backup"
+ Options:
+ - "Yes, backup first" — Save current as {harness}.bak before overwriting
+ - "No, just overwrite" — Replace directly (git history has the original)
+ ```

  ### 4. Copy Files

+ Based on the user's choices:
+
+ **If "Overwrite original"**:
+ - If backup: `cp {original_harness} {original_harness}.bak`
+ - Then: `cp .harness-evolver/harnesses/{version}/harness.py {original_harness}`
+ - Copy config.json if exists
+
+ **If "Copy to new file"**:
  ```bash
- cp .harness-evolver/harnesses/{version}/harness.py ./harness.py
- cp .harness-evolver/harnesses/{version}/config.json ./config.json # if exists
+ cp .harness-evolver/harnesses/{version}/harness.py ./harness_evolved.py
+ cp .harness-evolver/harnesses/{version}/config.json ./config_evolved.json # if exists
  ```

- If the original entry point had a different name (e.g., `graph.py`), ask the user where to put it.
+ **If "Just show the diff"**:
+ ```bash
+ diff {original_harness} .harness-evolver/harnesses/{version}/harness.py
+ ```
+ Do not copy anything.

  ### 5. Report

@@ -2,7 +2,7 @@
  name: harness-evolver:evolve
  description: "Use when the user wants to run the optimization loop, improve harness performance, evolve the harness, or iterate on harness quality. Requires .harness-evolver/ to exist (run harness-evolver:init first)."
  argument-hint: "[--iterations N]"
- allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
+ allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion]
  ---

  # /harness-evolver:evolve
@@ -24,6 +24,46 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo
  - `--iterations N` (default: 10)
  - Read `config.json` for `evolution.stagnation_limit` (default: 3) and `evolution.target_score`

+ ## Pre-Loop: Interactive Configuration
+
+ If no `--iterations` argument was provided, ask the user interactively:
+
+ Use AskUserQuestion with TWO questions in a single call (simple single-select, no preview needed):
+
+ ```json
+ {
+   "questions": [
+     {
+       "question": "How many evolution iterations?",
+       "header": "Iterations",
+       "multiSelect": false,
+       "options": [
+         {"label": "3 (quick)", "description": "Fast exploration, good for testing setup. ~15 min."},
+         {"label": "5 (balanced)", "description": "Good trade-off between speed and quality. ~30 min."},
+         {"label": "10 (thorough)", "description": "Deep optimization with adaptive strategies. ~1 hour."}
+       ]
+     },
+     {
+       "question": "Stop early if score reaches?",
+       "header": "Target",
+       "multiSelect": false,
+       "options": [
+         {"label": "0.8 (good enough)", "description": "Stop when the harness is reasonably good"},
+         {"label": "0.9 (high quality)", "description": "Stop when quality is high"},
+         {"label": "0.95 (near perfect)", "description": "Push for near-perfect scores"},
+         {"label": "No limit", "description": "Run all iterations regardless of score"}
+       ]
+     }
+   ]
+ }
+ ```
+
+ Apply the answers:
+ - Set iterations from question 1 (3, 5, or 10)
+ - Set target_score from question 2 (0.8, 0.9, 0.95, or None)
+
+ If `--iterations` WAS provided as argument, skip these questions and use the argument value.
+
  ## The Loop

  For each iteration:
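Combined with the config values read above (`evolution.stagnation_limit`, default 3, and the optional `evolution.target_score`), the loop's stop conditions might be sketched as follows. `should_stop` and its parameter names are illustrative, not the package's actual code.

```python
# Sketch of the evolve loop's stop conditions; names are illustrative.
def should_stop(iteration, max_iterations, best_scores,
                target_score=None, stagnation_limit=3):
    if iteration >= max_iterations:
        return True  # ran all requested iterations
    if target_score is not None and best_scores and max(best_scores) >= target_score:
        return True  # early stop: target score reached
    if len(best_scores) > stagnation_limit:
        # stagnation: no improvement over the last `stagnation_limit` scores
        recent = best_scores[-stagnation_limit:]
        if max(recent) <= max(best_scores[:-stagnation_limit]):
            return True
    return False
```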
@@ -2,7 +2,7 @@
  name: harness-evolver:init
  description: "Use when the user wants to set up harness optimization in their project, optimize an LLM agent, improve a harness, or mentions harness-evolver for the first time in a project without .harness-evolver/ directory."
  argument-hint: "[directory]"
- allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
+ allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion]
  ---

  # /harness-evolve-init
@@ -30,15 +30,117 @@ Look for:
  - Existing tasks: directories with JSON files containing `id` + `input` fields
  - Config: `config.json`, `config.yaml`, `.env`

+ ## Phase 1.5: Confirm Detection (Interactive)
+
+ After scanning, present what was found and ask the user to confirm before proceeding.
+
+ Use AskUserQuestion:
+
+ ```
+ Question: "Here's what I detected. Does this look right?"
+ Header: "Confirm"
+ Options:
+ - "Looks good, proceed" — Continue with detected paths
+ - "Let me adjust paths" — User will provide correct paths
+ - "Start over in different directory" — Abort and let user cd elsewhere
+
+ Show in the question description:
+ - Harness: {path or "not found"}
+ - Eval: {path or "not found — will use LLM-as-judge"}
+ - Tasks: {path with N files, or "not found — will generate"}
+ - Stack: {detected frameworks or "none detected"}
+ - Architecture: {topology or "unknown"}
+ ```
+
+ If user chose "Let me adjust paths", ask which paths to change and update accordingly.
+
+ ## Phase 1.8: Eval Mode (Interactive — only if NO eval found)
+
+ If no eval.py was detected, ask the user which evaluation mode to use.
+
+ Use AskUserQuestion with **preview** (single-select with side-by-side preview):
+
+ ```json
+ {
+   "questions": [{
+     "question": "No eval script found. How should outputs be scored?",
+     "header": "Eval mode",
+     "multiSelect": false,
+     "options": [
+       {
+         "label": "LLM-as-judge (zero-config)",
+         "description": "Claude Code scores outputs automatically. No expected answers needed.",
+         "preview": "## LLM-as-Judge\n\nScoring dimensions:\n- **Accuracy** (40%) — correctness of output\n- **Completeness** (20%) — covers all aspects\n- **Relevance** (20%) — focused on the question\n- **No-Hallucination** (20%) — supported by facts\n\nEach scored 1-5, normalized to 0.0-1.0.\n\n**Requirements:** None. Works with any task format.\n\n```json\n{\"id\": \"task_001\", \"input\": \"your question\"}\n```"
+       },
+       {
+         "label": "Keyword matching",
+         "description": "Check if expected substrings appear in the output. Requires 'expected' field.",
+         "preview": "## Keyword Matching\n\nSimple deterministic scoring:\n- Score 1.0 if ALL expected keywords found in output\n- Score 0.0 otherwise\n\n**Requirements:** Tasks must include `expected` field:\n\n```json\n{\n \"id\": \"task_001\",\n \"input\": \"What is the capital of France?\",\n \"expected\": \"Paris\"\n}\n```\n\nFast, deterministic, no LLM calls during eval."
+       },
+       {
+         "label": "I'll provide my own eval.py",
+         "description": "Pause setup. You write the eval script following the contract.",
+         "preview": "## Custom Eval Contract\n\nYour eval.py must accept:\n```\npython3 eval.py \\\n --results-dir DIR \\\n --tasks-dir DIR \\\n --scores OUTPUT.json\n```\n\nMust write scores.json:\n```json\n{\n \"combined_score\": 0.85,\n \"per_task\": {\n \"task_001\": {\"score\": 0.9},\n \"task_002\": {\"score\": 0.8}\n }\n}\n```\n\nScores must be 0.0 to 1.0."
+       }
+     ]
+   }]
+ }
+ ```
+
+ If "LLM-as-judge": copy eval_passthrough.py as eval.py.
+ If "Keyword matching": create a simple keyword eval (check if expected substrings appear in output).
+ If "I'll provide my own": print instructions for the eval contract and wait.
+
+ ## Phase 1.9: LangSmith Project (Interactive — only if LANGSMITH_API_KEY detected)
+
+ If a LangSmith API key is available, discover projects and ask which one has production traces:
+
+ ```bash
+ langsmith-cli --json projects list --limit 10 2>/dev/null
+ ```
+
+ Use AskUserQuestion with **preview** (single-select with side-by-side). Build options dynamically from the discovered projects:
+
+ ```json
+ {
+   "questions": [{
+     "question": "LangSmith detected. Which project has your production traces?",
+     "header": "LangSmith",
+     "multiSelect": false,
+     "options": [
+       {
+         "label": "{project_name_1}",
+         "description": "{run_count} runs, last active {date}",
+         "preview": "## {project_name_1}\n\n- **Runs:** {run_count}\n- **Last active:** {date}\n- **Created:** {created_date}\n\nSelecting this project will:\n1. Fetch up to 100 recent traces\n2. Analyze traffic distribution and error patterns\n3. Generate production_seed.md for testgen\n4. Proposers will see real usage data"
+       },
+       {
+         "label": "{project_name_2}",
+         "description": "{run_count} runs, last active {date}",
+         "preview": "## {project_name_2}\n\n- **Runs:** {run_count}\n- **Last active:** {date}\n- **Created:** {created_date}\n\n(same explanation)"
+       },
+       {
+         "label": "Skip",
+         "description": "Don't use production traces",
+         "preview": "## Skip Production Traces\n\nThe evolver will work without production data:\n- Testgen generates synthetic tasks from code analysis\n- No real-world traffic distribution\n- No production error patterns\n\nYou can import traces later with:\n`/harness-evolver:import-traces`"
+       }
+     ]
+   }]
+ }
+ ```
+
+ Build the options from the `langsmith-cli` output. Use up to 3 projects (sorted by most recent activity) + the "Skip" option. Fill in actual values for run_count, date, etc.
+
+ If a project is selected, pass it as `--langsmith-project` to init.py.
+
  ## Phase 2: Create What's Missing

  Three artifacts needed. For each — use existing if found, create if not.

  **Harness** (`harness.py`): If user's entry point doesn't match our CLI interface (`--input`, `--output`, `--traces-dir`, `--config`), create a thin wrapper that imports their code. Read their entry point first to understand the I/O format. Ask if unsure.

- **Eval** (`eval.py`): If an eval script exists, use it.
+ **Eval** (`eval.py`): If an eval script exists, use it. If the user already chose an eval mode in Phase 1.8, follow that choice.

- If NO eval exists:
+ If NO eval exists and no mode was chosen yet:
  - Copy `eval_passthrough.py` from `$TOOLS/eval_passthrough.py` as the project's eval.py:
  ```bash
  cp $TOOLS/eval_passthrough.py eval.py