harness-evolver 2.9.0 → 2.9.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "harness-evolver",
-   "version": "2.9.0",
+   "version": "2.9.1",
    "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
    "author": "Raphael Valdetaro",
    "license": "MIT",
@@ -28,23 +28,34 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo

  If no `--iterations` argument was provided, ask the user interactively:

- Use AskUserQuestion with TWO questions:
-
- ```
- Question 1: "How many evolution iterations?"
- Header: "Iterations"
- Options:
- - "3 (quick)" Fast exploration, good for testing setup
- - "5 (balanced)" — Good trade-off between speed and quality
- - "10 (thorough)" — Deep optimization, takes longer
-
- Question 2: "Stop early if score reaches?"
- Header: "Target"
- Options:
- - "0.8 (good enough)" — Stop when the harness is reasonably good
- - "0.9 (high quality)" — Stop when quality is high
- - "0.95 (near perfect)" — Push for near-perfect scores
- - "No limit" Run all iterations regardless of score
+ Use AskUserQuestion with TWO questions in a single call (simple single-select, no preview needed):
+
+ ```json
+ {
+   "questions": [
+     {
+       "question": "How many evolution iterations?",
+       "header": "Iterations",
+       "multiSelect": false,
+       "options": [
+         {"label": "3 (quick)", "description": "Fast exploration, good for testing setup. ~15 min."},
+         {"label": "5 (balanced)", "description": "Good trade-off between speed and quality. ~30 min."},
+         {"label": "10 (thorough)", "description": "Deep optimization with adaptive strategies. ~1 hour."}
+       ]
+     },
+     {
+       "question": "Stop early if score reaches?",
+       "header": "Target",
+       "multiSelect": false,
+       "options": [
+         {"label": "0.8 (good enough)", "description": "Stop when the harness is reasonably good"},
+         {"label": "0.9 (high quality)", "description": "Stop when quality is high"},
+         {"label": "0.95 (near perfect)", "description": "Push for near-perfect scores"},
+         {"label": "No limit", "description": "Run all iterations regardless of score"}
+       ]
+     }
+   ]
+ }
  ```

  Apply the answers:
@@ -56,15 +56,35 @@ If user chose "Let me adjust paths", ask which paths to change and update accord

  ## Phase 1.8: Eval Mode (Interactive — only if NO eval found)

- If no eval.py was detected, ask the user which evaluation mode to use:
-
- ```
- Question: "No eval script found. How should outputs be scored?"
- Header: "Eval mode"
- Options:
- - "LLM-as-judge (zero-config)" — Claude Code scores outputs on accuracy, completeness, relevance, hallucination. No expected answers needed.
- - "Keyword matching" Simple string matching against expected answers. Requires 'expected' field in tasks.
- - "I'll provide my own eval.py" — Pause and let user create their eval script.
+ If no eval.py was detected, ask the user which evaluation mode to use.
+
+ Use AskUserQuestion with **preview** (single-select with side-by-side preview):
+
+ ```json
+ {
+   "questions": [{
+     "question": "No eval script found. How should outputs be scored?",
+     "header": "Eval mode",
+     "multiSelect": false,
+     "options": [
+       {
+         "label": "LLM-as-judge (zero-config)",
+         "description": "Claude Code scores outputs automatically. No expected answers needed.",
+         "preview": "## LLM-as-Judge\n\nScoring dimensions:\n- **Accuracy** (40%) — correctness of output\n- **Completeness** (20%) — covers all aspects\n- **Relevance** (20%) — focused on the question\n- **No-Hallucination** (20%) — supported by facts\n\nEach scored 1-5, normalized to 0.0-1.0.\n\n**Requirements:** None. Works with any task format.\n\n```json\n{\"id\": \"task_001\", \"input\": \"your question\"}\n```"
+       },
+       {
+         "label": "Keyword matching",
+         "description": "Check if expected substrings appear in the output. Requires 'expected' field.",
+         "preview": "## Keyword Matching\n\nSimple deterministic scoring:\n- Score 1.0 if ALL expected keywords found in output\n- Score 0.0 otherwise\n\n**Requirements:** Tasks must include `expected` field:\n\n```json\n{\n \"id\": \"task_001\",\n \"input\": \"What is the capital of France?\",\n \"expected\": \"Paris\"\n}\n```\n\nFast, deterministic, no LLM calls during eval."
+       },
+       {
+         "label": "I'll provide my own eval.py",
+         "description": "Pause setup. You write the eval script following the contract.",
+         "preview": "## Custom Eval Contract\n\nYour eval.py must accept:\n```\npython3 eval.py \\\n --results-dir DIR \\\n --tasks-dir DIR \\\n --scores OUTPUT.json\n```\n\nMust write scores.json:\n```json\n{\n \"combined_score\": 0.85,\n \"per_task\": {\n \"task_001\": {\"score\": 0.9},\n \"task_002\": {\"score\": 0.8}\n }\n}\n```\n\nScores must be 0.0 to 1.0."
+       }
+     ]
+   }]
+ }
  ```

  If "LLM-as-judge": copy eval_passthrough.py as eval.py.
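The "Custom Eval Contract" preview above pins down the CLI flags and the scores.json shape a user-supplied eval.py must follow. As a sketch of that contract, here is a minimal keyword-matching implementation; the on-disk layout it assumes (one `*.json` task per file with `id`/`input`/`expected` fields, one `{id}.txt` output per task in the results dir) is illustrative only and may not match the package's actual conventions:

```python
"""Minimal sketch of an eval.py satisfying the contract described above.

Assumed (not confirmed by the package): tasks-dir holds *.json files with
"id", "input", and optional "expected"; results-dir holds {id}.txt outputs.
"""
import argparse
import json
from pathlib import Path


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--results-dir", required=True)
    parser.add_argument("--tasks-dir", required=True)
    parser.add_argument("--scores", required=True)
    args = parser.parse_args(argv)

    per_task = {}
    for task_file in sorted(Path(args.tasks_dir).glob("*.json")):
        task = json.loads(task_file.read_text())
        result_path = Path(args.results_dir) / f"{task['id']}.txt"
        output = result_path.read_text() if result_path.exists() else ""
        expected = task.get("expected", "")
        keywords = expected if isinstance(expected, list) else [expected]
        # Keyword matching per the preview: 1.0 only if ALL keywords appear.
        score = 1.0 if all(k.lower() in output.lower() for k in keywords) else 0.0
        per_task[task["id"]] = {"score": score}

    # Combined score is the mean per-task score, kept in the 0.0-1.0 range.
    combined = sum(t["score"] for t in per_task.values()) / max(len(per_task), 1)
    Path(args.scores).write_text(
        json.dumps({"combined_score": combined, "per_task": per_task}, indent=2)
    )
```

In a real eval.py you would add an `if __name__ == "__main__": main()` guard so the harness can invoke it exactly as the contract's command line shows.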
@@ -79,17 +99,37 @@ If a LangSmith API key is available, discover projects and ask which one has pro
  langsmith-cli --json projects list --limit 10 2>/dev/null
  ```

- Use AskUserQuestion:
- ```
- Question: "LangSmith detected. Which project has your production traces?"
- Header: "LangSmith"
- Options: (build from discovered projects — pick top 3-4 by recent activity)
- - "{project_name_1}" {run_count} runs, last active {date}
- - "{project_name_2}" — {run_count} runs, last active {date}
- - "{project_name_3}" — {run_count} runs, last active {date}
- - "Skip — don't use production traces" — Proceed without production data
+ Use AskUserQuestion with **preview** (single-select with side-by-side). Build options dynamically from the discovered projects:
+
+ ```json
+ {
+   "questions": [{
+     "question": "LangSmith detected. Which project has your production traces?",
+     "header": "LangSmith",
+     "multiSelect": false,
+     "options": [
+       {
+         "label": "{project_name_1}",
+         "description": "{run_count} runs, last active {date}",
+         "preview": "## {project_name_1}\n\n- **Runs:** {run_count}\n- **Last active:** {date}\n- **Created:** {created_date}\n\nSelecting this project will:\n1. Fetch up to 100 recent traces\n2. Analyze traffic distribution and error patterns\n3. Generate production_seed.md for testgen\n4. Proposers will see real usage data"
+       },
+       {
+         "label": "{project_name_2}",
+         "description": "{run_count} runs, last active {date}",
+         "preview": "## {project_name_2}\n\n- **Runs:** {run_count}\n- **Last active:** {date}\n- **Created:** {created_date}\n\n(same explanation)"
+       },
+       {
+         "label": "Skip",
+         "description": "Don't use production traces",
+         "preview": "## Skip Production Traces\n\nThe evolver will work without production data:\n- Testgen generates synthetic tasks from code analysis\n- No real-world traffic distribution\n- No production error patterns\n\nYou can import traces later with:\n`/harness-evolver:import-traces`"
+       }
+     ]
+   }]
+ }
  ```

+ Build the options from the `langsmith-cli` output. Use up to 3 projects (sorted by most recent activity) + the "Skip" option. Fill in actual values for run_count, date, etc.
+
  If a project is selected, pass it as `--langsmith-project` to init.py.

  ## Phase 2: Create What's Missing
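The diff's instruction to build AskUserQuestion options dynamically from the `langsmith-cli` output could be sketched as below. The CLI's JSON schema is not shown in the diff, so the field names used here (`name`, `run_count`, `last_run_at`) are assumptions for illustration, and the per-project `preview` strings are omitted for brevity:

```python
"""Sketch: turn discovered LangSmith projects into AskUserQuestion options.

Assumed (not confirmed by the package): `langsmith-cli --json projects list`
emits a JSON array of objects with "name", "run_count", and "last_run_at".
"""
import json


def build_langsmith_options(cli_output: str, limit: int = 3) -> list[dict]:
    projects = json.loads(cli_output)
    # Most recently active first, matching "sorted by most recent activity".
    projects.sort(key=lambda p: p.get("last_run_at", ""), reverse=True)

    options = [
        {
            "label": p["name"],
            "description": f"{p.get('run_count', '?')} runs, "
                           f"last active {p.get('last_run_at', '?')}",
        }
        for p in projects[:limit]
    ]
    # The "Skip" option is always appended last, as the diff specifies.
    options.append({"label": "Skip", "description": "Don't use production traces"})
    return options
```

The returned list would slot into the `options` array of the AskUserQuestion JSON shown in the diff, with `preview` text filled in per project.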