harness-evolver 0.2.0 → 0.4.0

package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "0.2.0",
+ "version": "0.4.0",
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
  "author": "Raphael Valdetaro Christi Cordeiro",
  "license": "MIT",
@@ -0,0 +1,73 @@
+ ---
+ name: compare
+ description: "Use when the user wants to compare two harness versions, understand what changed between iterations, see why one version scored better than another, or debug a regression."
+ argument-hint: "<vA> <vB>"
+ allowed-tools: [Read, Bash, Glob, Grep]
+ ---
+
+ # /harness-evolver:compare
+
+ Compare two harness versions side by side.
+
+ ## Arguments
+
+ - `vA` — first version (e.g., `v001`, `baseline`)
+ - `vB` — second version (e.g., `v003`)
+
+ If only one version given, compare it against the current best.
+ If no versions given, compare the two most recent.
+
+ ## What To Do
+
+ ### 1. Code Diff
+
+ ```bash
+ diff .harness-evolver/harnesses/{vA}/harness.py .harness-evolver/harnesses/{vB}/harness.py
+ ```
+
+ If config changed:
+ ```bash
+ diff .harness-evolver/harnesses/{vA}/config.json .harness-evolver/harnesses/{vB}/config.json
+ ```
+
+ ### 2. Score Comparison
+
+ ```bash
+ cat .harness-evolver/harnesses/{vA}/scores.json
+ cat .harness-evolver/harnesses/{vB}/scores.json
+ ```
+
+ Report: combined_score delta, per-task wins/losses.
+
+ ### 3. Per-Task Analysis
+
+ For tasks where scores diverge, show what each version produced:
+
+ ```bash
+ cat .harness-evolver/harnesses/{vA}/traces/task_{ID}/output.json
+ cat .harness-evolver/harnesses/{vB}/traces/task_{ID}/output.json
+ ```
+
+ ### 4. Proposal Context
+
+ ```bash
+ cat .harness-evolver/harnesses/{vB}/proposal.md
+ ```
+
+ Show what the proposer intended and whether the result matched expectations.
+
+ ## Report Format
+
+ ```
+ v001 (0.62) vs v003 (0.71) — +0.09 improvement
+
+ Code changes:
+ + Added few-shot examples (3 examples)
+ ~ Changed prompt template
+ - Removed retry logic
+
+ Per-task:
+ task_001: 1.0 → 1.0 (unchanged)
+ task_007: 0.0 → 1.0 (FIXED — was cardiac, now correctly classified)
+ task_008: 1.0 → 0.0 (REGRESSION — was neurological, now wrong)
+ ```
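Reviewer sketch (not part of the package): the score comparison in step 2 of the compare skill could look roughly like the Python below. The `scores.json` layout assumed here, a top-level `combined_score` plus a `per_task` map from task id to `{"score": ...}`, is a guess for illustration only, since the diff never shows the real schema. `compare_scores` and `load_scores` are hypothetical helper names.

```python
import json
from pathlib import Path


def compare_scores(scores_a: dict, scores_b: dict) -> dict:
    """Summarize how version B differs from version A.

    Assumed (hypothetical) schema for each scores dict:
      {"combined_score": float, "per_task": {"task_001": {"score": float}, ...}}
    """
    delta = scores_b["combined_score"] - scores_a["combined_score"]
    tasks_a = scores_a.get("per_task", {})
    tasks_b = scores_b.get("per_task", {})

    fixed, regressed, unchanged = [], [], []
    # Only tasks present in both versions can be compared.
    for task_id in sorted(tasks_a.keys() & tasks_b.keys()):
        before = tasks_a[task_id]["score"]
        after = tasks_b[task_id]["score"]
        if after > before:
            fixed.append(task_id)
        elif after < before:
            regressed.append(task_id)
        else:
            unchanged.append(task_id)

    return {
        "delta": round(delta, 4),
        "fixed": fixed,
        "regressed": regressed,
        "unchanged": unchanged,
    }


def load_scores(base_dir: str, version: str) -> dict:
    """Read scores.json for one version from the directory layout used in the skill."""
    path = Path(base_dir) / "harnesses" / version / "scores.json"
    return json.loads(path.read_text())
```

With real data one would call `compare_scores(load_scores(".harness-evolver", "v001"), load_scores(".harness-evolver", "v003"))` and then read the traces only for the task ids listed under `fixed` and `regressed`.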
@@ -0,0 +1,53 @@
+ ---
+ name: deploy
+ description: "Use when the user wants to use the best evolved harness in their project, promote a version to production, copy the winning harness back, or is done evolving and wants to apply the result."
+ argument-hint: "[version]"
+ allowed-tools: [Read, Write, Bash, Glob]
+ ---
+
+ # /harness-evolver:deploy
+
+ Promote the best (or specified) harness version back to the user's project.
+
+ ## Arguments
+
+ - `version` — optional. If not given, deploys the best version from `summary.json`.
+
+ ## What To Do
+
+ ### 1. Identify Best Version
+
+ ```bash
+ python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(s['best']['version'], s['best']['combined_score'])"
+ ```
+
+ Or use the user-specified version.
+
+ ### 2. Show What Will Be Deployed
+
+ ```bash
+ cat .harness-evolver/harnesses/{version}/proposal.md
+ cat .harness-evolver/harnesses/{version}/scores.json
+ ```
+
+ Report: version, score, improvement over baseline, what changed.
+
+ ### 3. Ask for Confirmation
+
+ > Deploy `{version}` (score: {score}, +{delta} over baseline) to your project?
+ > This will copy `harness.py` and `config.json` to the project root.
+
+ ### 4. Copy Files
+
+ ```bash
+ cp .harness-evolver/harnesses/{version}/harness.py ./harness.py
+ cp .harness-evolver/harnesses/{version}/config.json ./config.json # if exists
+ ```
+
+ If the original entry point had a different name (e.g., `graph.py`), ask the user where to put it.
+
+ ### 5. Report
+
+ - What was copied and where
+ - Score improvement: baseline → deployed version
+ - Suggest: review the diff before committing
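Reviewer sketch (not part of the package): steps 1 and 4 of the deploy skill, expressed as Python instead of shell. It relies only on the `summary.json` keys visible in the one-liner above (`best.version`) and on the `harnesses/{version}/` layout. The function names `best_version` and `deploy` are hypothetical, and the confirmation prompt from step 3 is deliberately left out.

```python
import json
import shutil
from pathlib import Path
from typing import List, Optional


def best_version(base_dir: Path) -> str:
    """Return the best version recorded in summary.json (key names as in step 1)."""
    summary = json.loads((base_dir / "summary.json").read_text())
    return summary["best"]["version"]


def deploy(base_dir: Path, dest: Path, version: Optional[str] = None) -> List[Path]:
    """Copy harness.py, and config.json when present, for one version into dest."""
    version = version or best_version(base_dir)
    src = base_dir / "harnesses" / version
    copied = []
    for name in ("harness.py", "config.json"):
        source = src / name
        if name == "config.json" and not source.exists():
            continue  # config is optional, mirroring the "# if exists" comment
        target = dest / name
        shutil.copy2(source, target)  # raises FileNotFoundError if harness.py is missing
        copied.append(target)
    return copied
```

Returning the list of copied paths makes the "what was copied and where" part of the final report straightforward. A caller would still need to ask the user before overwriting existing files.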
@@ -0,0 +1,96 @@
+ ---
+ name: diagnose
+ description: "Use when the user wants to understand why a specific harness version failed, investigate a regression, analyze trace data, or debug a low score. Also use when the user says 'why did v003 fail' or 'what went wrong'."
+ argument-hint: "[version]"
+ allowed-tools: [Read, Bash, Glob, Grep]
+ ---
+
+ # /harness-evolver:diagnose
+
+ Deep analysis of a harness version's execution traces and scores.
+
+ ## Arguments
+
+ - `version` — version to diagnose (e.g., `v003`). If not given, diagnose the worst or most recent regression.
+
+ ## Resolve Tool Path
+
+ ```bash
+ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
+ ```
+
+ ## What To Do
+
+ ### 1. Identify the Version
+
+ If not specified, find the worst or most recent regression:
+
+ ```bash
+ python3 $TOOLS/state.py show --base-dir .harness-evolver
+ cat .harness-evolver/summary.json
+ ```
+
+ ### 2. Score Breakdown
+
+ ```bash
+ cat .harness-evolver/harnesses/{version}/scores.json
+ ```
+
+ Identify which tasks failed (`score: 0.0`) and which passed.
+
+ ### 3. Trace Analysis (failed tasks)
+
+ For each failed task:
+
+ ```bash
+ cat .harness-evolver/harnesses/{version}/traces/{task_id}/input.json
+ cat .harness-evolver/harnesses/{version}/traces/{task_id}/output.json
+ ```
+
+ Look for patterns: wrong format? wrong category? empty output? crash?
+
+ ### 4. Error Search
+
+ ```bash
+ grep -r "error\|Error\|FAIL\|exception\|Traceback" .harness-evolver/harnesses/{version}/traces/
+ cat .harness-evolver/harnesses/{version}/traces/stderr.log
+ ```
+
+ ### 5. Compare with Parent
+
+ Read the proposal to find the parent version:
+
+ ```bash
+ cat .harness-evolver/harnesses/{version}/proposal.md
+ ```
+
+ Then diff:
+
+ ```bash
+ diff .harness-evolver/harnesses/{parent}/harness.py .harness-evolver/harnesses/{version}/harness.py
+ ```
+
+ ### 6. LangSmith (if available)
+
+ If `langsmith-cli` is installed and LangSmith is configured:
+
+ ```bash
+ langsmith-cli --json runs list --project harness-evolver-{version} --failed --fields id,name,error,inputs
+ langsmith-cli --json runs stats --project harness-evolver-{version}
+ ```
+
+ ### 7. Report
+
+ ```
+ Diagnosis: v003 (score: 0.31) — REGRESSION from v001 (0.62)
+
+ Root cause: Prompt template change broke JSON parsing
+ - 4/10 tasks returned malformed output
+ - stderr shows: json.JSONDecodeError on 4 tasks
+ - The change on line 42 removed the "Reply with ONLY..." instruction
+
+ Affected tasks: task_002, task_005, task_007, task_010
+ Unaffected tasks: task_001, task_003, task_004, task_006, task_008, task_009
+
+ Recommendation: Revert the prompt change, keep the retry logic from v002
+ ```
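Reviewer sketch (not part of the package): steps 2 and 4 of the diagnose skill can be approximated in Python as follows. The regular expression is a direct translation of the `grep` pattern above. The `per_task` schema for `scores.json` is again an assumption, and `failed_tasks` and `search_errors` are hypothetical names.

```python
import re
from pathlib import Path

# Same alternatives as the grep pattern in step 4, in Python regex syntax.
ERROR_PATTERN = re.compile(r"error|Error|FAIL|exception|Traceback")


def failed_tasks(scores: dict) -> list:
    """Return ids of tasks scored 0.0, under the assumed per_task schema."""
    return sorted(
        task_id
        for task_id, result in scores.get("per_task", {}).items()
        if result.get("score") == 0.0
    )


def search_errors(traces_dir: Path) -> dict:
    """Map each trace file (relative path) to its lines that match ERROR_PATTERN."""
    hits = {}
    for path in sorted(traces_dir.rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a trace is not valid UTF-8.
        lines = path.read_text(errors="replace").splitlines()
        matches = [line for line in lines if ERROR_PATTERN.search(line)]
        if matches:
            hits[str(path.relative_to(traces_dir))] = matches
    return hits
```

Grouping the hits by file makes it easier to see whether one error (such as the `json.JSONDecodeError` in the example report) recurs across the failed tasks.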
@@ -0,0 +1,94 @@
+ ---
+ name: evolve
+ description: "Use when the user wants to run the optimization loop, improve harness performance, evolve the harness, or iterate on harness quality. Requires .harness-evolver/ to exist (run harness-evolver:init first)."
+ argument-hint: "[--iterations N]"
+ allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
+ ---
+
+ # /harness-evolve
+
+ Run the autonomous propose-evaluate-iterate loop.
+
+ ## Prerequisites
+
+ `.harness-evolver/summary.json` must exist. If not, tell user to run `harness-evolver:init`.
+
+ ## Resolve Tool Path
+
+ ```bash
+ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
+ ```
+
+ ## Parse Arguments
+
+ - `--iterations N` (default: 10)
+ - Read `config.json` for `evolution.stagnation_limit` (default: 3) and `evolution.target_score`
+
+ ## The Loop
+
+ For each iteration:
+
+ ### 1. Get Next Version
+
+ ```bash
+ python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
+ ```
+
+ ### 2. Propose
+
+ Spawn the `harness-evolver-proposer` agent:
+
+ > You are proposing iteration {i}. Create version {version} in `.harness-evolver/harnesses/{version}/`.
+ > Working directory contains `.harness-evolver/` with all prior candidates and traces.
+
+ The proposer creates: `harness.py`, `config.json`, `proposal.md`.
+
+ ### 3. Validate
+
+ ```bash
+ python3 $TOOLS/evaluate.py validate \
+ --harness .harness-evolver/harnesses/{version}/harness.py \
+ --config .harness-evolver/harnesses/{version}/config.json
+ ```
+
+ If fails: one retry via proposer. If still fails: score 0.0, continue.
+
+ ### 4. Evaluate
+
+ ```bash
+ python3 $TOOLS/evaluate.py run \
+ --harness .harness-evolver/harnesses/{version}/harness.py \
+ --config .harness-evolver/harnesses/{version}/config.json \
+ --tasks-dir .harness-evolver/eval/tasks/ \
+ --eval .harness-evolver/eval/eval.py \
+ --traces-dir .harness-evolver/harnesses/{version}/traces/ \
+ --scores .harness-evolver/harnesses/{version}/scores.json \
+ --timeout 60
+ ```
+
+ ### 5. Update State
+
+ ```bash
+ python3 $TOOLS/state.py update \
+ --base-dir .harness-evolver \
+ --version {version} \
+ --scores .harness-evolver/harnesses/{version}/scores.json \
+ --proposal .harness-evolver/harnesses/{version}/proposal.md
+ ```
+
+ ### 6. Report
+
+ Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
+
+ ### 7. Check Stop Conditions
+
+ - **Stagnation**: last 3 scores within 1% of each other → stop
+ - **Target**: `combined_score >= target_score` → stop
+ - **N reached**: done
+
+ ## When Loop Ends — Final Report
+
+ - Best version and score
+ - Improvement over baseline (absolute and %)
+ - Total iterations run
+ - Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."
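Reviewer sketch (not part of the package): the version numbering from step 1 and the stop conditions from step 7 of the evolve skill, written as plain Python. Two assumptions are made here. First, "within 1% of each other" is read as an absolute spread of at most 0.01 on a 0 to 1 score scale; a relative reading is equally plausible. Second, the history is passed in as a simple list of `combined_score` values, although the real `summary.json` structure is not shown in the diff. `next_version` and `should_stop` are hypothetical names.

```python
from typing import List, Optional


def next_version(iterations: int) -> str:
    """Zero-padded version label, matching the step 1 one-liner."""
    return f"v{iterations + 1:03d}"


def should_stop(
    scores: List[float],
    max_iterations: int,
    stagnation_limit: int = 3,
    target_score: Optional[float] = None,
    tolerance: float = 0.01,  # assumed meaning of "within 1%"
) -> Optional[str]:
    """Return the reason to stop, or None to keep iterating.

    `scores` holds one combined_score per completed iteration, oldest first.
    """
    if scores and target_score is not None and scores[-1] >= target_score:
        return "target"
    recent = scores[-stagnation_limit:]
    if len(recent) == stagnation_limit and max(recent) - min(recent) <= tolerance:
        return "stagnation"
    if len(scores) >= max_iterations:
        return "iterations"
    return None
```

Note that the removed 0.2.0 skill phrased stagnation differently ("3 consecutive iterations without >1% improvement"), so the two versions may not stop at the same point on the same score history.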
@@ -0,0 +1,66 @@
+ ---
+ name: init
+ description: "Use when the user wants to set up harness optimization in their project, optimize an LLM agent, improve a harness, or mentions harness-evolver for the first time in a project without .harness-evolver/ directory."
+ argument-hint: "[directory]"
+ allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
+ ---
+
+ # /harness-evolve-init
+
+ Set up the Harness Evolver in a project. Scans the codebase, identifies the entry point, creates missing artifacts, runs baseline evaluation.
+
+ ## Resolve Tool Path
+
+ ```bash
+ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
+ ```
+
+ Use `$TOOLS` prefix for all tool calls below.
+
+ ## Phase 1: Scan
+
+ ```bash
+ find . -maxdepth 3 -type f -name "*.py" | head -30
+ python3 $TOOLS/detect_stack.py .
+ ```
+
+ Look for:
+ - Entry points: files with `if __name__`, or named `main.py`, `app.py`, `agent.py`, `graph.py`, `pipeline.py`, `bot.py`
+ - Existing eval: `eval.py`, `score.py`, `judge.py`
+ - Existing tasks: directories with JSON files containing `id` + `input` fields
+ - Config: `config.json`, `config.yaml`, `.env`
+
+ ## Phase 2: Create What's Missing
+
+ Three artifacts needed. For each — use existing if found, create if not.
+
+ **Harness** (`harness.py`): If user's entry point doesn't match our CLI interface (`--input`, `--output`, `--traces-dir`, `--config`), create a thin wrapper that imports their code. Read their entry point first to understand the I/O format. Ask if unsure.
+
+ **Eval** (`eval.py`): Ask the user what "correct" means for their domain. Generate the simplest eval that gives signal. Even rough scoring works — the evolver iterates.
+
+ **Tasks** (`tasks/`): If no test data exists, ask the user for 5-10 example input/output pairs. Each task is `{"id": "task_001", "input": "...", "expected": "...", "metadata": {}}`.
+
+ ## Phase 3: Run Init
+
+ ```bash
+ python3 $TOOLS/init.py [directory] \
+ --harness harness.py --eval eval.py --tasks tasks/ \
+ --tools-dir $TOOLS
+ ```
+
+ Add `--harness-config config.json` if a config exists.
+
+ ## After Init — Report
+
+ - What was detected vs created
+ - Stack + integrations (LangSmith, Context7)
+ - Baseline score
+ - Next: `harness-evolver:evolve` to start
+
+ ## Gotchas
+
+ - The harness must write valid JSON to `--output`. If the user's code returns non-JSON, the wrapper must serialize it.
+ - Tasks must have unique `id` fields. Duplicate IDs cause silent eval errors.
+ - The `expected` field is never shown to the harness — only the eval script sees it.
+ - If `.harness-evolver/` already exists, warn before overwriting.
+ - If no Python files exist in CWD, the user is probably in the wrong directory.
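Reviewer sketch (not part of the package): a minimal example of the "thin wrapper" described in Phase 2 of the init skill, plus a check for the duplicate-id gotcha. Only the four flag names (`--input`, `--output`, `--traces-dir`, `--config`) and the task fields `id` and `input` come from the diff. The shape of the input file (one task object), the output payload, the `trace.json` file name, and the `run_agent` placeholder are all assumptions made for illustration.

```python
import argparse
import json
from pathlib import Path


def run_agent(text, config):
    """Placeholder for the user's own entry point (hypothetical)."""
    return text.upper()


def duplicate_ids(tasks):
    """Return task ids that occur more than once, sorted."""
    seen, dupes = set(), set()
    for task in tasks:
        if task["id"] in seen:
            dupes.add(task["id"])
        seen.add(task["id"])
    return sorted(dupes)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Thin harness wrapper (sketch)")
    parser.add_argument("--input", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--traces-dir")
    parser.add_argument("--config")
    args = parser.parse_args(argv)

    # Assumed input format: a single task object without the `expected` field.
    task = json.loads(Path(args.input).read_text())
    config = json.loads(Path(args.config).read_text()) if args.config else {}

    result = run_agent(task["input"], config)

    # The harness must write valid JSON to --output, so wrap the raw result;
    # default=str serializes values that json cannot encode natively.
    payload = {"id": task["id"], "output": result}
    Path(args.output).write_text(json.dumps(payload, default=str))

    if args.traces_dir:
        traces = Path(args.traces_dir)
        traces.mkdir(parents=True, exist_ok=True)
        (traces / "trace.json").write_text(json.dumps({"input": task["input"]}))
    return 0
```

In a real `harness.py` one would call `main()` under the usual `if __name__ == "__main__":` guard and replace `run_agent` with an import of the project's own entry point. Running `duplicate_ids` over the loaded task files before init is one way to surface the silent eval errors the gotcha list warns about.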
@@ -0,0 +1,34 @@
+ ---
+ name: status
+ description: "Use when the user asks about evolution progress, current scores, best harness version, how many iterations ran, or whether the loop is stagnating. Also use when the user says 'status', 'progress', or 'how is it going'."
+ allowed-tools: [Read, Bash]
+ ---
+
+ # /harness-evolve-status
+
+ Show evolution progress.
+
+ ## Resolve Tool Path
+
+ ```bash
+ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
+ ```
+
+ ## What To Do
+
+ If `.harness-evolver/` does not exist, tell user to run `harness-evolver:init` first.
+
+ Otherwise:
+
+ ```bash
+ python3 $TOOLS/state.py show --base-dir .harness-evolver
+ ```
+
+ Then read and display `.harness-evolver/STATE.md` for the full history table.
+
+ ## If User Wants More Detail
+
+ - Scores per task: `cat .harness-evolver/harnesses/{version}/scores.json`
+ - What changed: `cat .harness-evolver/harnesses/{version}/proposal.md`
+ - Compare two versions: `diff .harness-evolver/harnesses/{vA}/harness.py .harness-evolver/harnesses/{vB}/harness.py`
+ - Full history: `cat .harness-evolver/PROPOSER_HISTORY.md`
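Reviewer sketch (not part of the package): the status skill delegates to `state.py show`, whose output is not visible in this diff. As a rough illustration of the same idea, the Python below builds a small history table directly from the per-version `scores.json` files. It assumes each file has a top-level `combined_score` (the key name appears in the skills, the file layout does not), and `score_history` and `format_history` are hypothetical names, not the tool's API.

```python
import json
from pathlib import Path


def score_history(base_dir: Path) -> list:
    """Return (version, combined_score) pairs sorted by version directory name."""
    rows = []
    for scores_path in sorted((base_dir / "harnesses").glob("*/scores.json")):
        scores = json.loads(scores_path.read_text())
        rows.append((scores_path.parent.name, scores.get("combined_score")))
    return rows


def format_history(rows: list) -> str:
    """Render the pairs as a plain-text table, marking the top score."""
    scored = [score for _, score in rows if score is not None]
    best = max(scored) if scored else None
    lines = []
    for version, score in rows:
        marker = " (best)" if score is not None and score == best else ""
        shown = "n/a" if score is None else f"{score:.2f}"
        lines.append(f"{version:<10}{shown}{marker}")
    return "\n".join(lines)
```

Sorting by directory name places `baseline` before the `vNNN` versions only because of alphabetical order, so a project with other naming would need its own ordering.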
@@ -1,93 +0,0 @@
- ---
- name: harness-evolve
- description: "Run the harness evolution loop. Autonomously proposes, evaluates, and iterates on harness designs using full execution traces as feedback."
- argument-hint: "[--iterations N] [--candidates-per-iter N]"
- allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
- ---
-
- # /harness-evolve
-
- Run the Meta-Harness optimization loop.
-
- ## Arguments
-
- - `--iterations N` (default: 10) — number of evolution iterations
- - `--candidates-per-iter N` (default: 1) — harnesses per iteration
-
- ## Prerequisites
-
- Run `/harness-evolve-init` first. The `.harness-evolver/` directory must exist with a valid `summary.json`.
-
- ## The Loop
-
- For each iteration i from 1 to N:
-
- ### 1. PROPOSE
-
- Determine the next version number by reading `summary.json`:
-
- ```bash
- python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
- ```
-
- Spawn the `harness-evolver-proposer` agent with this prompt:
-
- > You are proposing iteration {i}. Create version {version_number} in `.harness-evolver/harnesses/{version_number}/`.
- > Working directory contains `.harness-evolver/` with all prior candidates and traces.
-
- The proposer agent will create:
- - `.harness-evolver/harnesses/v{NNN}/harness.py`
- - `.harness-evolver/harnesses/v{NNN}/config.json`
- - `.harness-evolver/harnesses/v{NNN}/proposal.md`
-
- ### 2. VALIDATE
-
- ```bash
- python3 ~/.harness-evolver/tools/evaluate.py validate \
- --harness .harness-evolver/harnesses/v{NNN}/harness.py \
- --config .harness-evolver/harnesses/v{NNN}/config.json
- ```
-
- If validation fails, ask the proposer to fix (1 retry). If it fails again, set score to 0.0 and continue.
-
- ### 3. EVALUATE
-
- ```bash
- python3 ~/.harness-evolver/tools/evaluate.py run \
- --harness .harness-evolver/harnesses/v{NNN}/harness.py \
- --config .harness-evolver/harnesses/v{NNN}/config.json \
- --tasks-dir .harness-evolver/eval/tasks/ \
- --eval .harness-evolver/eval/eval.py \
- --traces-dir .harness-evolver/harnesses/v{NNN}/traces/ \
- --scores .harness-evolver/harnesses/v{NNN}/scores.json \
- --timeout 60
- ```
-
- ### 4. UPDATE STATE
-
- ```bash
- python3 ~/.harness-evolver/tools/state.py update \
- --base-dir .harness-evolver \
- --version v{NNN} \
- --scores .harness-evolver/harnesses/v{NNN}/scores.json \
- --proposal .harness-evolver/harnesses/v{NNN}/proposal.md
- ```
-
- ### 5. REPORT
-
- Read the updated `summary.json` and report:
- - `Iteration {i}/{N}: v{NNN} scored {score} (best: v{best} at {best_score})`
- - If regression (score < parent score): warn
- - If new best: celebrate
-
- ### Stop Conditions
-
- - All N iterations completed
- - **Stagnation**: 3 consecutive iterations without >1% improvement. Read `summary.json` history to check.
- - **Target reached**: if `config.json` has `target_score` set and achieved.
-
- When stopping, report final summary: best version, score, number of iterations, improvement over baseline.
-
- ## Tool Path Resolution
-
- Check `.harness-evolver/tools/` first (local override), then `~/.harness-evolver/tools/` (global install).
@@ -1,50 +0,0 @@
- ---
- name: harness-evolve-init
- description: "Initialize harness evolution in the current project. Auto-detects harness.py, eval.py, and tasks/ in the working directory."
- argument-hint: "[directory] [--harness <path>] [--eval <path>] [--tasks <path>]"
- allowed-tools: [Read, Write, Bash, Glob]
- ---
-
- # /harness-evolve-init
-
- Initialize the Harness Evolver for this project.
-
- ## Usage
-
- ```
- /harness-evolve-init # auto-detect everything in CWD
- /harness-evolve-init ./my-project # auto-detect in a specific directory
- /harness-evolve-init --harness run.py # override one path, auto-detect the rest
- ```
-
- ## How Auto-Detection Works
-
- The tool scans the directory for:
- 1. **Exact names:** `harness.py`, `eval.py`, `tasks/`, `config.json`
- 2. **Fuzzy fallback:** `*harness*`, `*agent*`, `*run*` for harness; `*eval*`, `*score*` for eval; any dir with JSON files containing `id`/`input` fields for tasks
-
- If all 3 are found, init proceeds immediately. If something is missing, it reports what's needed.
-
- ## What To Do
-
- Run the init tool:
-
- ```bash
- python3 ~/.harness-evolver/tools/init.py {directory if provided} \
- --tools-dir ~/.harness-evolver/tools
- ```
-
- Add explicit flags only if the user provided them:
- - `--harness PATH` — override harness auto-detection
- - `--eval PATH` — override eval auto-detection
- - `--tasks PATH` — override tasks auto-detection
- - `--harness-config PATH` — optional config for the harness
-
- If `~/.harness-evolver/tools/init.py` does not exist, check `.harness-evolver/tools/init.py` (local override).
-
- After init completes, report:
- - What was detected (harness, eval, tasks)
- - Baseline score
- - Number of tasks
- - Integrations detected (LangSmith, Context7, stack)
- - Next step: run `/harness-evolve` to start the optimization loop
@@ -1,25 +0,0 @@
- ---
- name: harness-evolve-status
- description: "Show the current status of harness evolution: best score, iteration count, progress history."
- allowed-tools: [Read, Bash]
- ---
-
- # /harness-evolve-status
-
- Show the current evolution status.
-
- ## What To Do
-
- ```bash
- python3 ~/.harness-evolver/tools/state.py show --base-dir .harness-evolver
- ```
-
- If that doesn't exist, try:
-
- ```bash
- python3 .harness-evolver/tools/state.py show --base-dir .harness-evolver
- ```
-
- Also read and display the contents of `.harness-evolver/STATE.md` for the full status table.
-
- If `.harness-evolver/` doesn't exist, tell the user to run `/harness-evolve-init` first.