harness-evolver 2.9.1 → 3.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (54)
  1. package/README.md +62 -117
  2. package/agents/evolver-architect.md +53 -0
  3. package/agents/evolver-critic.md +44 -0
  4. package/agents/evolver-proposer.md +128 -0
  5. package/agents/evolver-testgen.md +67 -0
  6. package/bin/install.js +181 -171
  7. package/package.json +7 -7
  8. package/skills/deploy/SKILL.md +49 -56
  9. package/skills/evolve/SKILL.md +156 -687
  10. package/skills/setup/SKILL.md +182 -0
  11. package/skills/status/SKILL.md +23 -21
  12. package/tools/read_results.py +240 -0
  13. package/tools/run_eval.py +202 -0
  14. package/tools/seed_from_traces.py +36 -8
  15. package/tools/setup.py +393 -0
  16. package/tools/trace_insights.py +86 -14
  17. package/agents/harness-evolver-architect.md +0 -173
  18. package/agents/harness-evolver-critic.md +0 -132
  19. package/agents/harness-evolver-judge.md +0 -110
  20. package/agents/harness-evolver-proposer.md +0 -317
  21. package/agents/harness-evolver-testgen.md +0 -112
  22. package/examples/classifier/README.md +0 -25
  23. package/examples/classifier/config.json +0 -3
  24. package/examples/classifier/eval.py +0 -58
  25. package/examples/classifier/harness.py +0 -111
  26. package/examples/classifier/tasks/task_001.json +0 -1
  27. package/examples/classifier/tasks/task_002.json +0 -1
  28. package/examples/classifier/tasks/task_003.json +0 -1
  29. package/examples/classifier/tasks/task_004.json +0 -1
  30. package/examples/classifier/tasks/task_005.json +0 -1
  31. package/examples/classifier/tasks/task_006.json +0 -1
  32. package/examples/classifier/tasks/task_007.json +0 -1
  33. package/examples/classifier/tasks/task_008.json +0 -1
  34. package/examples/classifier/tasks/task_009.json +0 -1
  35. package/examples/classifier/tasks/task_010.json +0 -1
  36. package/skills/architect/SKILL.md +0 -93
  37. package/skills/compare/SKILL.md +0 -73
  38. package/skills/critic/SKILL.md +0 -67
  39. package/skills/diagnose/SKILL.md +0 -96
  40. package/skills/import-traces/SKILL.md +0 -102
  41. package/skills/init/SKILL.md +0 -293
  42. package/tools/__pycache__/detect_stack.cpython-313.pyc +0 -0
  43. package/tools/__pycache__/init.cpython-313.pyc +0 -0
  44. package/tools/__pycache__/seed_from_traces.cpython-313.pyc +0 -0
  45. package/tools/__pycache__/trace_logger.cpython-313.pyc +0 -0
  46. package/tools/eval_llm_judge.py +0 -233
  47. package/tools/eval_passthrough.py +0 -55
  48. package/tools/evaluate.py +0 -255
  49. package/tools/import_traces.py +0 -229
  50. package/tools/init.py +0 -531
  51. package/tools/llm_api.py +0 -125
  52. package/tools/state.py +0 -219
  53. package/tools/test_growth.py +0 -230
  54. package/tools/trace_logger.py +0 -42
package/skills/architect/SKILL.md
@@ -1,93 +0,0 @@
- ---
- name: harness-evolver:architect
- description: "Use when the user wants to analyze harness architecture, get a topology recommendation, understand if their agent pattern is optimal, or after stagnation in the evolution loop."
- argument-hint: "[--force]"
- allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
- ---
-
- # /harness-evolver:architect
-
- Analyze the current harness architecture and recommend the optimal multi-agent topology.
-
- ## Prerequisites
-
- `.harness-evolver/` must exist. If not, tell user to run `harness-evolver:init` first.
-
- ```bash
- if [ ! -d ".harness-evolver" ]; then
- echo "ERROR: .harness-evolver/ not found. Run /harness-evolver:init first."
- exit 1
- fi
- ```
-
- ## Resolve Tool Path
-
- ```bash
- TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
- ```
-
- Use `$TOOLS` prefix for all tool calls below.
-
- ## What To Do
-
- 1. Check `.harness-evolver/` exists.
-
- 2. Run architecture analysis tool:
- ```bash
- python3 $TOOLS/analyze_architecture.py \
- --harness .harness-evolver/baseline/harness.py \
- -o .harness-evolver/architecture_signals.json
- ```
-
- If evolution has run, add trace and score data:
- ```bash
- python3 $TOOLS/analyze_architecture.py \
- --harness .harness-evolver/harnesses/{best}/harness.py \
- --traces-dir .harness-evolver/harnesses/{best}/traces \
- --summary .harness-evolver/summary.json \
- -o .harness-evolver/architecture_signals.json
- ```
-
- 3. Dispatch using the Agent tool with `subagent_type: "harness-evolver-architect"`:
-
- ```
- Agent(
- subagent_type: "harness-evolver-architect",
- description: "Architect: topology analysis",
- prompt: |
- <objective>
- Analyze the harness architecture and recommend the optimal multi-agent topology.
- {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
- {If called by user: "The user requested an architecture analysis."}
- </objective>
-
- <files_to_read>
- - .harness-evolver/architecture_signals.json
- - .harness-evolver/config.json
- - .harness-evolver/baseline/harness.py
- - .harness-evolver/summary.json (if exists)
- - .harness-evolver/PROPOSER_HISTORY.md (if exists)
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/architecture.json
- - .harness-evolver/architecture.md
- </output>
-
- <success_criteria>
- - Classifies current topology correctly
- - Recommendation includes migration path with concrete steps
- - Considers detected stack and API key availability
- - Confidence rating is honest (low/medium/high)
- </success_criteria>
- )
- ```
-
- 4. Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
-
- 5. Print summary: current -> recommended, confidence, migration steps.
-
- ## Arguments
-
- `--force` — re-run analysis even if `architecture.json` already exists. Without this flag, if `architecture.json` exists, just display the existing recommendation.
package/skills/compare/SKILL.md
@@ -1,73 +0,0 @@
- ---
- name: harness-evolver:compare
- description: "Use when the user wants to compare two harness versions, understand what changed between iterations, see why one version scored better than another, or debug a regression."
- argument-hint: "<vA> <vB>"
- allowed-tools: [Read, Bash, Glob, Grep]
- ---
-
- # /harness-evolver:compare
-
- Compare two harness versions side by side.
-
- ## Arguments
-
- - `vA` — first version (e.g., `v001`, `baseline`)
- - `vB` — second version (e.g., `v003`)
-
- If only one version given, compare it against the current best.
- If no versions given, compare the two most recent.
-
- ## What To Do
-
- ### 1. Code Diff
-
- ```bash
- diff .harness-evolver/harnesses/{vA}/harness.py .harness-evolver/harnesses/{vB}/harness.py
- ```
-
- If config changed:
- ```bash
- diff .harness-evolver/harnesses/{vA}/config.json .harness-evolver/harnesses/{vB}/config.json
- ```
-
- ### 2. Score Comparison
-
- ```bash
- cat .harness-evolver/harnesses/{vA}/scores.json
- cat .harness-evolver/harnesses/{vB}/scores.json
- ```
-
- Report: combined_score delta, per-task wins/losses.
-
- ### 3. Per-Task Analysis
-
- For tasks where scores diverge, show what each version produced:
-
- ```bash
- cat .harness-evolver/harnesses/{vA}/traces/task_{ID}/output.json
- cat .harness-evolver/harnesses/{vB}/traces/task_{ID}/output.json
- ```
-
- ### 4. Proposal Context
-
- ```bash
- cat .harness-evolver/harnesses/{vB}/proposal.md
- ```
-
- Show what the proposer intended and whether the result matched expectations.
-
- ## Report Format
-
- ```
- v001 (0.62) vs v003 (0.71) — +0.09 improvement
-
- Code changes:
- + Added few-shot examples (3 examples)
- ~ Changed prompt template
- - Removed retry logic
-
- Per-task:
- task_001: 1.0 → 1.0 (unchanged)
- task_007: 0.0 → 1.0 (FIXED — was cardiac, now correctly classified)
- task_008: 1.0 → 0.0 (REGRESSION — was neurological, now wrong)
- ```
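The removed `compare` skill leaves the per-task wins/losses bookkeeping to the agent. A minimal Python sketch of that classification, assuming `scores.json` maps task IDs to float scores (the exact layout is not shown in this diff, and `per_task_delta` is an illustrative helper, not part of the package):

```python
def per_task_delta(scores_a: dict, scores_b: dict) -> dict:
    """Classify each task as FIXED, REGRESSION, or unchanged between
    two versions, assuming scores.json maps task_id -> float score."""
    report = {}
    for task in sorted(set(scores_a) | set(scores_b)):
        a = scores_a.get(task, 0.0)
        b = scores_b.get(task, 0.0)
        report[task] = "FIXED" if b > a else "REGRESSION" if b < a else "unchanged"
    return report

v001 = {"task_001": 1.0, "task_007": 0.0, "task_008": 1.0}
v003 = {"task_001": 1.0, "task_007": 1.0, "task_008": 0.0}
print(per_task_delta(v001, v003))
# {'task_001': 'unchanged', 'task_007': 'FIXED', 'task_008': 'REGRESSION'}
```

This reproduces the three labels used in the skill's report format example.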
package/skills/critic/SKILL.md
@@ -1,67 +0,0 @@
- ---
- name: harness-evolver:critic
- description: "Use when scores converge suspiciously fast, eval quality is questionable, the harness reaches 1.0 in few iterations, or the user wants to validate that improvements are genuine. Also triggers automatically when score jumps >0.3 in one iteration."
- allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
- ---
-
- # /harness-evolver:critic
-
- Analyze eval quality and detect eval gaming.
-
- ## Resolve Tool Path
-
- ```bash
- TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
- ```
-
- ## Prerequisites
-
- `.harness-evolver/` must exist with at least one evaluated version (v001+).
-
- ## What To Do
-
- 1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).
-
- 2. Dispatch using the Agent tool with `subagent_type: "harness-evolver-critic"`:
-
- ```
- Agent(
- subagent_type: "harness-evolver-critic",
- description: "Critic: analyze eval quality",
- prompt: |
- <objective>
- Analyze eval quality for this harness evolution project.
- The best version is {version} with score {score} achieved in {iterations} iteration(s).
- {Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
- </objective>
-
- <files_to_read>
- - .harness-evolver/eval/eval.py
- - .harness-evolver/summary.json
- - .harness-evolver/harnesses/{best_version}/scores.json
- - .harness-evolver/harnesses/{best_version}/harness.py
- - .harness-evolver/harnesses/{best_version}/proposal.md
- - .harness-evolver/config.json
- - .harness-evolver/langsmith_stats.json (if exists)
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/critic_report.md (human-readable analysis)
- - .harness-evolver/eval/eval_improved.py (if weaknesses found)
- </output>
-
- <success_criteria>
- - Identifies specific weaknesses in eval.py with examples
- - If gaming detected, shows exact tasks/outputs that expose the weakness
- - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
- - Re-scores the best version with improved eval to quantify the difference
- </success_criteria>
- )
- ```
-
- 3. Wait for `## CRITIC REPORT COMPLETE`.
-
- 4. Report findings to user. If `eval_improved.py` was written:
- - Show score comparison (current eval vs improved eval)
- - Ask: "Adopt the improved eval? This will affect future iterations."
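The critic skill's description mentions an automatic trigger "when score jumps >0.3 in one iteration." A minimal Python sketch of such a check, assuming `summary.json` yields an ordered score history of `{"version", "score"}` entries (that layout is an assumption; the real file is not shown in this diff):

```python
def flag_suspicious(history: list, jump_threshold: float = 0.3) -> list:
    """Return versions whose score jumped more than the threshold
    relative to the previous iteration (history layout assumed)."""
    flags = []
    for prev, cur in zip(history, history[1:]):
        if cur["score"] - prev["score"] > jump_threshold:
            flags.append(cur["version"])
    return flags

history = [
    {"version": "baseline", "score": 0.40},
    {"version": "v001", "score": 0.55},
    {"version": "v002", "score": 0.95},  # +0.40 in one iteration: suspicious
]
print(flag_suspicious(history))
# ['v002']
```

A flagged version would be the one passed to the critic subagent as the "specific concern."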
package/skills/diagnose/SKILL.md
@@ -1,96 +0,0 @@
- ---
- name: harness-evolver:diagnose
- description: "Use when the user wants to understand why a specific harness version failed, investigate a regression, analyze trace data, or debug a low score. Also use when the user says 'why did v003 fail' or 'what went wrong'."
- argument-hint: "[version]"
- allowed-tools: [Read, Bash, Glob, Grep]
- ---
-
- # /harness-evolver:diagnose
-
- Deep analysis of a harness version's execution traces and scores.
-
- ## Arguments
-
- - `version` — version to diagnose (e.g., `v003`). If not given, diagnose the worst or most recent regression.
-
- ## Resolve Tool Path
-
- ```bash
- TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
- ```
-
- ## What To Do
-
- ### 1. Identify the Version
-
- If not specified, find the worst or most recent regression:
-
- ```bash
- python3 $TOOLS/state.py show --base-dir .harness-evolver
- cat .harness-evolver/summary.json
- ```
-
- ### 2. Score Breakdown
-
- ```bash
- cat .harness-evolver/harnesses/{version}/scores.json
- ```
-
- Identify which tasks failed (`score: 0.0`) and which passed.
-
- ### 3. Trace Analysis (failed tasks)
-
- For each failed task:
-
- ```bash
- cat .harness-evolver/harnesses/{version}/traces/{task_id}/input.json
- cat .harness-evolver/harnesses/{version}/traces/{task_id}/output.json
- ```
-
- Look for patterns: wrong format? wrong category? empty output? crash?
-
- ### 4. Error Search
-
- ```bash
- grep -r "error\|Error\|FAIL\|exception\|Traceback" .harness-evolver/harnesses/{version}/traces/
- cat .harness-evolver/harnesses/{version}/traces/stderr.log
- ```
-
- ### 5. Compare with Parent
-
- Read the proposal to find the parent version:
-
- ```bash
- cat .harness-evolver/harnesses/{version}/proposal.md
- ```
-
- Then diff:
-
- ```bash
- diff .harness-evolver/harnesses/{parent}/harness.py .harness-evolver/harnesses/{version}/harness.py
- ```
-
- ### 6. LangSmith (if available)
-
- If `langsmith-cli` is installed and LangSmith is configured:
-
- ```bash
- langsmith-cli --json runs list --project harness-evolver-{version} --failed --fields id,name,error,inputs
- langsmith-cli --json runs stats --project harness-evolver-{version}
- ```
-
- ### 7. Report
-
- ```
- Diagnosis: v003 (score: 0.31) — REGRESSION from v001 (0.62)
-
- Root cause: Prompt template change broke JSON parsing
- - 4/10 tasks returned malformed output
- - stderr shows: json.JSONDecodeError on 4 tasks
- - The change on line 42 removed the "Reply with ONLY..." instruction
-
- Affected tasks: task_002, task_005, task_007, task_010
- Unaffected tasks: task_001, task_003, task_004, task_006, task_008, task_009
-
- Recommendation: Revert the prompt change, keep the retry logic from v002
- ```
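The diagnose skill's steps 2–3 amount to a triage loop over failed tasks. A Python sketch of that loop, assuming the `traces/{task_id}/output.json` layout shown in the commands above and a `scores` dict mapping task IDs to floats (`failure_patterns` is an illustrative helper, not part of the package):

```python
import json
import os
import pathlib
import tempfile

def failure_patterns(traces_dir, scores: dict) -> dict:
    """Label each failed task (score 0.0) as crash, malformed output,
    empty output, or wrong answer, based on its output.json."""
    patterns = {}
    for task_id, score in scores.items():
        if score > 0.0:
            continue  # only triage failures
        out_path = pathlib.Path(traces_dir) / task_id / "output.json"
        if not out_path.exists():
            patterns[task_id] = "crash (no output)"
            continue
        try:
            out = json.loads(out_path.read_text())
        except json.JSONDecodeError:
            patterns[task_id] = "malformed output"
            continue
        patterns[task_id] = "empty output" if not out else "wrong answer"
    return patterns

# Demo with a synthetic traces directory.
tmp = tempfile.mkdtemp()
os.makedirs(f"{tmp}/task_002")
pathlib.Path(f"{tmp}/task_002/output.json").write_text("{not json")
print(failure_patterns(tmp, {"task_001": 1.0, "task_002": 0.0}))
# {'task_002': 'malformed output'}
```

In the v003 example above, four tasks landing in the "malformed output" bucket is what points at the broken JSON-parsing prompt.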
package/skills/import-traces/SKILL.md
@@ -1,102 +0,0 @@
- ---
- name: harness-evolver:import-traces
- description: "Use when the user wants to import real production traces from LangSmith as test tasks, convert traces to eval tasks, enrich their eval set with real-world data, or pull production data into their harness evaluation."
- argument-hint: "[--project NAME] [--limit N]"
- allowed-tools: [Read, Write, Edit, Bash, Glob, Grep]
- ---
-
- # /harness-evolver:import-traces
-
- Import production traces from LangSmith and convert them into eval tasks. This enriches the test suite with real-world inputs, prioritizing traces with negative user feedback.
-
- ## Prerequisites
-
- - `.harness-evolver/` must exist. If not, tell user to run `harness-evolver:init` first.
- - `langsmith-cli` must be available. Check:
-
- ```bash
- which langsmith-cli 2>/dev/null
- ```
-
- If not found: "Install langsmith-cli first: `uv tool install langsmith-cli && langsmith-cli auth login`"
-
- ## Resolve Tool Path
-
- ```bash
- TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
- ```
-
- ## Parse Arguments
-
- - `--project NAME` — LangSmith project name (if not provided, discover interactively)
- - `--limit N` — max traces to import (default: 20)
-
- ## Phase 1: Discover Projects
-
- If `--project` not provided, list available projects:
-
- ```bash
- langsmith-cli --json projects list --limit 20 2>/dev/null
- ```
-
- Show the user a list of projects with run counts. Let them pick one, or use the most recent.
-
- If `--project` is provided, use it directly.
-
- ## Phase 2: Fetch Traces
-
- ```bash
- langsmith-cli --json runs list \
- --project "{project_name}" \
- --limit {limit} \
- --fields id,name,inputs,outputs,error,feedback_stats,total_tokens \
- > /tmp/harness_import_traces.json 2>/dev/null
- ```
-
- Check the output has data:
- ```bash
- python3 -c "import json; data=json.load(open('/tmp/harness_import_traces.json')); print(f'{len(data)} traces fetched')"
- ```
-
- If no traces found, tell user the project may be empty or the name may be wrong.
-
- ## Phase 3: Convert to Tasks
-
- ```bash
- python3 $TOOLS/import_traces.py \
- --traces-json /tmp/harness_import_traces.json \
- --output-dir .harness-evolver/eval/tasks/ \
- --prefix imported \
- --max-tasks {limit}
- ```
-
- ## Phase 4: Report
-
- Read the tool output and report:
- - How many traces were imported
- - How many had negative feedback (high priority)
- - How many were skipped (no extractable input, duplicates)
- - Total tasks now in eval set
-
- ```bash
- ls .harness-evolver/eval/tasks/*.json | wc -l
- ```
-
- Print:
- ```
- Imported {N} production traces as eval tasks.
- {M} with negative user feedback (high priority)
- {K} skipped (no input or duplicates)
- Total eval tasks: {total}
-
- Next: run `harness-evolver:evolve` to optimize against real-world inputs.
- ```
-
- ## Gotchas
-
- - Traces with no extractable user input are skipped (e.g., system-only runs)
- - Duplicate traces (same run ID) are automatically skipped
- - Imported tasks are tagged with `metadata.source: "imported"` and `metadata.type: "production"`
- - Tasks with negative feedback get `metadata.user_feedback: "negative"` — the proposer should prioritize these
- - The `metadata.langsmith_run_id` field links back to the original trace for debugging
- - Cleanup: `rm /tmp/harness_import_traces.json` after import
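The deleted `import_traces.py` (item 49 in the file list) is not shown in full here. A minimal Python sketch of the conversion the Gotchas describe, using the metadata field names they document; the input field names follow the `runs list --fields` flags above, and `trace_to_task` is an illustrative helper, not the tool's real API:

```python
def trace_to_task(trace: dict):
    """Convert one LangSmith run record into an eval task dict,
    or return None for traces with no extractable input."""
    user_input = trace.get("inputs")
    if not user_input:
        return None  # system-only runs are skipped
    task = {
        "input": user_input,
        "metadata": {
            "source": "imported",
            "type": "production",
            "langsmith_run_id": trace.get("id"),  # links back for debugging
        },
    }
    # Negative feedback marks the task high priority for the proposer.
    feedback = trace.get("feedback_stats") or {}
    if feedback.get("negative"):
        task["metadata"]["user_feedback"] = "negative"
    return task

traces = [
    {"id": "run-1", "inputs": {"text": "chest pain"}, "feedback_stats": {"negative": 1}},
    {"id": "run-2", "inputs": None},  # skipped: no extractable input
]
tasks = [t for t in (trace_to_task(tr) for tr in traces) if t]
print(len(tasks), tasks[0]["metadata"]["langsmith_run_id"])
# 1 run-1
```

Deduplication by run ID, as the Gotchas note, would happen on top of this by tracking seen `langsmith_run_id` values.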