harness-evolver 2.9.0 → 3.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +62 -117
- package/agents/evolver-architect.md +53 -0
- package/agents/evolver-critic.md +44 -0
- package/agents/evolver-proposer.md +128 -0
- package/agents/evolver-testgen.md +67 -0
- package/bin/install.js +181 -171
- package/package.json +7 -7
- package/skills/deploy/SKILL.md +49 -56
- package/skills/evolve/SKILL.md +180 -700
- package/skills/setup/SKILL.md +182 -0
- package/skills/status/SKILL.md +23 -21
- package/tools/read_results.py +240 -0
- package/tools/run_eval.py +202 -0
- package/tools/seed_from_traces.py +36 -8
- package/tools/setup.py +393 -0
- package/tools/trace_insights.py +86 -14
- package/agents/harness-evolver-architect.md +0 -173
- package/agents/harness-evolver-critic.md +0 -132
- package/agents/harness-evolver-judge.md +0 -110
- package/agents/harness-evolver-proposer.md +0 -317
- package/agents/harness-evolver-testgen.md +0 -112
- package/examples/classifier/README.md +0 -25
- package/examples/classifier/config.json +0 -3
- package/examples/classifier/eval.py +0 -58
- package/examples/classifier/harness.py +0 -111
- package/examples/classifier/tasks/task_001.json +0 -1
- package/examples/classifier/tasks/task_002.json +0 -1
- package/examples/classifier/tasks/task_003.json +0 -1
- package/examples/classifier/tasks/task_004.json +0 -1
- package/examples/classifier/tasks/task_005.json +0 -1
- package/examples/classifier/tasks/task_006.json +0 -1
- package/examples/classifier/tasks/task_007.json +0 -1
- package/examples/classifier/tasks/task_008.json +0 -1
- package/examples/classifier/tasks/task_009.json +0 -1
- package/examples/classifier/tasks/task_010.json +0 -1
- package/skills/architect/SKILL.md +0 -93
- package/skills/compare/SKILL.md +0 -73
- package/skills/critic/SKILL.md +0 -67
- package/skills/diagnose/SKILL.md +0 -96
- package/skills/import-traces/SKILL.md +0 -102
- package/skills/init/SKILL.md +0 -253
- package/tools/__pycache__/detect_stack.cpython-313.pyc +0 -0
- package/tools/__pycache__/init.cpython-313.pyc +0 -0
- package/tools/__pycache__/seed_from_traces.cpython-313.pyc +0 -0
- package/tools/__pycache__/trace_logger.cpython-313.pyc +0 -0
- package/tools/eval_llm_judge.py +0 -233
- package/tools/eval_passthrough.py +0 -55
- package/tools/evaluate.py +0 -255
- package/tools/import_traces.py +0 -229
- package/tools/init.py +0 -531
- package/tools/llm_api.py +0 -125
- package/tools/state.py +0 -219
- package/tools/test_growth.py +0 -230
- package/tools/trace_logger.py +0 -42
package/skills/architect/SKILL.md
DELETED

@@ -1,93 +0,0 @@
----
-name: harness-evolver:architect
-description: "Use when the user wants to analyze harness architecture, get a topology recommendation, understand if their agent pattern is optimal, or after stagnation in the evolution loop."
-argument-hint: "[--force]"
-allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
----
-
-# /harness-evolver:architect
-
-Analyze the current harness architecture and recommend the optimal multi-agent topology.
-
-## Prerequisites
-
-`.harness-evolver/` must exist. If not, tell user to run `harness-evolver:init` first.
-
-```bash
-if [ ! -d ".harness-evolver" ]; then
-  echo "ERROR: .harness-evolver/ not found. Run /harness-evolver:init first."
-  exit 1
-fi
-```
-
-## Resolve Tool Path
-
-```bash
-TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
-```
-
-Use `$TOOLS` prefix for all tool calls below.
-
-## What To Do
-
-1. Check `.harness-evolver/` exists.
-
-2. Run architecture analysis tool:
-```bash
-python3 $TOOLS/analyze_architecture.py \
-  --harness .harness-evolver/baseline/harness.py \
-  -o .harness-evolver/architecture_signals.json
-```
-
-If evolution has run, add trace and score data:
-```bash
-python3 $TOOLS/analyze_architecture.py \
-  --harness .harness-evolver/harnesses/{best}/harness.py \
-  --traces-dir .harness-evolver/harnesses/{best}/traces \
-  --summary .harness-evolver/summary.json \
-  -o .harness-evolver/architecture_signals.json
-```
-
-3. Dispatch using the Agent tool with `subagent_type: "harness-evolver-architect"`:
-
-```
-Agent(
-  subagent_type: "harness-evolver-architect",
-  description: "Architect: topology analysis",
-  prompt: |
-    <objective>
-    Analyze the harness architecture and recommend the optimal multi-agent topology.
-    {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
-    {If called by user: "The user requested an architecture analysis."}
-    </objective>
-
-    <files_to_read>
-    - .harness-evolver/architecture_signals.json
-    - .harness-evolver/config.json
-    - .harness-evolver/baseline/harness.py
-    - .harness-evolver/summary.json (if exists)
-    - .harness-evolver/PROPOSER_HISTORY.md (if exists)
-    </files_to_read>
-
-    <output>
-    Write:
-    - .harness-evolver/architecture.json
-    - .harness-evolver/architecture.md
-    </output>
-
-    <success_criteria>
-    - Classifies current topology correctly
-    - Recommendation includes migration path with concrete steps
-    - Considers detected stack and API key availability
-    - Confidence rating is honest (low/medium/high)
-    </success_criteria>
-)
-```
-
-4. Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
-
-5. Print summary: current -> recommended, confidence, migration steps.
-
-## Arguments
-
-- `--force` — re-run analysis even if `architecture.json` already exists. Without this flag, if `architecture.json` exists, just display the existing recommendation.
package/skills/compare/SKILL.md
DELETED

@@ -1,73 +0,0 @@
----
-name: harness-evolver:compare
-description: "Use when the user wants to compare two harness versions, understand what changed between iterations, see why one version scored better than another, or debug a regression."
-argument-hint: "<vA> <vB>"
-allowed-tools: [Read, Bash, Glob, Grep]
----
-
-# /harness-evolver:compare
-
-Compare two harness versions side by side.
-
-## Arguments
-
-- `vA` — first version (e.g., `v001`, `baseline`)
-- `vB` — second version (e.g., `v003`)
-
-If only one version given, compare it against the current best.
-If no versions given, compare the two most recent.
-
-## What To Do
-
-### 1. Code Diff
-
-```bash
-diff .harness-evolver/harnesses/{vA}/harness.py .harness-evolver/harnesses/{vB}/harness.py
-```
-
-If config changed:
-```bash
-diff .harness-evolver/harnesses/{vA}/config.json .harness-evolver/harnesses/{vB}/config.json
-```
-
-### 2. Score Comparison
-
-```bash
-cat .harness-evolver/harnesses/{vA}/scores.json
-cat .harness-evolver/harnesses/{vB}/scores.json
-```
-
-Report: combined_score delta, per-task wins/losses.
-
-### 3. Per-Task Analysis
-
-For tasks where scores diverge, show what each version produced:
-
-```bash
-cat .harness-evolver/harnesses/{vA}/traces/task_{ID}/output.json
-cat .harness-evolver/harnesses/{vB}/traces/task_{ID}/output.json
-```
-
-### 4. Proposal Context
-
-```bash
-cat .harness-evolver/harnesses/{vB}/proposal.md
-```
-
-Show what the proposer intended and whether the result matched expectations.
-
-## Report Format
-
-```
-v001 (0.62) vs v003 (0.71) — +0.09 improvement
-
-Code changes:
-+ Added few-shot examples (3 examples)
-~ Changed prompt template
-- Removed retry logic
-
-Per-task:
-task_001: 1.0 → 1.0 (unchanged)
-task_007: 0.0 → 1.0 (FIXED — was cardiac, now correctly classified)
-task_008: 1.0 → 0.0 (REGRESSION — was neurological, now wrong)
-```
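The per-task win/loss report described in the deleted compare skill above can be sketched in a few lines of Python. This is a minimal sketch under an assumed `scores.json` shape of `{"combined_score": ..., "per_task": {task_id: score}}` — the `per_task` key is hypothetical, so adjust the field names to whatever the project's eval actually writes.

```python
import json

def compare_scores(a: dict, b: dict) -> list[str]:
    """Per-task win/loss report between two score dicts.

    Assumed (hypothetical) schema -- adjust to the real eval output:
      {"combined_score": 0.62, "per_task": {"task_001": 1.0, ...}}
    """
    delta = b["combined_score"] - a["combined_score"]
    lines = [f"combined: {a['combined_score']:.2f} -> {b['combined_score']:.2f} ({delta:+.2f})"]
    for task in sorted(set(a["per_task"]) | set(b["per_task"])):
        sa = a["per_task"].get(task, 0.0)
        sb = b["per_task"].get(task, 0.0)
        # A task that gained score is FIXED; one that lost score is a REGRESSION.
        tag = "FIXED" if sb > sa else "REGRESSION" if sb < sa else "unchanged"
        lines.append(f"{task}: {sa} -> {sb} ({tag})")
    return lines

# Usage: compare_scores(json.load(open(path_a)), json.load(open(path_b)))
```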
package/skills/critic/SKILL.md
DELETED

@@ -1,67 +0,0 @@
----
-name: harness-evolver:critic
-description: "Use when scores converge suspiciously fast, eval quality is questionable, the harness reaches 1.0 in few iterations, or the user wants to validate that improvements are genuine. Also triggers automatically when score jumps >0.3 in one iteration."
-allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
----
-
-# /harness-evolver:critic
-
-Analyze eval quality and detect eval gaming.
-
-## Resolve Tool Path
-
-```bash
-TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
-```
-
-## Prerequisites
-
-`.harness-evolver/` must exist with at least one evaluated version (v001+).
-
-## What To Do
-
-1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).
-
-2. Dispatch using the Agent tool with `subagent_type: "harness-evolver-critic"`:
-
-```
-Agent(
-  subagent_type: "harness-evolver-critic",
-  description: "Critic: analyze eval quality",
-  prompt: |
-    <objective>
-    Analyze eval quality for this harness evolution project.
-    The best version is {version} with score {score} achieved in {iterations} iteration(s).
-    {Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
-    </objective>
-
-    <files_to_read>
-    - .harness-evolver/eval/eval.py
-    - .harness-evolver/summary.json
-    - .harness-evolver/harnesses/{best_version}/scores.json
-    - .harness-evolver/harnesses/{best_version}/harness.py
-    - .harness-evolver/harnesses/{best_version}/proposal.md
-    - .harness-evolver/config.json
-    - .harness-evolver/langsmith_stats.json (if exists)
-    </files_to_read>
-
-    <output>
-    Write:
-    - .harness-evolver/critic_report.md (human-readable analysis)
-    - .harness-evolver/eval/eval_improved.py (if weaknesses found)
-    </output>
-
-    <success_criteria>
-    - Identifies specific weaknesses in eval.py with examples
-    - If gaming detected, shows exact tasks/outputs that expose the weakness
-    - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
-    - Re-scores the best version with improved eval to quantify the difference
-    </success_criteria>
-)
-```
-
-3. Wait for `## CRITIC REPORT COMPLETE`.
-
-4. Report findings to user. If `eval_improved.py` was written:
-   - Show score comparison (current eval vs improved eval)
-   - Ask: "Adopt the improved eval? This will affect future iterations."
package/skills/diagnose/SKILL.md
DELETED

@@ -1,96 +0,0 @@
----
-name: harness-evolver:diagnose
-description: "Use when the user wants to understand why a specific harness version failed, investigate a regression, analyze trace data, or debug a low score. Also use when the user says 'why did v003 fail' or 'what went wrong'."
-argument-hint: "[version]"
-allowed-tools: [Read, Bash, Glob, Grep]
----
-
-# /harness-evolver:diagnose
-
-Deep analysis of a harness version's execution traces and scores.
-
-## Arguments
-
-- `version` — version to diagnose (e.g., `v003`). If not given, diagnose the worst or most recent regression.
-
-## Resolve Tool Path
-
-```bash
-TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
-```
-
-## What To Do
-
-### 1. Identify the Version
-
-If not specified, find the worst or most recent regression:
-
-```bash
-python3 $TOOLS/state.py show --base-dir .harness-evolver
-cat .harness-evolver/summary.json
-```
-
-### 2. Score Breakdown
-
-```bash
-cat .harness-evolver/harnesses/{version}/scores.json
-```
-
-Identify which tasks failed (`score: 0.0`) and which passed.
-
-### 3. Trace Analysis (failed tasks)
-
-For each failed task:
-
-```bash
-cat .harness-evolver/harnesses/{version}/traces/{task_id}/input.json
-cat .harness-evolver/harnesses/{version}/traces/{task_id}/output.json
-```
-
-Look for patterns: wrong format? wrong category? empty output? crash?
-
-### 4. Error Search
-
-```bash
-grep -r "error\|Error\|FAIL\|exception\|Traceback" .harness-evolver/harnesses/{version}/traces/
-cat .harness-evolver/harnesses/{version}/traces/stderr.log
-```
-
-### 5. Compare with Parent
-
-Read the proposal to find the parent version:
-
-```bash
-cat .harness-evolver/harnesses/{version}/proposal.md
-```
-
-Then diff:
-
-```bash
-diff .harness-evolver/harnesses/{parent}/harness.py .harness-evolver/harnesses/{version}/harness.py
-```
-
-### 6. LangSmith (if available)
-
-If `langsmith-cli` is installed and LangSmith is configured:
-
-```bash
-langsmith-cli --json runs list --project harness-evolver-{version} --failed --fields id,name,error,inputs
-langsmith-cli --json runs stats --project harness-evolver-{version}
-```
-
-### 7. Report
-
-```
-Diagnosis: v003 (score: 0.31) — REGRESSION from v001 (0.62)
-
-Root cause: Prompt template change broke JSON parsing
-- 4/10 tasks returned malformed output
-- stderr shows: json.JSONDecodeError on 4 tasks
-- The change on line 42 removed the "Reply with ONLY..." instruction
-
-Affected tasks: task_002, task_005, task_007, task_010
-Unaffected tasks: task_001, task_003, task_004, task_006, task_008, task_009
-
-Recommendation: Revert the prompt change, keep the retry logic from v002
-```
package/skills/import-traces/SKILL.md
DELETED

@@ -1,102 +0,0 @@
----
-name: harness-evolver:import-traces
-description: "Use when the user wants to import real production traces from LangSmith as test tasks, convert traces to eval tasks, enrich their eval set with real-world data, or pull production data into their harness evaluation."
-argument-hint: "[--project NAME] [--limit N]"
-allowed-tools: [Read, Write, Edit, Bash, Glob, Grep]
----
-
-# /harness-evolver:import-traces
-
-Import production traces from LangSmith and convert them into eval tasks. This enriches the test suite with real-world inputs, prioritizing traces with negative user feedback.
-
-## Prerequisites
-
-- `.harness-evolver/` must exist. If not, tell user to run `harness-evolver:init` first.
-- `langsmith-cli` must be available. Check:
-
-```bash
-which langsmith-cli 2>/dev/null
-```
-
-If not found: "Install langsmith-cli first: `uv tool install langsmith-cli && langsmith-cli auth login`"
-
-## Resolve Tool Path
-
-```bash
-TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
-```
-
-## Parse Arguments
-
-- `--project NAME` — LangSmith project name (if not provided, discover interactively)
-- `--limit N` — max traces to import (default: 20)
-
-## Phase 1: Discover Projects
-
-If `--project` not provided, list available projects:
-
-```bash
-langsmith-cli --json projects list --limit 20 2>/dev/null
-```
-
-Show the user a list of projects with run counts. Let them pick one, or use the most recent.
-
-If `--project` is provided, use it directly.
-
-## Phase 2: Fetch Traces
-
-```bash
-langsmith-cli --json runs list \
-  --project "{project_name}" \
-  --limit {limit} \
-  --fields id,name,inputs,outputs,error,feedback_stats,total_tokens \
-  > /tmp/harness_import_traces.json 2>/dev/null
-```
-
-Check the output has data:
-```bash
-python3 -c "import json; data=json.load(open('/tmp/harness_import_traces.json')); print(f'{len(data)} traces fetched')"
-```
-
-If no traces found, tell user the project may be empty or the name may be wrong.
-
-## Phase 3: Convert to Tasks
-
-```bash
-python3 $TOOLS/import_traces.py \
-  --traces-json /tmp/harness_import_traces.json \
-  --output-dir .harness-evolver/eval/tasks/ \
-  --prefix imported \
-  --max-tasks {limit}
-```
-
-## Phase 4: Report
-
-Read the tool output and report:
-- How many traces were imported
-- How many had negative feedback (high priority)
-- How many were skipped (no extractable input, duplicates)
-- Total tasks now in eval set
-
-```bash
-ls .harness-evolver/eval/tasks/*.json | wc -l
-```
-
-Print:
-```
-Imported {N} production traces as eval tasks.
-{M} with negative user feedback (high priority)
-{K} skipped (no input or duplicates)
-Total eval tasks: {total}
-
-Next: run `harness-evolver:evolve` to optimize against real-world inputs.
-```
-
-## Gotchas
-
-- Traces with no extractable user input are skipped (e.g., system-only runs)
-- Duplicate traces (same run ID) are automatically skipped
-- Imported tasks are tagged with `metadata.source: "imported"` and `metadata.type: "production"`
-- Tasks with negative feedback get `metadata.user_feedback: "negative"` — the proposer should prioritize these
-- The `metadata.langsmith_run_id` field links back to the original trace for debugging
-- Cleanup: `rm /tmp/harness_import_traces.json` after import
package/skills/init/SKILL.md
DELETED

@@ -1,253 +0,0 @@
----
-name: harness-evolver:init
-description: "Use when the user wants to set up harness optimization in their project, optimize an LLM agent, improve a harness, or mentions harness-evolver for the first time in a project without .harness-evolver/ directory."
-argument-hint: "[directory]"
-allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion]
----
-
-# /harness-evolve-init
-
-Set up the Harness Evolver in a project. Scans the codebase, identifies the entry point, creates missing artifacts, runs baseline evaluation.
-
-## Resolve Tool Path
-
-```bash
-TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
-```
-
-Use `$TOOLS` prefix for all tool calls below.
-
-## Phase 1: Scan
-
-```bash
-find . -maxdepth 3 -type f -name "*.py" | head -30
-python3 $TOOLS/detect_stack.py .
-```
-
-Look for:
-- Entry points: files with `if __name__`, or named `main.py`, `app.py`, `agent.py`, `graph.py`, `pipeline.py`, `bot.py`
-- Existing eval: `eval.py`, `score.py`, `judge.py`
-- Existing tasks: directories with JSON files containing `id` + `input` fields
-- Config: `config.json`, `config.yaml`, `.env`
-
-## Phase 1.5: Confirm Detection (Interactive)
-
-After scanning, present what was found and ask the user to confirm before proceeding.
-
-Use AskUserQuestion:
-
-```
-Question: "Here's what I detected. Does this look right?"
-Header: "Confirm"
-Options:
-- "Looks good, proceed" — Continue with detected paths
-- "Let me adjust paths" — User will provide correct paths
-- "Start over in different directory" — Abort and let user cd elsewhere
-
-Show in the question description:
-- Harness: {path or "not found"}
-- Eval: {path or "not found — will use LLM-as-judge"}
-- Tasks: {path with N files, or "not found — will generate"}
-- Stack: {detected frameworks or "none detected"}
-- Architecture: {topology or "unknown"}
-```
-
-If user chose "Let me adjust paths", ask which paths to change and update accordingly.
-
-## Phase 1.8: Eval Mode (Interactive — only if NO eval found)
-
-If no eval.py was detected, ask the user which evaluation mode to use:
-
-```
-Question: "No eval script found. How should outputs be scored?"
-Header: "Eval mode"
-Options:
-- "LLM-as-judge (zero-config)" — Claude Code scores outputs on accuracy, completeness, relevance, hallucination. No expected answers needed.
-- "Keyword matching" — Simple string matching against expected answers. Requires 'expected' field in tasks.
-- "I'll provide my own eval.py" — Pause and let user create their eval script.
-```
-
-If "LLM-as-judge": copy eval_passthrough.py as eval.py.
-If "Keyword matching": create a simple keyword eval (check if expected substrings appear in output).
-If "I'll provide my own": print instructions for the eval contract and wait.
-
-## Phase 1.9: LangSmith Project (Interactive — only if LANGSMITH_API_KEY detected)
-
-If a LangSmith API key is available, discover projects and ask which one has production traces:
-
-```bash
-langsmith-cli --json projects list --limit 10 2>/dev/null
-```
-
-Use AskUserQuestion:
-```
-Question: "LangSmith detected. Which project has your production traces?"
-Header: "LangSmith"
-Options: (build from discovered projects — pick top 3-4 by recent activity)
-- "{project_name_1}" — {run_count} runs, last active {date}
-- "{project_name_2}" — {run_count} runs, last active {date}
-- "{project_name_3}" — {run_count} runs, last active {date}
-- "Skip — don't use production traces" — Proceed without production data
-```
-
-If a project is selected, pass it as `--langsmith-project` to init.py.
-
-## Phase 2: Create What's Missing
-
-Three artifacts needed. For each — use existing if found, create if not.
-
-**Harness** (`harness.py`): If user's entry point doesn't match our CLI interface (`--input`, `--output`, `--traces-dir`, `--config`), create a thin wrapper that imports their code. Read their entry point first to understand the I/O format. Ask if unsure.
-
-**Eval** (`eval.py`): If an eval script exists, use it. If the user already chose an eval mode in Phase 1.8, follow that choice.
-
-If NO eval exists and no mode was chosen yet:
-- Copy `eval_passthrough.py` from `$TOOLS/eval_passthrough.py` as the project's eval.py:
-```bash
-cp $TOOLS/eval_passthrough.py eval.py
-```
-- This passthrough eval collects outputs for the judge subagent to score during evolve.
-- Print: "No eval found. Using LLM-as-judge (Claude Code scores outputs directly)."
-
-**Tasks** (`tasks/`): If test tasks exist, use them.
-
-If NO tasks exist, generate them. First, identify all relevant source files:
-
-```bash
-find . -name "*.py" -not -path "./.venv/*" -not -path "./.harness-evolver/*" | head -10
-find . -name "*.json" -o -name "*.md" -o -name "*.txt" -o -name "*.yaml" -o -name "*.yml" | grep -v .venv | grep -v .harness-evolver | head -10
-```
-
-Then spawn testgen subagent with CONCRETE file paths (not placeholders):
-
-```
-Agent(
-  subagent_type: "harness-evolver-testgen",
-  description: "TestGen: generate 30 test cases",
-  prompt: |
-    <objective>
-    Generate 30 diverse test cases for this project. Write them to the tasks/ directory
-    in the current working directory.
-    </objective>
-
-    <project_context>
-    This project is at: {absolute path to project root}
-    Entry point: {the harness/agent file you identified, e.g., crew.py or pipeline/moderator.py}
-    Framework: {what you detected — CrewAI, LangGraph, etc.}
-    </project_context>
-
-    <files_to_read>
-    {LIST EVERY .py file and data file you found above — use ABSOLUTE PATHS}
-    Example:
-    - /home/rp/Desktop/test-crewai/crew.py
-    - /home/rp/Desktop/test-crewai/README.md
-    </files_to_read>
-
-    <production_traces>
-    {IF .harness-evolver/production_seed.md EXISTS, paste its full contents here.
-    This file contains real production inputs, traffic distribution, error patterns,
-    and user feedback from LangSmith. Use it to generate REALISTIC test cases that
-    match actual usage patterns instead of synthetic ones.
-
-    If the file does not exist, omit this entire block.}
-    </production_traces>
-
-    <output>
-    Create directory tasks/ (at project root) with 30 files: task_001.json through task_030.json.
-    Format: {"id": "task_001", "input": "...", "metadata": {"difficulty": "easy|medium|hard", "type": "standard|edge|cross_domain|adversarial"}}
-    No "expected" field needed — the judge subagent will score outputs.
-    Distribution: 40% standard, 20% edge, 20% cross-domain, 20% adversarial.
-    If production traces are available, match the real traffic distribution instead of uniform.
-    </output>
-)
-```
-
-Wait for `## TESTGEN COMPLETE`. If the subagent fails or returns with no tasks, generate them yourself inline (fallback).
-
-Print: "Generated {N} test cases from code analysis."
-
-If `.harness-evolver/production_seed.md` exists, also print:
-"Tasks enriched with production trace data from LangSmith."
-
-## Phase 3: Run Init
-
-First, check if the project has a LangSmith production project configured:
-
-```bash
-# Auto-detect from env vars or .env
-PROD_PROJECT=$(python3 -c "
-import os
-for v in ('LANGCHAIN_PROJECT', 'LANGSMITH_PROJECT'):
-    p = os.environ.get(v, '')
-    if p: print(p); exit()
-for f in ('.env', '.env.local'):
-    if os.path.exists(f):
-        for line in open(f):
-            line = line.strip()
-            if '=' in line and not line.startswith('#'):
-                k, _, val = line.partition('=')
-                if k.strip() in ('LANGCHAIN_PROJECT', 'LANGSMITH_PROJECT'):
-                    print(val.strip().strip('\"').strip(\"'\"))
-                    exit()
-" 2>/dev/null)
-```
-
-```bash
-python3 $TOOLS/init.py [directory] \
-  --harness harness.py --eval eval.py --tasks tasks/ \
-  --tools-dir $TOOLS \
-  ${PROD_PROJECT:+--langsmith-project "$PROD_PROJECT"}
-```
-
-Add `--harness-config config.json` if a config exists.
-
-For **LLM-powered agents** that make real API calls (LangGraph, CrewAI, etc.) and take
-more than 30 seconds per invocation, increase the validation timeout:
-
-```bash
-python3 $TOOLS/init.py [directory] \
-  --harness harness.py --eval eval.py --tasks tasks/ \
-  --tools-dir $TOOLS \
-  --validation-timeout 120
-```
-
-If validation keeps timing out but you've verified the harness works manually, skip it:
-
-```bash
-python3 $TOOLS/init.py [directory] \
-  --harness harness.py --eval eval.py --tasks tasks/ \
-  --tools-dir $TOOLS \
-  --skip-validation
-```
-
-## After Init — Report
-
-- What was detected vs created
-- Stack + integrations (LangSmith, Context7)
-- Baseline score
-- Next: `harness-evolver:evolve` to start
-
-## Architecture Hint
-
-After init completes, run a quick architecture analysis:
-
-```bash
-python3 $TOOLS/analyze_architecture.py --harness .harness-evolver/baseline/harness.py
-```
-
-If the analysis suggests the current topology may not be optimal for the task complexity, mention it:
-
-> Architecture note: Current topology is "{topology}". For tasks with {characteristics},
-> consider running `/harness-evolver:architect` for a detailed recommendation.
-
-This is advisory only — do not spawn the architect agent.
-
-## Gotchas
-
-- The harness must write valid JSON to `--output`. If the user's code returns non-JSON, the wrapper must serialize it.
-- Tasks must have unique `id` fields. Duplicate IDs cause silent eval errors.
-- The `expected` field is never shown to the harness — only the eval script sees it.
-- If `.harness-evolver/` already exists, warn before overwriting.
-- If no Python files exist in CWD, the user is probably in the wrong directory.
-- **Monorepo / venv mismatch**: In monorepos with dedicated venvs per app, the system `python3` may differ from the project's Python version. The harness wrapper should re-exec with the correct venv Python. The tools now use `sys.executable` instead of hardcoded `python3`.
-- **Stale site-packages**: If the project uses editable installs (`pip install -e .`), packages in `site-packages/` may have stale copies of data files (e.g. registry YAMLs). Run `uv pip install -e . --force-reinstall --no-deps` to sync.
-- **Validation timeout**: LLM agents making real API calls typically take 15-60s per invocation. Use `--validation-timeout 120` or `--skip-validation` to handle this.

Binary files (deleted, no textual diff shown):
- package/tools/__pycache__/detect_stack.cpython-313.pyc
- package/tools/__pycache__/init.cpython-313.pyc
- package/tools/__pycache__/seed_from_traces.cpython-313.pyc
- package/tools/__pycache__/trace_logger.cpython-313.pyc