harness-evolver 2.1.0 → 2.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "harness-evolver",
-   "version": "2.1.0",
+   "version": "2.3.0",
    "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
    "author": "Raphael Valdetaro",
    "license": "MIT",
@@ -48,21 +48,13 @@ python3 $TOOLS/analyze_architecture.py \
    -o .harness-evolver/architecture_signals.json
  ```

- 3. Read the architect agent definition:
- ```bash
- cat ~/.claude/agents/harness-evolver-architect.md
- ```
-
- 4. Dispatch using the Agent tool — include the agent definition in the prompt:
+ 3. Dispatch using the Agent tool with `subagent_type: "harness-evolver-architect"`:

  ```
  Agent(
+   subagent_type: "harness-evolver-architect",
    description: "Architect: topology analysis",
    prompt: |
-     <agent_instructions>
-     {paste the FULL content of harness-evolver-architect.md here}
-     </agent_instructions>
-
      <objective>
      Analyze the harness architecture and recommend the optimal multi-agent topology.
      {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
@@ -22,21 +22,13 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo

  1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).

- 2. Read the critic agent definition:
- ```bash
- cat ~/.claude/agents/harness-evolver-critic.md
- ```
-
- 3. Dispatch using the Agent tool — include the agent definition in the prompt:
+ 2. Dispatch using the Agent tool with `subagent_type: "harness-evolver-critic"`:

  ```
  Agent(
+   subagent_type: "harness-evolver-critic",
    description: "Critic: analyze eval quality",
    prompt: |
-     <agent_instructions>
-     {paste the FULL content of harness-evolver-critic.md here}
-     </agent_instructions>
-
      <objective>
      Analyze eval quality for this harness evolution project.
      The best version is {version} with score {score} achieved in {iterations} iteration(s).
@@ -52,49 +52,105 @@ LS_PROJECT=$(langsmith-cli --json projects list --name-pattern "harness-evolver*

  If `LS_PROJECT` is empty, langsmith-cli is not available or no projects exist — skip to step 2.

- **Step 2: Gather traces from the discovered project**
+ **Step 2: Gather raw traces from the discovered project**

  ```bash
  if [ -n "$LS_PROJECT" ]; then
-   langsmith-cli --json runs list --project "$LS_PROJECT" --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
-   langsmith-cli --json runs list --project "$LS_PROJECT" --fields id,name,inputs,outputs,latency_ms,total_tokens --limit 20 > .harness-evolver/langsmith_runs.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_runs.json
+   langsmith-cli --json runs list --project "$LS_PROJECT" --recent --fields id,name,inputs,outputs,error,total_tokens --limit 30 > /tmp/langsmith_raw.json 2>/dev/null || echo "[]" > /tmp/langsmith_raw.json
    langsmith-cli --json runs stats --project "$LS_PROJECT" > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
    echo "$LS_PROJECT" > .harness-evolver/langsmith_project.txt
  else
-   echo "[]" > .harness-evolver/langsmith_diagnosis.json
+   echo "[]" > /tmp/langsmith_raw.json
    echo "{}" > .harness-evolver/langsmith_stats.json
  fi
  ```

- These files are included in the proposer's `<files_to_read>` so it has real trace data for diagnosis.
+ **Step 3: Process raw LangSmith data into a readable format for proposers**
+
+ The raw langsmith data has LangChain-serialized messages that are hard to read. Process it into a clean summary:
+
+ ```bash
+ python3 -c "
+ import json, sys
+
+ raw = json.load(open('/tmp/langsmith_raw.json'))
+ if not raw:
+     json.dump([], open('.harness-evolver/langsmith_runs.json', 'w'))
+     sys.exit(0)
+
+ clean = []
+ for r in raw:
+     entry = {'name': r.get('name', '?'), 'tokens': r.get('total_tokens', 0), 'error': r.get('error')}
+
+     # Extract readable prompt from LangChain serialized inputs
+     inputs = r.get('inputs', {})
+     if isinstance(inputs, dict) and 'messages' in inputs:
+         msgs = inputs['messages']
+         for msg_group in (msgs if isinstance(msgs, list) else [msgs]):
+             for msg in (msg_group if isinstance(msg_group, list) else [msg_group]):
+                 if isinstance(msg, dict):
+                     kwargs = msg.get('kwargs', msg)
+                     content = kwargs.get('content', '')
+                     msg_type = msg.get('id', ['','','',''])[3] if isinstance(msg.get('id'), list) else 'unknown'
+                     if 'Human' in str(msg_type) or 'user' in str(msg_type).lower():
+                         entry['user_message'] = str(content)[:300]
+                     elif 'System' in str(msg_type):
+                         entry['system_prompt_preview'] = str(content)[:200]
+
+     # Extract readable output
+     outputs = r.get('outputs', {})
+     if isinstance(outputs, dict) and 'generations' in outputs:
+         gens = outputs['generations']
+         if gens and isinstance(gens, list) and gens[0]:
+             gen = gens[0][0] if isinstance(gens[0], list) else gens[0]
+             if isinstance(gen, dict):
+                 msg = gen.get('message', gen)
+                 if isinstance(msg, dict):
+                     kwargs = msg.get('kwargs', msg)
+                     entry['llm_response'] = str(kwargs.get('content', ''))[:300]
+
+     clean.append(entry)
+
+ json.dump(clean, open('.harness-evolver/langsmith_runs.json', 'w'), indent=2, ensure_ascii=False)
+ print(f'Processed {len(clean)} LangSmith runs into readable format')
+ " 2>/dev/null || echo "[]" > .harness-evolver/langsmith_runs.json
+ ```
+
+ The resulting `langsmith_runs.json` has clean, readable entries:
+ ```json
+ [
+   {
+     "name": "ChatGoogleGenerativeAI",
+     "tokens": 1332,
+     "error": null,
+     "user_message": "Analise este texto: Bom dia pessoal...",
+     "system_prompt_preview": "Você é um moderador de conteúdo...",
+     "llm_response": "{\"categories\": [\"safe\"], \"severity\": \"safe\"...}"
+   }
+ ]
+ ```
+
+ These files are included in the proposer's `<files_to_read>` so it has readable trace data for diagnosis.

  ### 2. Propose (3 parallel candidates)

  Spawn 3 proposer agents IN PARALLEL, each with a different evolutionary strategy.
  This follows the DGM/AlphaEvolve pattern: exploit + explore + crossover.

- First, read the proposer agent definition:
- ```bash
- cat ~/.claude/agents/harness-evolver-proposer.md
- ```
-
- Then determine parents for each strategy:
+ Determine parents for each strategy:
  - **Exploiter parent**: current best version (from summary.json `best.version`)
  - **Explorer parent**: a non-best version with low offspring count (read summary.json history, pick one that scored >0 but is NOT the best and has NOT been parent to many children)
  - **Crossover parents**: best version + a different high-scorer from a different lineage

- Spawn all 3 using the Agent tool. The first 2 use `run_in_background: true`, the 3rd blocks:
+ Spawn all 3 using the Agent tool with `subagent_type: "harness-evolver-proposer"`. The first 2 use `run_in_background: true`, the 3rd blocks:

  **Candidate A (Exploiter)** — `run_in_background: true`:
  ```
  Agent(
+   subagent_type: "harness-evolver-proposer",
    description: "Proposer A (exploit): targeted fix for {version}",
    run_in_background: true,
    prompt: |
-     <agent_instructions>
-     {FULL content of harness-evolver-proposer.md}
-     </agent_instructions>
-
      <strategy>
      APPROACH: exploitation
      You are the EXPLOITER. Make the SMALLEST, most targeted change that fixes
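The flattening logic in the Step 3 processing script above can be exercised in isolation. Below is a minimal sketch: `sample_run` is a fabricated LangChain-serialized run invented for illustration (real input comes from `/tmp/langsmith_raw.json`), and `flatten` mirrors the script's extraction steps with the defensive `isinstance` checks trimmed for brevity:

```python
import json

# Fabricated sample run, shaped like langsmith-cli --json output (illustrative only)
sample_run = {
    "name": "ChatGoogleGenerativeAI",
    "total_tokens": 1332,
    "error": None,
    "inputs": {
        "messages": [[
            {"id": ["langchain", "schema", "messages", "SystemMessage"],
             "kwargs": {"content": "You are a content moderator..."}},
            {"id": ["langchain", "schema", "messages", "HumanMessage"],
             "kwargs": {"content": "Analyze this text: Good morning everyone..."}},
        ]]
    },
    "outputs": {
        "generations": [[
            {"message": {"id": ["langchain", "schema", "messages", "AIMessage"],
                         "kwargs": {"content": '{"severity": "safe"}'}}}
        ]]
    },
}

def flatten(run):
    """Collapse one serialized run into a small readable dict."""
    entry = {"name": run.get("name", "?"),
             "tokens": run.get("total_tokens", 0),
             "error": run.get("error")}
    # Inputs: messages arrive as (possibly nested) lists of serialized message dicts
    msgs = run.get("inputs", {}).get("messages", [])
    for group in (msgs if isinstance(msgs, list) else [msgs]):
        for msg in (group if isinstance(group, list) else [group]):
            kwargs = msg.get("kwargs", msg)
            # The message class name is the 4th element of the "id" path
            msg_type = msg.get("id", ["", "", "", ""])[3]
            if "Human" in msg_type:
                entry["user_message"] = str(kwargs.get("content", ""))[:300]
            elif "System" in msg_type:
                entry["system_prompt_preview"] = str(kwargs.get("content", ""))[:200]
    # Outputs: first generation's message content is the LLM response
    gens = run.get("outputs", {}).get("generations", [])
    if gens and gens[0]:
        gen = gens[0][0] if isinstance(gens[0], list) else gens[0]
        msg = gen.get("message", gen)
        entry["llm_response"] = str(msg.get("kwargs", msg).get("content", ""))[:300]
    return entry

print(json.dumps(flatten(sample_run), indent=2))
```

The truncation limits (300/200 chars) match the script above; they keep `langsmith_runs.json` small enough to drop into a proposer prompt without blowing up its context.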
@@ -130,13 +186,10 @@ Agent(
  **Candidate B (Explorer)** — `run_in_background: true`:
  ```
  Agent(
+   subagent_type: "harness-evolver-proposer",
    description: "Proposer B (explore): bold change from {explorer_parent}",
    run_in_background: true,
    prompt: |
-     <agent_instructions>
-     {FULL content of harness-evolver-proposer.md}
-     </agent_instructions>
-
      <strategy>
      APPROACH: exploration
      You are the EXPLORER. Try a FUNDAMENTALLY DIFFERENT approach.
@@ -172,12 +225,9 @@ Agent(
  **Candidate C (Crossover)** — blocks (last one):
  ```
  Agent(
+   subagent_type: "harness-evolver-proposer",
    description: "Proposer C (crossover): combine {parent_a} + {parent_b}",
    prompt: |
-     <agent_instructions>
-     {FULL content of harness-evolver-proposer.md}
-     </agent_instructions>
-
      <strategy>
      APPROACH: crossover
      You are the CROSSOVER agent. Combine the STRENGTHS of two different versions:
@@ -269,21 +319,13 @@ python3 $TOOLS/evaluate.py run \

  For each evaluated candidate, read its scores.json. If `eval_type` is `"pending-judge"` (combined_score == -1), the eval was a passthrough and needs judge scoring.

- Read the judge agent definition:
- ```bash
- cat ~/.claude/agents/harness-evolver-judge.md
- ```
-
- Spawn judge subagent for EACH candidate that needs judging:
+ Spawn judge subagent with `subagent_type: "harness-evolver-judge"` for EACH candidate that needs judging:

  ```
  Agent(
+   subagent_type: "harness-evolver-judge",
    description: "Judge: score {version}{suffix} outputs",
    prompt: |
-     <agent_instructions>
-     {FULL content of harness-evolver-judge.md}
-     </agent_instructions>
-
      <objective>
      Score the outputs of harness version {version}{suffix} across all {N} tasks.
      </objective>
@@ -354,21 +396,13 @@ python3 $TOOLS/evaluate.py run \
    --timeout 60
  ```

- First read the critic agent definition:
- ```bash
- cat ~/.claude/agents/harness-evolver-critic.md
- ```
-
- Then dispatch:
+ Dispatch the critic agent:

  ```
  Agent(
+   subagent_type: "harness-evolver-critic",
    description: "Critic: analyze eval quality",
    prompt: |
-     <agent_instructions>
-     {paste the FULL content of harness-evolver-critic.md here}
-     </agent_instructions>
-
      <objective>
      EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
      Analyze the eval quality and propose a stricter eval.
@@ -431,21 +465,13 @@ python3 $TOOLS/analyze_architecture.py \
    -o .harness-evolver/architecture_signals.json
  ```

- First read the architect agent definition:
- ```bash
- cat ~/.claude/agents/harness-evolver-architect.md
- ```
-
- Then dispatch:
+ Dispatch the architect agent:

  ```
  Agent(
+   subagent_type: "harness-evolver-architect",
    description: "Architect: analyze topology after {stagnation/regression}",
    prompt: |
-     <agent_instructions>
-     {paste the FULL content of harness-evolver-architect.md here}
-     </agent_instructions>
-
      <objective>
      The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
      Analyze the harness architecture and recommend a topology change.
@@ -49,19 +49,12 @@ If NO eval exists:
  **Tasks** (`tasks/`): If test tasks exist, use them.

  If NO tasks exist:
- - Read the testgen agent definition:
-   ```bash
-   cat ~/.claude/agents/harness-evolver-testgen.md
-   ```
- - Spawn testgen subagent:
+ - Spawn testgen subagent with `subagent_type: "harness-evolver-testgen"`:
    ```
    Agent(
+     subagent_type: "harness-evolver-testgen",
      description: "TestGen: generate test cases for this project",
      prompt: |
-       <agent_instructions>
-       {FULL content of harness-evolver-testgen.md}
-       </agent_instructions>
-
        <objective>
        Generate 30 diverse test cases for this project. Write them to tasks/ directory.
        </objective>