harness-evolver 1.9.0 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,110 @@
+ ---
+ name: harness-evolver-judge
+ description: |
+   Use this agent to evaluate harness outputs using multi-dimensional LLM-as-judge scoring.
+   Spawned by the evolve skill when eval returns pending scores (eval_type=pending-judge).
+ tools: Read, Write, Bash, Grep, Glob
+ color: yellow
+ ---
+
+ # Harness Evolver — Judge Agent
+
+ You are an expert evaluator. Your job is to score harness outputs on multiple quality dimensions.
+
+ ## Bootstrap
+
+ If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
+ every file listed there before performing any other actions.
+
+ ## Return Protocol
+
+ When done, end your response with:
+
+ ## JUDGE COMPLETE
+ - **Tasks scored**: {N}
+ - **Combined score**: {score}
+ - **Dimensions**: accuracy={X}, completeness={X}, relevance={X}, no_hallucination={X}
+
+ ## Your Workflow
+
+ ### Phase 1: Load All Tasks and Outputs
+
+ Read the scores.json file (which has per_task entries with input/output but score=-1).
+ For each task, you have the input (what was asked) and the output (what the harness produced).
+
+ Also read the task files from eval/tasks/ to get any additional context (expected answers, metadata).
+
+ ### Phase 2: Score Each Task
+
+ For each task, evaluate the output on 4 dimensions (1-5 integer scale):
+
+ **1. Accuracy (weight 0.4)**
+ - 5: Perfectly correct, addresses the question precisely
+ - 4: Mostly correct, minor inaccuracies
+ - 3: Partially correct, significant gaps
+ - 2: Mostly incorrect, but shows some understanding
+ - 1: Completely wrong or irrelevant
+
+ **2. Completeness (weight 0.2)**
+ - 5: Covers all aspects of the question
+ - 4: Covers most aspects
+ - 3: Covers some aspects, misses important ones
+ - 2: Very incomplete
+ - 1: Barely addresses the question
+
+ **3. Relevance (weight 0.2)**
+ - 5: Entirely focused on the question
+ - 4: Mostly relevant with minor tangents
+ - 3: Somewhat relevant but includes irrelevant information
+ - 2: Mostly irrelevant
+ - 1: Completely off-topic
+
+ **4. No-hallucination (weight 0.2)**
+ - 5: All claims supported by context/facts
+ - 4: Minor unsupported details
+ - 3: Some fabricated information
+ - 2: Significant hallucination
+ - 1: Mostly fabricated
+
+ If the task has an `expected` field, use it as a reference for accuracy scoring.
+ If no `expected` field, judge based on the quality and correctness of the output alone.
+
+ ### Phase 3: Calculate Scores
+
+ For each task:
+ - Normalize each dimension: (score - 1) / 4 → 0.0 to 1.0
+ - Combined per-task score = accuracy*0.4 + completeness*0.2 + relevance*0.2 + no_hallucination*0.2
+
+ Overall combined_score = mean of all per-task combined scores.
+
+ ### Phase 4: Write scores.json
+
+ Overwrite `.harness-evolver/harnesses/{version}/scores.json` with:
+
+ ```json
+ {
+   "combined_score": 0.78,
+   "eval_type": "llm-judge",
+   "dimensions": {"accuracy": 0.85, "completeness": 0.72, "relevance": 0.80, "no_hallucination": 0.75},
+   "weights": {"accuracy": 0.4, "completeness": 0.2, "relevance": 0.2, "no_hallucination": 0.2},
+   "total_tasks": 30,
+   "per_task": {
+     "task_001": {
+       "score": 0.85,
+       "accuracy": 4,
+       "completeness": 3,
+       "relevance": 4,
+       "no_hallucination": 4,
+       "reasoning": "Brief explanation of scoring"
+     }
+   }
+ }
+ ```
+
+ ## Rules
+
+ 1. **Be consistent** — similar quality outputs should get similar scores across tasks
+ 2. **Be fair** — don't penalize for style/format if the content is correct
+ 3. **Be specific in reasoning** — cite what's wrong or right, don't just say "good" or "bad"
+ 4. **Don't score based on length** — a concise correct answer scores higher than a verbose wrong one
+ 5. **Handle edge cases** — empty output = score 1 on all dimensions; error output = score 1 on all dimensions
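The Phase 3 arithmetic above (normalize each 1-5 rubric score, then take the weighted sum) can be sketched as follows. This is an illustrative snippet, not code shipped in the package:

```python
# Illustrative sketch of the Phase 3 scoring math from the judge rubric.
WEIGHTS = {"accuracy": 0.4, "completeness": 0.2, "relevance": 0.2, "no_hallucination": 0.2}

def normalize(raw):
    """Map a 1-5 rubric score onto 0.0-1.0, clamping out-of-range values."""
    return (max(1, min(5, raw)) - 1) / 4

def combined(dims):
    """Weighted sum of normalized dimension scores for one task."""
    return sum(normalize(dims[d]) * w for d, w in WEIGHTS.items())

# Example: one task scored 4/3/4/4 on the rubric above.
task = {"accuracy": 4, "completeness": 3, "relevance": 4, "no_hallucination": 4}
per_task = [combined(task)]
overall = sum(per_task) / len(per_task)  # mean over all tasks
```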
@@ -0,0 +1,97 @@
+ ---
+ name: harness-evolver-testgen
+ description: |
+   Use this agent to generate synthetic test cases from harness source code analysis.
+   Spawned by the init skill when no test cases exist in the project.
+ tools: Read, Write, Bash, Glob, Grep
+ color: cyan
+ ---
+
+ # Harness Evolver — Test Generation Agent
+
+ You are a test case generator. Your job is to read the harness source code, understand its domain, and generate diverse, challenging test cases.
+
+ ## Bootstrap
+
+ If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
+ every file listed there before performing any other actions.
+
+ ## Return Protocol
+
+ When done, end your response with:
+
+ ## TESTGEN COMPLETE
+ - **Tasks generated**: {N}
+ - **Categories covered**: {list}
+ - **Distribution**: {N} standard, {N} edge, {N} cross-domain, {N} adversarial
+
+ ## Your Workflow
+
+ ### Phase 1: Understand the Domain
+
+ Read the harness source code to understand:
+ - What kind of agent is this? (Q&A bot, RAG, classifier, coding agent, etc.)
+ - What format does it expect for inputs?
+ - What categories/topics does it cover?
+ - What are its likely failure modes?
+ - Are there any data files (knowledge bases, docs, etc.) that define the domain?
+
+ ### Phase 2: Design Test Distribution
+
+ Plan 30 test cases with this distribution:
+ - **40% Standard** (12 tasks): typical, well-formed inputs representative of the domain
+ - **20% Edge Cases** (6 tasks): boundary conditions, minimal inputs, unusual but valid
+ - **20% Cross-Domain** (6 tasks): inputs spanning multiple categories or requiring nuanced judgment
+ - **20% Adversarial** (6 tasks): misleading, ambiguous, or designed to expose weaknesses
+
+ Ensure all categories/topics from the harness are covered.
+
+ ### Phase 3: Generate Tasks
+
+ Create each task as a JSON file in the tasks/ directory.
+
+ Format (WITHOUT expected — for LLM-as-judge eval):
+ ```json
+ {
+   "id": "task_001",
+   "input": "The actual question or request",
+   "metadata": {
+     "difficulty": "easy|medium|hard",
+     "category": "the domain category",
+     "type": "standard|edge|cross_domain|adversarial"
+   }
+ }
+ ```
+
+ Format (WITH expected — when using keyword eval):
+ ```json
+ {
+   "id": "task_001",
+   "input": "The actual question or request",
+   "expected": "The expected answer or key phrases",
+   "metadata": {
+     "difficulty": "easy|medium|hard",
+     "category": "the domain category",
+     "type": "standard|edge|cross_domain|adversarial"
+   }
+ }
+ ```
+
+ Use the Write tool to create each file. Name them task_001.json through task_030.json.
+
+ ### Phase 4: Validate
+
+ After generating all tasks:
+ - Verify each file is valid JSON
+ - Verify all IDs are unique
+ - Verify the distribution matches the target (40/20/20/20)
+ - Verify all domain categories are represented
+
+ ## Rules
+
+ 1. **Inputs must be realistic** — questions a real user would ask, not synthetic-sounding
+ 2. **Vary phrasing** — don't use the same sentence structure repeatedly
+ 3. **Include some hard questions** — questions that require reasoning, not just lookup
+ 4. **Include out-of-scope questions** — 2-3 questions the agent should NOT be able to answer
+ 5. **Test failure modes** — ambiguous questions, misspellings, multi-part questions
+ 6. **Use the domain's language** — if the harness handles Portuguese, write inputs in Portuguese
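The Phase 4 validation checks above (valid JSON, unique IDs, distribution counts) can be sketched in Python. The `tasks/` layout and the 12/6/6/6 target follow the description above; this is an illustrative sketch, not code shipped in the package:

```python
# Illustrative validator for generated task files (not part of the package).
import json
import os
from collections import Counter

TARGET = {"standard": 12, "edge": 6, "cross_domain": 6, "adversarial": 6}

def validate_tasks(tasks_dir):
    """Check JSON validity and ID uniqueness; return per-type counts."""
    ids = set()
    types = Counter()
    for fname in sorted(os.listdir(tasks_dir)):
        if not fname.endswith(".json"):
            continue
        with open(os.path.join(tasks_dir, fname)) as f:
            task = json.load(f)  # raises json.JSONDecodeError on invalid JSON
        if task["id"] in ids:
            raise ValueError(f"duplicate id: {task['id']}")
        ids.add(task["id"])
        types[task["metadata"]["type"]] += 1
    return dict(types)  # caller compares against TARGET
```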
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "harness-evolver",
-   "version": "1.9.0",
+   "version": "2.1.0",
    "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
    "author": "Raphael Valdetaro",
    "license": "MIT",
@@ -57,6 +57,7 @@ If `LS_PROJECT` is empty, langsmith-cli is not available or no projects exist
  ```bash
  if [ -n "$LS_PROJECT" ]; then
    langsmith-cli --json runs list --project "$LS_PROJECT" --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
+   langsmith-cli --json runs list --project "$LS_PROJECT" --fields id,name,inputs,outputs,latency_ms,total_tokens --limit 20 > .harness-evolver/langsmith_runs.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_runs.json
    langsmith-cli --json runs stats --project "$LS_PROJECT" > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
    echo "$LS_PROJECT" > .harness-evolver/langsmith_project.txt
  else
@@ -115,6 +116,7 @@ Agent(
  - .harness-evolver/harnesses/{best_version}/proposal.md
  - .harness-evolver/langsmith_diagnosis.json (if exists)
  - .harness-evolver/langsmith_stats.json (if exists)
+ - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>

@@ -156,6 +158,7 @@ Agent(
  - .harness-evolver/harnesses/{explorer_parent}/harness.py
  - .harness-evolver/harnesses/{explorer_parent}/scores.json
  - .harness-evolver/langsmith_diagnosis.json (if exists)
+ - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>

@@ -196,6 +199,7 @@ Agent(
  - .harness-evolver/harnesses/{parent_b}/harness.py
  - .harness-evolver/harnesses/{parent_b}/scores.json
  - .harness-evolver/langsmith_diagnosis.json (if exists)
+ - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>

@@ -206,15 +210,41 @@ Agent(
  )
  ```

- Wait for all 3 to complete. The background agents will notify when done.
+ **Also spawn these additional candidates:**

- **Special case — iteration 1**: Only the exploiter and explorer can run (no second parent for crossover yet). Spawn 2 agents: exploiter (from baseline) and explorer (also from baseline but with bold strategy). Skip crossover.
+ **Candidate D (Prompt Specialist)** — `run_in_background: true`:
+ Same as Exploiter but with a different focus:
+ ```
+ <strategy>
+ APPROACH: prompt-engineering
+ You are the PROMPT SPECIALIST. Focus ONLY on improving the system prompt,
+ few-shot examples, output format instructions, and prompt structure.
+ Do NOT change the retrieval logic, pipeline structure, or code architecture.
+ </strategy>
+ ```
+ Output to: `.harness-evolver/harnesses/{version}d/`
+
+ **Candidate E (Data/Retrieval Specialist)** — `run_in_background: true`:
+ ```
+ <strategy>
+ APPROACH: retrieval-optimization
+ You are the RETRIEVAL SPECIALIST. Focus ONLY on improving how data is
+ retrieved, filtered, ranked, and presented to the LLM.
+ Do NOT change the system prompt text or output formatting.
+ Improve: search logic, relevance scoring, cross-domain retrieval, chunking.
+ </strategy>
+ ```
+ Output to: `.harness-evolver/harnesses/{version}e/`
+
+ Wait for all 5 to complete. The background agents will notify when done.

- **Special case iteration 2+**: All 3 strategies. Explorer parent = fitness-weighted random from history excluding current best.
+ **Minimum 3 candidates ALWAYS, even on iteration 1.** On iteration 1, the crossover agent uses baseline as both parents, but with the instruction to "combine the best retrieval strategy with the best prompt strategy from your analysis of the baseline." On iteration 2+, crossover uses two genuinely different parents.
+
+ **On iteration 3+**: If scores are improving, keep all 5 strategies. If stagnating, replace Candidate D with a "Radical" strategy that rewrites the harness from scratch.

  ### 3. Validate All Candidates

- For each candidate (a, b, c):
+ For each candidate (a, b, c, d, e):
  ```bash
  python3 $TOOLS/evaluate.py validate --harness .harness-evolver/harnesses/{version}{suffix}/harness.py --config .harness-evolver/harnesses/{version}{suffix}/config.json
  ```
@@ -235,6 +265,44 @@ python3 $TOOLS/evaluate.py run \
  --timeout 60
  ```

+ ### 4.5. Judge (if eval returned pending scores)
+
+ For each evaluated candidate, read its scores.json. If `eval_type` is `"pending-judge"` (combined_score == -1), the eval was a passthrough and needs judge scoring.
+
+ Read the judge agent definition:
+ ```bash
+ cat ~/.claude/agents/harness-evolver-judge.md
+ ```
+
+ Spawn a judge subagent for EACH candidate that needs judging:
+
+ ```
+ Agent(
+   description: "Judge: score {version}{suffix} outputs",
+   prompt: |
+     <agent_instructions>
+     {FULL content of harness-evolver-judge.md}
+     </agent_instructions>
+
+     <objective>
+     Score the outputs of harness version {version}{suffix} across all {N} tasks.
+     </objective>
+
+     <files_to_read>
+     - .harness-evolver/harnesses/{version}{suffix}/scores.json
+     - .harness-evolver/eval/tasks/ (read all task files)
+     </files_to_read>
+
+     <output>
+     Overwrite .harness-evolver/harnesses/{version}{suffix}/scores.json with real scores.
+     </output>
+ )
+ ```
+
+ Wait for `## JUDGE COMPLETE`.
+
+ If eval_type is NOT "pending-judge", eval.py already produced real scores — skip this step.
+
  ### 5. Select Winner + Update State

  Compare scores of all evaluated candidates. The winner is the one with highest combined_score.
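The pending-judge check in step 4.5 amounts to a small probe of each candidate's scores.json. A minimal sketch, assuming the file layout described above (the helper name is illustrative, not part of the package):

```python
# Illustrative check for passthrough evals awaiting judge scoring.
import json

def needs_judge(scores_path):
    """True if scores.json came from the passthrough eval and awaits the judge."""
    with open(scores_path) as f:
        scores = json.load(f)
    # The passthrough eval marks itself with eval_type=pending-judge and combined_score=-1.
    return scores.get("eval_type") == "pending-judge" or scores.get("combined_score") == -1
```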
@@ -36,9 +36,49 @@ Three artifacts needed. For each — use existing if found, create if not.

  **Harness** (`harness.py`): If user's entry point doesn't match our CLI interface (`--input`, `--output`, `--traces-dir`, `--config`), create a thin wrapper that imports their code. Read their entry point first to understand the I/O format. Ask if unsure.

- **Eval** (`eval.py`): Ask the user what "correct" means for their domain. Generate the simplest eval that gives signal. Even rough scoring works — the evolver iterates.
-
- **Tasks** (`tasks/`): If no test data exists, ask the user for 5-10 example input/output pairs. Each task is `{"id": "task_001", "input": "...", "expected": "...", "metadata": {}}`.
+ **Eval** (`eval.py`): If an eval script exists, use it.
+
+ If NO eval exists:
+ - Copy `eval_passthrough.py` from `$TOOLS/eval_passthrough.py` as the project's eval.py:
+   ```bash
+   cp $TOOLS/eval_passthrough.py eval.py
+   ```
+ - This passthrough eval collects outputs for the judge subagent to score during evolve.
+ - Print: "No eval found. Using LLM-as-judge (Claude Code scores outputs directly)."
+
+ **Tasks** (`tasks/`): If test tasks exist, use them.
+
+ If NO tasks exist:
+ - Read the testgen agent definition:
+   ```bash
+   cat ~/.claude/agents/harness-evolver-testgen.md
+   ```
+ - Spawn a testgen subagent:
+   ```
+   Agent(
+     description: "TestGen: generate test cases for this project",
+     prompt: |
+       <agent_instructions>
+       {FULL content of harness-evolver-testgen.md}
+       </agent_instructions>
+
+       <objective>
+       Generate 30 diverse test cases for this project. Write them to the tasks/ directory.
+       </objective>
+
+       <files_to_read>
+       - {harness source file path}
+       - {any data files found in the project}
+       </files_to_read>
+
+       <output>
+       Create the tasks/ directory with task_001.json through task_030.json.
+       No expected field needed (the judge subagent will score outputs).
+       </output>
+   )
+   ```
+ - Wait for `## TESTGEN COMPLETE`.
+ - Print: "Generated {N} test cases from code analysis."

  ## Phase 3: Run Init

@@ -0,0 +1,233 @@
+ #!/usr/bin/env python3
+ """LLM-as-judge evaluation script for Harness Evolver.
+
+ Scores harness outputs using an LLM judge across multiple quality dimensions:
+ accuracy, completeness, relevance, no_hallucination.
+
+ CLI interface matches existing evals: --results-dir, --tasks-dir, --scores.
+ Stdlib-only. No external dependencies.
+ """
+
+ import argparse
+ import json
+ import os
+ import re
+ import sys
+
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+ from llm_api import detect_provider, call_llm
+
+ DIMENSIONS = ["accuracy", "completeness", "relevance", "no_hallucination"]
+
+ WEIGHTS = {
+     "accuracy": 0.4,
+     "completeness": 0.2,
+     "relevance": 0.2,
+     "no_hallucination": 0.2,
+ }
+
+
+ def build_judge_prompt(task, result):
+     """Build the evaluation prompt for the LLM judge."""
+     prompt_parts = [
+         "You are an expert evaluator. Assess the quality of the following output.",
+         "",
+         "QUESTION/INPUT:",
+         str(task.get("input", "")),
+         "",
+         "OUTPUT TO EVALUATE:",
+         str(result.get("output", "")),
+     ]
+
+     if "expected" in task:
+         prompt_parts.extend([
+             "",
+             "REFERENCE ANSWER:",
+             str(task["expected"]),
+         ])
+
+     prompt_parts.extend([
+         "",
+         "Score each dimension from 1 (worst) to 5 (best):",
+         "- accuracy: Is the output factually correct and properly addresses the input?",
+         "- completeness: Does it cover all relevant aspects?",
+         "- relevance: Is it focused and on-topic?",
+         "- no_hallucination: Does it avoid fabricating information not supported by context?",
+         "",
+         "Think step by step, then respond with ONLY this JSON:",
+         '{"reasoning": "your analysis", "accuracy": N, "completeness": N, "relevance": N, "no_hallucination": N}',
+     ])
+
+     return "\n".join(prompt_parts)
+
+
+ def extract_json_scores(response):
+     """Extract scoring JSON from LLM response. Handles fenced and bare JSON."""
+     # Try direct parse
+     try:
+         data = json.loads(response.strip())
+         if "accuracy" in data:
+             return data
+     except (json.JSONDecodeError, ValueError):
+         pass
+
+     # Try extracting from markdown fences
+     fence_match = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', response, re.DOTALL)
+     if fence_match:
+         try:
+             data = json.loads(fence_match.group(1))
+             if "accuracy" in data:
+                 return data
+         except (json.JSONDecodeError, ValueError):
+             pass
+
+     # Try regex extraction for JSON with accuracy key
+     json_match = re.search(r'\{[^{}]*"accuracy"\s*:\s*\d[^{}]*\}', response)
+     if json_match:
+         try:
+             data = json.loads(json_match.group(0))
+             if "accuracy" in data:
+                 return data
+         except (json.JSONDecodeError, ValueError):
+             pass
+
+     return None
+
+
+ def normalize_score(raw_score):
+     """Normalize a 1-5 score to 0.0-1.0 range."""
+     clamped = max(1, min(5, int(raw_score)))
+     return (clamped - 1) / 4.0
+
+
+ def compute_combined_score(scores_dict):
+     """Compute weighted combined score from normalized dimension scores."""
+     total = 0.0
+     for dim in DIMENSIONS:
+         total += scores_dict.get(dim, 0.0) * WEIGHTS[dim]
+     return total
+
+
+ def evaluate_task(provider, api_key, model, task, result):
+     """Evaluate a single task with the LLM judge. Returns per-task score dict."""
+     prompt = build_judge_prompt(task, result)
+
+     try:
+         response = call_llm(provider, api_key, model, prompt, max_tokens=2048)
+     except Exception as e:
+         return {
+             "score": 0.0,
+             "accuracy": 1, "completeness": 1, "relevance": 1, "no_hallucination": 1,
+             "reasoning": f"LLM call failed: {e}",
+             "error": str(e),
+         }
+
+     parsed = extract_json_scores(response)
+     if parsed is None:
+         return {
+             "score": 0.0,
+             "accuracy": 1, "completeness": 1, "relevance": 1, "no_hallucination": 1,
+             "reasoning": f"Failed to parse judge response: {response[:200]}",
+             "error": "parse_failed",
+         }
+
+     # Extract raw scores
+     raw = {}
+     normalized = {}
+     for dim in DIMENSIONS:
+         raw[dim] = parsed.get(dim, 1)
+         normalized[dim] = normalize_score(raw[dim])
+
+     combined = compute_combined_score(normalized)
+
+     return {
+         "score": round(combined, 4),
+         "accuracy": raw["accuracy"],
+         "completeness": raw["completeness"],
+         "relevance": raw["relevance"],
+         "no_hallucination": raw["no_hallucination"],
+         "reasoning": parsed.get("reasoning", ""),
+     }
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="LLM-as-judge evaluation")
+     parser.add_argument("--results-dir", required=True,
+                         help="Directory with harness output JSON files")
+     parser.add_argument("--tasks-dir", required=True,
+                         help="Directory with task JSON files")
+     parser.add_argument("--scores", required=True,
+                         help="Output path for scores JSON")
+     args = parser.parse_args()
+
+     # Detect LLM provider
+     provider, api_key, model = detect_provider()
+
+     # Collect tasks
+     task_files = sorted(f for f in os.listdir(args.tasks_dir) if f.endswith(".json"))
+     if not task_files:
+         print(f"FAIL: no .json task files in {args.tasks_dir}", file=sys.stderr)
+         sys.exit(1)
+
+     per_task = {}
+     dimension_totals = {dim: 0.0 for dim in DIMENSIONS}
+     total_combined = 0.0
+     total_tasks = 0
+
+     for task_file in task_files:
+         # Load task
+         task_path = os.path.join(args.tasks_dir, task_file)
+         with open(task_path) as f:
+             task = json.load(f)
+         task_id = task["id"]
+
+         # Load result
+         result_path = os.path.join(args.results_dir, task_file)
+         if os.path.exists(result_path):
+             with open(result_path) as f:
+                 result = json.load(f)
+         else:
+             result = {"id": task_id, "output": "", "error": "no output file"}
+
+         # Evaluate
+         task_scores = evaluate_task(provider, api_key, model, task, result)
+         per_task[task_id] = task_scores
+
+         # Accumulate
+         total_combined += task_scores["score"]
+         for dim in DIMENSIONS:
+             dimension_totals[dim] += normalize_score(task_scores[dim])
+         total_tasks += 1
+
+     # Compute averages
+     if total_tasks > 0:
+         combined_score = round(total_combined / total_tasks, 4)
+         avg_dimensions = {
+             dim: round(dimension_totals[dim] / total_tasks, 4) for dim in DIMENSIONS
+         }
+     else:
+         combined_score = 0.0
+         avg_dimensions = {dim: 0.0 for dim in DIMENSIONS}
+
+     scores = {
+         "combined_score": combined_score,
+         "eval_type": "llm-judge",
+         "judge_provider": provider,
+         "judge_model": model,
+         "dimensions": avg_dimensions,
+         "weights": WEIGHTS,
+         "total_tasks": total_tasks,
+         "per_task": per_task,
+     }
+
+     # Write scores
+     os.makedirs(os.path.dirname(os.path.abspath(args.scores)), exist_ok=True)
+     with open(args.scores, "w") as f:
+         json.dump(scores, f, indent=2)
+
+     print(f"LLM judge evaluation complete. combined_score: {combined_score} "
+           f"({total_tasks} tasks, provider: {provider}/{model})")
+
+
+ if __name__ == "__main__":
+     main()
@@ -0,0 +1,55 @@
+ #!/usr/bin/env python3
+ """Passthrough eval — collects outputs for judge subagent scoring.
+
+ When no custom eval.py exists, this is used as the default. It does NOT score
+ outputs — it collects them and marks them for the judge subagent to evaluate.
+ The evolve skill detects eval_type=pending-judge and spawns the judge agent.
+ """
+
+ import argparse
+ import json
+ import os
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--results-dir", required=True)
+     parser.add_argument("--tasks-dir", required=True)
+     parser.add_argument("--scores", required=True)
+     args = parser.parse_args()
+
+     per_task = {}
+     for fname in sorted(os.listdir(args.tasks_dir)):
+         if not fname.endswith(".json"):
+             continue
+         with open(os.path.join(args.tasks_dir, fname)) as f:
+             task = json.load(f)
+         task_id = task["id"]
+
+         result_path = os.path.join(args.results_dir, fname)
+         output = ""
+         if os.path.exists(result_path):
+             with open(result_path) as f:
+                 result = json.load(f)
+             output = str(result.get("output", ""))
+
+         per_task[task_id] = {
+             "score": -1,
+             "input": str(task.get("input", ""))[:500],
+             "output": output[:500],
+         }
+
+     scores = {
+         "combined_score": -1,
+         "eval_type": "pending-judge",
+         "total_tasks": len(per_task),
+         "per_task": per_task,
+     }
+     with open(args.scores, "w") as f:
+         json.dump(scores, f, indent=2)
+
+     print(f"Collected {len(per_task)} task outputs for judge scoring.")
+
+
+ if __name__ == "__main__":
+     main()
@@ -0,0 +1,125 @@
+ #!/usr/bin/env python3
+ """Shared LLM API calling utility. Stdlib-only (urllib).
+
+ Auto-detects the best available provider from environment variables.
+ Supports: Gemini, OpenAI, Anthropic, OpenRouter.
+ """
+
+ import json
+ import os
+ import time
+ from urllib.request import Request, urlopen
+ from urllib.error import HTTPError
+
+ PROVIDER_PRIORITY = [
+     ("GEMINI_API_KEY", "gemini", "gemini-2.5-flash"),
+     ("GOOGLE_API_KEY", "gemini", "gemini-2.5-flash"),
+     ("OPENROUTER_API_KEY", "openrouter", "google/gemini-2.5-flash"),
+     ("OPENAI_API_KEY", "openai", "gpt-4o-mini"),
+     ("ANTHROPIC_API_KEY", "anthropic", "claude-haiku-4-5-20251001"),
+ ]
+
+
+ def detect_provider():
+     """Auto-detect best available LLM provider from env vars.
+     Returns (provider_name, api_key, model) or raises RuntimeError."""
+     for env_var, provider, model in PROVIDER_PRIORITY:
+         key = os.environ.get(env_var, "")
+         if key:
+             return provider, key, model
+     raise RuntimeError(
+         "No LLM API key found. Set one of: " +
+         ", ".join(e for e, _, _ in PROVIDER_PRIORITY)
+     )
+
+
+ def call_llm(provider, api_key, model, prompt, max_tokens=4096, temperature=0.0):
+     """Call LLM API via urllib. Returns response text. Retries 3x with backoff."""
+     for attempt in range(3):
+         try:
+             if provider == "gemini":
+                 return _call_gemini(api_key, model, prompt, max_tokens, temperature)
+             elif provider == "openai":
+                 return _call_openai(api_key, model, prompt, max_tokens, temperature)
+             elif provider == "anthropic":
+                 return _call_anthropic(api_key, model, prompt, max_tokens, temperature)
+             elif provider == "openrouter":
+                 return _call_openrouter(api_key, model, prompt, max_tokens, temperature)
+             else:
+                 raise ValueError(f"Unknown provider: {provider}")
+         except ValueError:
+             raise
+         except Exception:
+             if attempt == 2:
+                 raise
+             time.sleep(2 ** attempt)
+     raise RuntimeError("All retries failed")
+
+
+ def _call_gemini(api_key, model, prompt, max_tokens, temperature):
+     url = (
+         f"https://generativelanguage.googleapis.com/v1beta/models/"
+         f"{model}:generateContent?key={api_key}"
+     )
+     body = json.dumps({
+         "contents": [{"parts": [{"text": prompt}]}],
+         "generationConfig": {
+             "maxOutputTokens": max_tokens,
+             "temperature": max(temperature, 0.0),
+         },
+     }).encode()
+     req = Request(url, data=body, headers={"Content-Type": "application/json"})
+     with urlopen(req, timeout=60) as resp:
+         data = json.loads(resp.read())
+     return data["candidates"][0]["content"]["parts"][0]["text"]
+
+
+ def _call_openai(api_key, model, prompt, max_tokens, temperature):
+     url = "https://api.openai.com/v1/chat/completions"
+     body = json.dumps({
+         "model": model,
+         "max_tokens": max_tokens,
+         "temperature": temperature,
+         "messages": [{"role": "user", "content": prompt}],
+     }).encode()
+     req = Request(url, data=body, headers={
+         "Content-Type": "application/json",
+         "Authorization": f"Bearer {api_key}",
+     })
+     with urlopen(req, timeout=60) as resp:
+         data = json.loads(resp.read())
+     return data["choices"][0]["message"]["content"]
+
+
+ def _call_anthropic(api_key, model, prompt, max_tokens, temperature):
+     url = "https://api.anthropic.com/v1/messages"
+     body = json.dumps({
+         "model": model,
+         "max_tokens": max_tokens,
+         "messages": [{"role": "user", "content": prompt}],
+     }).encode()
+     req = Request(url, data=body, headers={
+         "Content-Type": "application/json",
+         "x-api-key": api_key,
+         "anthropic-version": "2023-06-01",
+     })
+     with urlopen(req, timeout=60) as resp:
+         data = json.loads(resp.read())
+     return data["content"][0]["text"]
+
+
+ def _call_openrouter(api_key, model, prompt, max_tokens, temperature):
+     url = "https://openrouter.ai/api/v1/chat/completions"
+     body = json.dumps({
+         "model": model,
+         "max_tokens": max_tokens,
+         "temperature": temperature,
+         "messages": [{"role": "user", "content": prompt}],
+     }).encode()
+     req = Request(url, data=body, headers={
+         "Content-Type": "application/json",
+         "Authorization": f"Bearer {api_key}",
+     })
+     with urlopen(req, timeout=60) as resp:
+         data = json.loads(resp.read())
+     return data["choices"][0]["message"]["content"]