harness-evolver 1.9.0 → 2.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,110 @@
+ ---
+ name: harness-evolver-judge
+ description: |
+   Use this agent to evaluate harness outputs using multi-dimensional LLM-as-judge scoring.
+   Spawned by the evolve skill when eval returns pending scores (eval_type=pending-judge).
+ tools: Read, Write, Bash, Grep, Glob
+ color: yellow
+ ---
+
+ # Harness Evolver — Judge Agent
+
+ You are an expert evaluator. Your job is to score harness outputs on multiple quality dimensions.
+
+ ## Bootstrap
+
+ If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
+ every file listed there before performing any other actions.
+
+ ## Return Protocol
+
+ When done, end your response with:
+
+ ## JUDGE COMPLETE
+ - **Tasks scored**: {N}
+ - **Combined score**: {score}
+ - **Dimensions**: accuracy={X}, completeness={X}, relevance={X}, no_hallucination={X}
+
+ ## Your Workflow
+
+ ### Phase 1: Load All Tasks and Outputs
+
+ Read the scores.json file (which has per_task entries with input/output but score=-1).
+ For each task, you have the input (what was asked) and the output (what the harness produced).
+
+ Also read the task files from eval/tasks/ to get any additional context (expected answers, metadata).
+
+ ### Phase 2: Score Each Task
+
+ For each task, evaluate the output on 4 dimensions (1-5 integer scale):
+
+ **1. Accuracy (weight 0.4)**
+ - 5: Perfectly correct, addresses the question precisely
+ - 4: Mostly correct, minor inaccuracies
+ - 3: Partially correct, significant gaps
+ - 2: Mostly incorrect, but shows some understanding
+ - 1: Completely wrong or irrelevant
+
+ **2. Completeness (weight 0.2)**
+ - 5: Covers all aspects of the question
+ - 4: Covers most aspects
+ - 3: Covers some aspects, misses important ones
+ - 2: Very incomplete
+ - 1: Barely addresses the question
+
+ **3. Relevance (weight 0.2)**
+ - 5: Entirely focused on the question
+ - 4: Mostly relevant with minor tangents
+ - 3: Somewhat relevant but includes irrelevant information
+ - 2: Mostly irrelevant
+ - 1: Completely off-topic
+
+ **4. No-hallucination (weight 0.2)**
+ - 5: All claims supported by context/facts
+ - 4: Minor unsupported details
+ - 3: Some fabricated information
+ - 2: Significant hallucination
+ - 1: Mostly fabricated
+
+ If the task has an `expected` field, use it as a reference for accuracy scoring.
+ If no `expected` field, judge based on the quality and correctness of the output alone.
+
+ ### Phase 3: Calculate Scores
+
+ For each task:
+ - Normalize each dimension: (score - 1) / 4 → 0.0 to 1.0
+ - Combined per-task score = accuracy*0.4 + completeness*0.2 + relevance*0.2 + no_hallucination*0.2
+
+ Overall combined_score = mean of all per-task combined scores.
+
+ ### Phase 4: Write scores.json
+
+ Overwrite `.harness-evolver/harnesses/{version}/scores.json` with:
+
+ ```json
+ {
+   "combined_score": 0.78,
+   "eval_type": "llm-judge",
+   "dimensions": {"accuracy": 0.85, "completeness": 0.72, "relevance": 0.80, "no_hallucination": 0.75},
+   "weights": {"accuracy": 0.4, "completeness": 0.2, "relevance": 0.2, "no_hallucination": 0.2},
+   "total_tasks": 30,
+   "per_task": {
+     "task_001": {
+       "score": 0.85,
+       "accuracy": 4,
+       "completeness": 3,
+       "relevance": 4,
+       "no_hallucination": 4,
+       "reasoning": "Brief explanation of scoring"
+     }
+   }
+ }
+ ```
+
+ ## Rules
+
+ 1. **Be consistent** — similar quality outputs should get similar scores across tasks
+ 2. **Be fair** — don't penalize for style/format if the content is correct
+ 3. **Be specific in reasoning** — cite what's wrong or right, don't just say "good" or "bad"
+ 4. **Don't score based on length** — a concise correct answer scores higher than a verbose wrong one
+ 5. **Handle edge cases** — empty output = score 1 on all dimensions; error output = score 1 on all dimensions
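The Phase 3 arithmetic above (normalize each 1-5 rubric score, then take the weighted sum) can be sketched as follows. This is an illustrative snippet, not code shipped in the package:

```python
# Illustrative sketch of the Phase 3 scoring math from the judge rubric.
WEIGHTS = {"accuracy": 0.4, "completeness": 0.2, "relevance": 0.2, "no_hallucination": 0.2}

def normalize(raw):
    """Map a 1-5 rubric score onto 0.0-1.0, clamping out-of-range values."""
    return (max(1, min(5, raw)) - 1) / 4

def combined(dims):
    """Weighted sum of normalized dimension scores for one task."""
    return sum(normalize(dims[d]) * w for d, w in WEIGHTS.items())

# Example: one task scored 4/3/4/4 on the rubric above.
task = {"accuracy": 4, "completeness": 3, "relevance": 4, "no_hallucination": 4}
per_task = [combined(task)]
overall = sum(per_task) / len(per_task)  # mean over all tasks
```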
@@ -0,0 +1,97 @@
+ ---
+ name: harness-evolver-testgen
+ description: |
+   Use this agent to generate synthetic test cases from harness source code analysis.
+   Spawned by the init skill when no test cases exist in the project.
+ tools: Read, Write, Bash, Glob, Grep
+ color: cyan
+ ---
+
+ # Harness Evolver — Test Generation Agent
+
+ You are a test case generator. Your job is to read the harness source code, understand its domain, and generate diverse, challenging test cases.
+
+ ## Bootstrap
+
+ If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
+ every file listed there before performing any other actions.
+
+ ## Return Protocol
+
+ When done, end your response with:
+
+ ## TESTGEN COMPLETE
+ - **Tasks generated**: {N}
+ - **Categories covered**: {list}
+ - **Distribution**: {N} standard, {N} edge, {N} cross-domain, {N} adversarial
+
+ ## Your Workflow
+
+ ### Phase 1: Understand the Domain
+
+ Read the harness source code to understand:
+ - What kind of agent is this? (Q&A bot, RAG, classifier, coding agent, etc.)
+ - What format does it expect for inputs?
+ - What categories/topics does it cover?
+ - What are its likely failure modes?
+ - Are there any data files (knowledge bases, docs, etc.) that define the domain?
+
+ ### Phase 2: Design Test Distribution
+
+ Plan 30 test cases with this distribution:
+ - **40% Standard** (12 tasks): typical, well-formed inputs representative of the domain
+ - **20% Edge Cases** (6 tasks): boundary conditions, minimal inputs, unusual but valid
+ - **20% Cross-Domain** (6 tasks): inputs spanning multiple categories or requiring nuanced judgment
+ - **20% Adversarial** (6 tasks): misleading, ambiguous, or designed to expose weaknesses
+
+ Ensure all categories/topics from the harness are covered.
+
+ ### Phase 3: Generate Tasks
+
+ Create each task as a JSON file in the tasks/ directory.
+
+ Format (WITHOUT expected — for LLM-as-judge eval):
+ ```json
+ {
+   "id": "task_001",
+   "input": "The actual question or request",
+   "metadata": {
+     "difficulty": "easy|medium|hard",
+     "category": "the domain category",
+     "type": "standard|edge|cross_domain|adversarial"
+   }
+ }
+ ```
+
+ Format (WITH expected — when using keyword eval):
+ ```json
+ {
+   "id": "task_001",
+   "input": "The actual question or request",
+   "expected": "The expected answer or key phrases",
+   "metadata": {
+     "difficulty": "easy|medium|hard",
+     "category": "the domain category",
+     "type": "standard|edge|cross_domain|adversarial"
+   }
+ }
+ ```
+
+ Use the Write tool to create each file. Name them task_001.json through task_030.json.
+
+ ### Phase 4: Validate
+
+ After generating all tasks:
+ - Verify each file is valid JSON
+ - Verify all IDs are unique
+ - Verify the distribution matches the target (40/20/20/20)
+ - Verify all domain categories are represented
+
+ ## Rules
+
+ 1. **Inputs must be realistic** — questions a real user would ask, not synthetic-sounding
+ 2. **Vary phrasing** — don't use the same sentence structure repeatedly
+ 3. **Include some hard questions** — questions that require reasoning, not just lookup
+ 4. **Include out-of-scope questions** — 2-3 questions the agent should NOT be able to answer
+ 5. **Test failure modes** — ambiguous questions, misspellings, multi-part questions
+ 6. **Use the domain's language** — if the harness handles Portuguese, write inputs in Portuguese
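The Phase 4 validation checks above (valid JSON, unique IDs, distribution counts) can be sketched in Python. The `tasks/` layout and the 12/6/6/6 target follow the description above; this is an illustrative sketch, not code shipped in the package:

```python
# Illustrative validator for generated task files (not part of the package).
import json
import os
from collections import Counter

TARGET = {"standard": 12, "edge": 6, "cross_domain": 6, "adversarial": 6}

def validate_tasks(tasks_dir):
    """Check JSON validity and ID uniqueness; return per-type counts."""
    ids = set()
    types = Counter()
    for fname in sorted(os.listdir(tasks_dir)):
        if not fname.endswith(".json"):
            continue
        with open(os.path.join(tasks_dir, fname)) as f:
            task = json.load(f)  # raises json.JSONDecodeError on invalid JSON
        if task["id"] in ids:
            raise ValueError(f"duplicate id: {task['id']}")
        ids.add(task["id"])
        types[task["metadata"]["type"]] += 1
    return dict(types)  # caller compares against TARGET
```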
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "harness-evolver",
-   "version": "1.9.0",
+   "version": "2.1.0",
    "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
    "author": "Raphael Valdetaro",
    "license": "MIT",
@@ -57,6 +57,7 @@ If `LS_PROJECT` is empty, langsmith-cli is not available or no projects exist
  ```bash
  if [ -n "$LS_PROJECT" ]; then
    langsmith-cli --json runs list --project "$LS_PROJECT" --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
+   langsmith-cli --json runs list --project "$LS_PROJECT" --fields id,name,inputs,outputs,latency_ms,total_tokens --limit 20 > .harness-evolver/langsmith_runs.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_runs.json
    langsmith-cli --json runs stats --project "$LS_PROJECT" > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
    echo "$LS_PROJECT" > .harness-evolver/langsmith_project.txt
  else
@@ -115,6 +116,7 @@ Agent(
  - .harness-evolver/harnesses/{best_version}/proposal.md
  - .harness-evolver/langsmith_diagnosis.json (if exists)
  - .harness-evolver/langsmith_stats.json (if exists)
+ - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>

@@ -156,6 +158,7 @@ Agent(
  - .harness-evolver/harnesses/{explorer_parent}/harness.py
  - .harness-evolver/harnesses/{explorer_parent}/scores.json
  - .harness-evolver/langsmith_diagnosis.json (if exists)
+ - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>

@@ -196,6 +199,7 @@ Agent(
  - .harness-evolver/harnesses/{parent_b}/harness.py
  - .harness-evolver/harnesses/{parent_b}/scores.json
  - .harness-evolver/langsmith_diagnosis.json (if exists)
+ - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>

@@ -206,15 +210,41 @@ Agent(
  )
  ```

- Wait for all 3 to complete. The background agents will notify when done.
+ **Also spawn these additional candidates:**

- **Special case — iteration 1**: Only the exploiter and explorer can run (no second parent for crossover yet). Spawn 2 agents: exploiter (from baseline) and explorer (also from baseline but with bold strategy). Skip crossover.
+ **Candidate D (Prompt Specialist)** — `run_in_background: true`:
+ Same as Exploiter but with a different focus:
+ ```
+ <strategy>
+ APPROACH: prompt-engineering
+ You are the PROMPT SPECIALIST. Focus ONLY on improving the system prompt,
+ few-shot examples, output format instructions, and prompt structure.
+ Do NOT change the retrieval logic, pipeline structure, or code architecture.
+ </strategy>
+ ```
+ Output to: `.harness-evolver/harnesses/{version}d/`
+
+ **Candidate E (Data/Retrieval Specialist)** — `run_in_background: true`:
+ ```
+ <strategy>
+ APPROACH: retrieval-optimization
+ You are the RETRIEVAL SPECIALIST. Focus ONLY on improving how data is
+ retrieved, filtered, ranked, and presented to the LLM.
+ Do NOT change the system prompt text or output formatting.
+ Improve: search logic, relevance scoring, cross-domain retrieval, chunking.
+ </strategy>
+ ```
+ Output to: `.harness-evolver/harnesses/{version}e/`
+
+ Wait for all 5 to complete. The background agents will notify when done.

- **Special case iteration 2+**: All 3 strategies. Explorer parent = fitness-weighted random from history excluding current best.
+ **Minimum 3 candidates ALWAYS, even on iteration 1.** On iteration 1, the crossover agent uses baseline as both parents, but with the instruction to "combine the best retrieval strategy with the best prompt strategy from your analysis of the baseline." On iteration 2+, crossover uses two genuinely different parents.
+
+ **On iteration 3+**: If scores are improving, keep all 5 strategies. If stagnating, replace Candidate D with a "Radical" strategy that rewrites the harness from scratch.

  ### 3. Validate All Candidates

- For each candidate (a, b, c):
+ For each candidate (a, b, c, d, e):
  ```bash
  python3 $TOOLS/evaluate.py validate --harness .harness-evolver/harnesses/{version}{suffix}/harness.py --config .harness-evolver/harnesses/{version}{suffix}/config.json
  ```
@@ -235,6 +265,44 @@ python3 $TOOLS/evaluate.py run \
  --timeout 60
  ```

+ ### 4.5. Judge (if eval returned pending scores)
+
+ For each evaluated candidate, read its scores.json. If `eval_type` is `"pending-judge"` (combined_score == -1), the eval was a passthrough and needs judge scoring.
+
+ Read the judge agent definition:
+ ```bash
+ cat ~/.claude/agents/harness-evolver-judge.md
+ ```
+
+ Spawn a judge subagent for EACH candidate that needs judging:
+
+ ```
+ Agent(
+   description: "Judge: score {version}{suffix} outputs",
+   prompt: |
+     <agent_instructions>
+     {FULL content of harness-evolver-judge.md}
+     </agent_instructions>
+
+     <objective>
+     Score the outputs of harness version {version}{suffix} across all {N} tasks.
+     </objective>
+
+     <files_to_read>
+     - .harness-evolver/harnesses/{version}{suffix}/scores.json
+     - .harness-evolver/eval/tasks/ (read all task files)
+     </files_to_read>
+
+     <output>
+     Overwrite .harness-evolver/harnesses/{version}{suffix}/scores.json with real scores.
+     </output>
+ )
+ ```
+
+ Wait for `## JUDGE COMPLETE`.
+
+ If eval_type is NOT "pending-judge", eval.py already produced real scores — skip this step.
+
  ### 5. Select Winner + Update State

  Compare scores of all evaluated candidates. The winner is the one with highest combined_score.
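The pending-judge check in step 4.5 amounts to a small probe of each candidate's scores.json. A minimal sketch, assuming the file layout described above (the helper name is illustrative, not part of the package):

```python
# Illustrative check for passthrough evals awaiting judge scoring.
import json

def needs_judge(scores_path):
    """True if scores.json came from the passthrough eval and awaits the judge."""
    with open(scores_path) as f:
        scores = json.load(f)
    # The passthrough eval marks itself with eval_type=pending-judge and combined_score=-1.
    return scores.get("eval_type") == "pending-judge" or scores.get("combined_score") == -1
```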
@@ -36,9 +36,49 @@ Three artifacts needed. For each — use existing if found, create if not.

  **Harness** (`harness.py`): If user's entry point doesn't match our CLI interface (`--input`, `--output`, `--traces-dir`, `--config`), create a thin wrapper that imports their code. Read their entry point first to understand the I/O format. Ask if unsure.

- **Eval** (`eval.py`): Ask the user what "correct" means for their domain. Generate the simplest eval that gives signal. Even rough scoring works — the evolver iterates.
-
- **Tasks** (`tasks/`): If no test data exists, ask the user for 5-10 example input/output pairs. Each task is `{"id": "task_001", "input": "...", "expected": "...", "metadata": {}}`.
+ **Eval** (`eval.py`): If an eval script exists, use it.
+
+ If NO eval exists:
+ - Copy `eval_passthrough.py` from `$TOOLS/eval_passthrough.py` as the project's eval.py:
+   ```bash
+   cp $TOOLS/eval_passthrough.py eval.py
+   ```
+ - This passthrough eval collects outputs for the judge subagent to score during evolve.
+ - Print: "No eval found. Using LLM-as-judge (Claude Code scores outputs directly)."
+
+ **Tasks** (`tasks/`): If test tasks exist, use them.
+
+ If NO tasks exist:
+ - Read the testgen agent definition:
+   ```bash
+   cat ~/.claude/agents/harness-evolver-testgen.md
+   ```
+ - Spawn a testgen subagent:
+   ```
+   Agent(
+     description: "TestGen: generate test cases for this project",
+     prompt: |
+       <agent_instructions>
+       {FULL content of harness-evolver-testgen.md}
+       </agent_instructions>
+
+       <objective>
+       Generate 30 diverse test cases for this project. Write them to the tasks/ directory.
+       </objective>
+
+       <files_to_read>
+       - {harness source file path}
+       - {any data files found in the project}
+       </files_to_read>
+
+       <output>
+       Create the tasks/ directory with task_001.json through task_030.json.
+       No expected field needed (the judge subagent will score outputs).
+       </output>
+   )
+   ```
+ - Wait for `## TESTGEN COMPLETE`.
+ - Print: "Generated {N} test cases from code analysis."

  ## Phase 3: Run Init

@@ -0,0 +1,233 @@
+ #!/usr/bin/env python3
+ """LLM-as-judge evaluation script for Harness Evolver.
+
+ Scores harness outputs using an LLM judge across multiple quality dimensions:
+ accuracy, completeness, relevance, no_hallucination.
+
+ CLI interface matches existing evals: --results-dir, --tasks-dir, --scores.
+ Stdlib-only. No external dependencies.
+ """
+
+ import argparse
+ import json
+ import os
+ import re
+ import sys
+
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+ from llm_api import detect_provider, call_llm
+
+ DIMENSIONS = ["accuracy", "completeness", "relevance", "no_hallucination"]
+
+ WEIGHTS = {
+     "accuracy": 0.4,
+     "completeness": 0.2,
+     "relevance": 0.2,
+     "no_hallucination": 0.2,
+ }
+
+
+ def build_judge_prompt(task, result):
+     """Build the evaluation prompt for the LLM judge."""
+     prompt_parts = [
+         "You are an expert evaluator. Assess the quality of the following output.",
+         "",
+         "QUESTION/INPUT:",
+         str(task.get("input", "")),
+         "",
+         "OUTPUT TO EVALUATE:",
+         str(result.get("output", "")),
+     ]
+
+     if "expected" in task:
+         prompt_parts.extend([
+             "",
+             "REFERENCE ANSWER:",
+             str(task["expected"]),
+         ])
+
+     prompt_parts.extend([
+         "",
+         "Score each dimension from 1 (worst) to 5 (best):",
+         "- accuracy: Is the output factually correct and properly addresses the input?",
+         "- completeness: Does it cover all relevant aspects?",
+         "- relevance: Is it focused and on-topic?",
+         "- no_hallucination: Does it avoid fabricating information not supported by context?",
+         "",
+         "Think step by step, then respond with ONLY this JSON:",
+         '{"reasoning": "your analysis", "accuracy": N, "completeness": N, "relevance": N, "no_hallucination": N}',
+     ])
+
+     return "\n".join(prompt_parts)
+
+
+ def extract_json_scores(response):
+     """Extract scoring JSON from LLM response. Handles fenced and bare JSON."""
+     # Try direct parse
+     try:
+         data = json.loads(response.strip())
+         if "accuracy" in data:
+             return data
+     except (json.JSONDecodeError, ValueError):
+         pass
+
+     # Try extracting from markdown fences
+     fence_match = re.search(r'```(?:json)?\s*(\{.*?\})\s*```', response, re.DOTALL)
+     if fence_match:
+         try:
+             data = json.loads(fence_match.group(1))
+             if "accuracy" in data:
+                 return data
+         except (json.JSONDecodeError, ValueError):
+             pass
+
+     # Try regex extraction for JSON with accuracy key
+     json_match = re.search(r'\{[^{}]*"accuracy"\s*:\s*\d[^{}]*\}', response)
+     if json_match:
+         try:
+             data = json.loads(json_match.group(0))
+             if "accuracy" in data:
+                 return data
+         except (json.JSONDecodeError, ValueError):
+             pass
+
+     return None
+
+
+ def normalize_score(raw_score):
+     """Normalize a 1-5 score to 0.0-1.0 range."""
+     clamped = max(1, min(5, int(raw_score)))
+     return (clamped - 1) / 4.0
+
+
+ def compute_combined_score(scores_dict):
+     """Compute weighted combined score from normalized dimension scores."""
+     total = 0.0
+     for dim in DIMENSIONS:
+         total += scores_dict.get(dim, 0.0) * WEIGHTS[dim]
+     return total
+
+
+ def evaluate_task(provider, api_key, model, task, result):
+     """Evaluate a single task with the LLM judge. Returns per-task score dict."""
+     prompt = build_judge_prompt(task, result)
+
+     try:
+         response = call_llm(provider, api_key, model, prompt, max_tokens=2048)
+     except Exception as e:
+         return {
+             "score": 0.0,
+             "accuracy": 1, "completeness": 1, "relevance": 1, "no_hallucination": 1,
+             "reasoning": f"LLM call failed: {e}",
+             "error": str(e),
+         }
+
+     parsed = extract_json_scores(response)
+     if parsed is None:
+         return {
+             "score": 0.0,
+             "accuracy": 1, "completeness": 1, "relevance": 1, "no_hallucination": 1,
+             "reasoning": f"Failed to parse judge response: {response[:200]}",
+             "error": "parse_failed",
+         }
+
+     # Extract raw scores
+     raw = {}
+     normalized = {}
+     for dim in DIMENSIONS:
+         raw[dim] = parsed.get(dim, 1)
+         normalized[dim] = normalize_score(raw[dim])
+
+     combined = compute_combined_score(normalized)
+
+     return {
+         "score": round(combined, 4),
+         "accuracy": raw["accuracy"],
+         "completeness": raw["completeness"],
+         "relevance": raw["relevance"],
+         "no_hallucination": raw["no_hallucination"],
+         "reasoning": parsed.get("reasoning", ""),
+     }
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="LLM-as-judge evaluation")
+     parser.add_argument("--results-dir", required=True,
+                         help="Directory with harness output JSON files")
+     parser.add_argument("--tasks-dir", required=True,
+                         help="Directory with task JSON files")
+     parser.add_argument("--scores", required=True,
+                         help="Output path for scores JSON")
+     args = parser.parse_args()
+
+     # Detect LLM provider
+     provider, api_key, model = detect_provider()
+
+     # Collect tasks
+     task_files = sorted(f for f in os.listdir(args.tasks_dir) if f.endswith(".json"))
+     if not task_files:
+         print(f"FAIL: no .json task files in {args.tasks_dir}", file=sys.stderr)
+         sys.exit(1)
+
+     per_task = {}
+     dimension_totals = {dim: 0.0 for dim in DIMENSIONS}
+     total_combined = 0.0
+     total_tasks = 0
+
+     for task_file in task_files:
+         # Load task
+         task_path = os.path.join(args.tasks_dir, task_file)
+         with open(task_path) as f:
+             task = json.load(f)
+         task_id = task["id"]
+
+         # Load result
+         result_path = os.path.join(args.results_dir, task_file)
+         if os.path.exists(result_path):
+             with open(result_path) as f:
+                 result = json.load(f)
+         else:
+             result = {"id": task_id, "output": "", "error": "no output file"}
+
+         # Evaluate
+         task_scores = evaluate_task(provider, api_key, model, task, result)
+         per_task[task_id] = task_scores
+
+         # Accumulate
+         total_combined += task_scores["score"]
+         for dim in DIMENSIONS:
+             dimension_totals[dim] += normalize_score(task_scores[dim])
+         total_tasks += 1
+
+     # Compute averages
+     if total_tasks > 0:
+         combined_score = round(total_combined / total_tasks, 4)
+         avg_dimensions = {
+             dim: round(dimension_totals[dim] / total_tasks, 4) for dim in DIMENSIONS
+         }
+     else:
+         combined_score = 0.0
+         avg_dimensions = {dim: 0.0 for dim in DIMENSIONS}
+
+     scores = {
+         "combined_score": combined_score,
+         "eval_type": "llm-judge",
+         "judge_provider": provider,
+         "judge_model": model,
+         "dimensions": avg_dimensions,
+         "weights": WEIGHTS,
+         "total_tasks": total_tasks,
+         "per_task": per_task,
+     }
+
+     # Write scores
+     os.makedirs(os.path.dirname(os.path.abspath(args.scores)), exist_ok=True)
+     with open(args.scores, "w") as f:
+         json.dump(scores, f, indent=2)
+
+     print(f"LLM judge evaluation complete. combined_score: {combined_score} "
+           f"({total_tasks} tasks, provider: {provider}/{model})")
+
+
+ if __name__ == "__main__":
+     main()
@@ -0,0 +1,55 @@
+ #!/usr/bin/env python3
+ """Passthrough eval — collects outputs for judge subagent scoring.
+
+ When no custom eval.py exists, this is used as the default. It does NOT score
+ outputs — it collects them and marks them for the judge subagent to evaluate.
+ The evolve skill detects eval_type=pending-judge and spawns the judge agent.
+ """
+
+ import argparse
+ import json
+ import os
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--results-dir", required=True)
+     parser.add_argument("--tasks-dir", required=True)
+     parser.add_argument("--scores", required=True)
+     args = parser.parse_args()
+
+     per_task = {}
+     for fname in sorted(os.listdir(args.tasks_dir)):
+         if not fname.endswith(".json"):
+             continue
+         with open(os.path.join(args.tasks_dir, fname)) as f:
+             task = json.load(f)
+         task_id = task["id"]
+
+         result_path = os.path.join(args.results_dir, fname)
+         output = ""
+         if os.path.exists(result_path):
+             with open(result_path) as f:
+                 result = json.load(f)
+             output = str(result.get("output", ""))
+
+         per_task[task_id] = {
+             "score": -1,
+             "input": str(task.get("input", ""))[:500],
+             "output": output[:500],
+         }
+
+     scores = {
+         "combined_score": -1,
+         "eval_type": "pending-judge",
+         "total_tasks": len(per_task),
+         "per_task": per_task,
+     }
+     with open(args.scores, "w") as f:
+         json.dump(scores, f, indent=2)
+
+     print(f"Collected {len(per_task)} task outputs for judge scoring.")
+
+
+ if __name__ == "__main__":
+     main()
@@ -0,0 +1,125 @@
+ #!/usr/bin/env python3
+ """Shared LLM API calling utility. Stdlib-only (urllib).
+
+ Auto-detects the best available provider from environment variables.
+ Supports: Gemini, OpenAI, Anthropic, OpenRouter.
+ """
+
+ import json
+ import os
+ import time
+ from urllib.request import Request, urlopen
+ from urllib.error import HTTPError
+
+ PROVIDER_PRIORITY = [
+     ("GEMINI_API_KEY", "gemini", "gemini-2.5-flash"),
+     ("GOOGLE_API_KEY", "gemini", "gemini-2.5-flash"),
+     ("OPENROUTER_API_KEY", "openrouter", "google/gemini-2.5-flash"),
+     ("OPENAI_API_KEY", "openai", "gpt-4o-mini"),
+     ("ANTHROPIC_API_KEY", "anthropic", "claude-haiku-4-5-20251001"),
+ ]
+
+
+ def detect_provider():
+     """Auto-detect best available LLM provider from env vars.
+     Returns (provider_name, api_key, model) or raises RuntimeError."""
+     for env_var, provider, model in PROVIDER_PRIORITY:
+         key = os.environ.get(env_var, "")
+         if key:
+             return provider, key, model
+     raise RuntimeError(
+         "No LLM API key found. Set one of: " +
+         ", ".join(e for e, _, _ in PROVIDER_PRIORITY)
+     )
+
+
+ def call_llm(provider, api_key, model, prompt, max_tokens=4096, temperature=0.0):
+     """Call LLM API via urllib. Returns response text. Retries 3x with backoff."""
+     for attempt in range(3):
+         try:
+             if provider == "gemini":
+                 return _call_gemini(api_key, model, prompt, max_tokens, temperature)
+             elif provider == "openai":
+                 return _call_openai(api_key, model, prompt, max_tokens, temperature)
+             elif provider == "anthropic":
+                 return _call_anthropic(api_key, model, prompt, max_tokens, temperature)
+             elif provider == "openrouter":
+                 return _call_openrouter(api_key, model, prompt, max_tokens, temperature)
+             else:
+                 raise ValueError(f"Unknown provider: {provider}")
+         except ValueError:
+             raise
+         except Exception:
+             if attempt == 2:
+                 raise
+             time.sleep(2 ** attempt)
+     raise RuntimeError("All retries failed")
+
+
+ def _call_gemini(api_key, model, prompt, max_tokens, temperature):
+     url = (
+         f"https://generativelanguage.googleapis.com/v1beta/models/"
+         f"{model}:generateContent?key={api_key}"
+     )
+     body = json.dumps({
+         "contents": [{"parts": [{"text": prompt}]}],
+         "generationConfig": {
+             "maxOutputTokens": max_tokens,
+             "temperature": max(temperature, 0.0),
+         },
+     }).encode()
+     req = Request(url, data=body, headers={"Content-Type": "application/json"})
+     with urlopen(req, timeout=60) as resp:
+         data = json.loads(resp.read())
+     return data["candidates"][0]["content"]["parts"][0]["text"]
+
+
+ def _call_openai(api_key, model, prompt, max_tokens, temperature):
+     url = "https://api.openai.com/v1/chat/completions"
+     body = json.dumps({
+         "model": model,
+         "max_tokens": max_tokens,
+         "temperature": temperature,
+         "messages": [{"role": "user", "content": prompt}],
+     }).encode()
+     req = Request(url, data=body, headers={
+         "Content-Type": "application/json",
+         "Authorization": f"Bearer {api_key}",
+     })
+     with urlopen(req, timeout=60) as resp:
+         data = json.loads(resp.read())
+     return data["choices"][0]["message"]["content"]
+
+
+ def _call_anthropic(api_key, model, prompt, max_tokens, temperature):
+     url = "https://api.anthropic.com/v1/messages"
+     body = json.dumps({
+         "model": model,
+         "max_tokens": max_tokens,
+         "messages": [{"role": "user", "content": prompt}],
+     }).encode()
+     req = Request(url, data=body, headers={
+         "Content-Type": "application/json",
+         "x-api-key": api_key,
+         "anthropic-version": "2023-06-01",
+     })
+     with urlopen(req, timeout=60) as resp:
+         data = json.loads(resp.read())
+     return data["content"][0]["text"]
+
+
+ def _call_openrouter(api_key, model, prompt, max_tokens, temperature):
+     url = "https://openrouter.ai/api/v1/chat/completions"
+     body = json.dumps({
+         "model": model,
+         "max_tokens": max_tokens,
+         "temperature": temperature,
+         "messages": [{"role": "user", "content": prompt}],
+     }).encode()
+     req = Request(url, data=body, headers={
+         "Content-Type": "application/json",
+         "Authorization": f"Bearer {api_key}",
+     })
+     with urlopen(req, timeout=60) as resp:
+         data = json.loads(resp.read())
+     return data["choices"][0]["message"]["content"]