npm - harness-evolver - Versions diffs - 0.2.0 → 0.4.0 - Mend

harness-evolver 0.2.0 → 0.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (10)

package/package.json +1 -1
package/skills/compare/SKILL.md +73 -0
package/skills/deploy/SKILL.md +53 -0
package/skills/diagnose/SKILL.md +96 -0
package/skills/evolve/SKILL.md +94 -0
package/skills/init/SKILL.md +66 -0
package/skills/status/SKILL.md +34 -0
package/skills/harness-evolve/SKILL.md +0 -93
package/skills/harness-evolve-init/SKILL.md +0 -50
package/skills/harness-evolve-status/SKILL.md +0 -25

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "harness-evolver",
-  "version": "0.2.0",
+  "version": "0.4.0",
   "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
   "author": "Raphael Valdetaro Christi Cordeiro",
   "license": "MIT",

package/skills/compare/SKILL.md ADDED

@@ -0,0 +1,73 @@
+---
+name: compare
+description: "Use when the user wants to compare two harness versions, understand what changed between iterations, see why one version scored better than another, or debug a regression."
+argument-hint: "<vA> <vB>"
+allowed-tools: [Read, Bash, Glob, Grep]
+---
+# /harness-evolver:compare
+Compare two harness versions side by side.
+## Arguments
+- `vA` — first version (e.g., `v001`, `baseline`)
+- `vB` — second version (e.g., `v003`)
+If only one version given, compare it against the current best.
+If no versions given, compare the two most recent.
+## What To Do
+### 1. Code Diff
+```bash
+diff .harness-evolver/harnesses/{vA}/harness.py .harness-evolver/harnesses/{vB}/harness.py
+```
+If config changed:
+```bash
+diff .harness-evolver/harnesses/{vA}/config.json .harness-evolver/harnesses/{vB}/config.json
+```
+### 2. Score Comparison
+```bash
+cat .harness-evolver/harnesses/{vA}/scores.json
+cat .harness-evolver/harnesses/{vB}/scores.json
+```
+Report: combined_score delta, per-task wins/losses.
+### 3. Per-Task Analysis
+For tasks where scores diverge, show what each version produced:
+```bash
+cat .harness-evolver/harnesses/{vA}/traces/task_{ID}/output.json
+cat .harness-evolver/harnesses/{vB}/traces/task_{ID}/output.json
+```
+### 4. Proposal Context
+```bash
+cat .harness-evolver/harnesses/{vB}/proposal.md
+```
+Show what the proposer intended and whether the result matched expectations.
+## Report Format
+```
+v001 (0.62) vs v003 (0.71) — +0.09 improvement
+Code changes:
+  + Added few-shot examples (3 examples)
+  ~ Changed prompt template
+  - Removed retry logic
+Per-task:
+  task_001: 1.0 → 1.0 (unchanged)
+  task_007: 0.0 → 1.0 (FIXED — was cardiac, now correctly classified)
+  task_008: 1.0 → 0.0 (REGRESSION — was neurological, now wrong)
+```
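
Reviewer note (not part of the package): the per-task comparison described in `compare/SKILL.md` can be sketched in Python as below. This is only an illustration. The diff does not show the layout of `scores.json`, so the sketch assumes the per-task scores have already been loaded into plain `{task_id: score}` dictionaries, and `compare_tasks` and `format_report` are hypothetical helper names.

```python
def compare_tasks(scores_a, scores_b):
    """Return (task_id, score_a, score_b, label) for tasks present in both versions.

    Assumption: both arguments are {task_id: float} dicts; the real
    scores.json layout is not shown in the diff.
    """
    rows = []
    for task_id in sorted(scores_a.keys() & scores_b.keys()):
        a, b = scores_a[task_id], scores_b[task_id]
        if b > a:
            label = "improved"
        elif b < a:
            label = "regression"
        else:
            label = "unchanged"
        rows.append((task_id, a, b, label))
    return rows


def format_report(name_a, name_b, combined_a, combined_b, rows):
    """Render a report loosely modelled on the skill's 'Report Format' section."""
    delta = combined_b - combined_a
    lines = [f"{name_a} ({combined_a:.2f}) vs {name_b} ({combined_b:.2f}): {delta:+.2f}"]
    lines.append("Per-task:")
    for task_id, a, b, label in rows:
        lines.append(f"  {task_id}: {a} → {b} ({label})")
    return "\n".join(lines)


if __name__ == "__main__":
    v001 = {"task_001": 1.0, "task_007": 0.0, "task_008": 1.0}
    v003 = {"task_001": 1.0, "task_007": 1.0, "task_008": 0.0}
    print(format_report("v001", "v003", 0.62, 0.71, compare_tasks(v001, v003)))
```

Only tasks present in both versions are compared; tasks that exist in just one version would need separate handling.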

package/skills/deploy/SKILL.md ADDED

@@ -0,0 +1,53 @@
+---
+name: deploy
+description: "Use when the user wants to use the best evolved harness in their project, promote a version to production, copy the winning harness back, or is done evolving and wants to apply the result."
+argument-hint: "[version]"
+allowed-tools: [Read, Write, Bash, Glob]
+---
+# /harness-evolver:deploy
+Promote the best (or specified) harness version back to the user's project.
+## Arguments
+- `version` — optional. If not given, deploys the best version from `summary.json`.
+## What To Do
+### 1. Identify Best Version
+```bash
+python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(s['best']['version'], s['best']['combined_score'])"
+```
+Or use the user-specified version.
+### 2. Show What Will Be Deployed
+```bash
+cat .harness-evolver/harnesses/{version}/proposal.md
+cat .harness-evolver/harnesses/{version}/scores.json
+```
+Report: version, score, improvement over baseline, what changed.
+### 3. Ask for Confirmation
+> Deploy `{version}` (score: {score}, +{delta} over baseline) to your project?
+> This will copy `harness.py` and `config.json` to the project root.
+### 4. Copy Files
+```bash
+cp .harness-evolver/harnesses/{version}/harness.py ./harness.py
+cp .harness-evolver/harnesses/{version}/config.json ./config.json  # if exists
+```
+If the original entry point had a different name (e.g., `graph.py`), ask the user where to put it.
+### 5. Report
+- What was copied and where
+- Score improvement: baseline → deployed version
+- Suggest: review the diff before committing
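
Reviewer note (not part of the package): the copy step in `deploy/SKILL.md` treats `harness.py` as required and `config.json` as optional ("if exists"). A minimal Python sketch of that behaviour follows. The function name `deploy` and its parameters are hypothetical, and the confirmation prompt from step 3 is deliberately left out.

```python
import shutil
from pathlib import Path


def deploy(version, base_dir=".harness-evolver", dest="."):
    """Copy harness.py (required) and config.json (optional) for one version.

    Returns the list of file names that were copied.
    """
    source_dir = Path(base_dir) / "harnesses" / version
    dest_dir = Path(dest)

    harness = source_dir / "harness.py"
    if not harness.is_file():
        raise FileNotFoundError(f"no harness.py for {version} in {source_dir}")
    shutil.copy2(harness, dest_dir / "harness.py")
    copied = ["harness.py"]

    # config.json is only copied when the version actually has one.
    config = source_dir / "config.json"
    if config.is_file():
        shutil.copy2(config, dest_dir / "config.json")
        copied.append("config.json")
    return copied
```

As with the `cp` commands in the skill, this overwrites any existing `harness.py` in the destination, which is why the skill asks for confirmation first.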

package/skills/diagnose/SKILL.md ADDED

@@ -0,0 +1,96 @@
+---
+name: diagnose
+description: "Use when the user wants to understand why a specific harness version failed, investigate a regression, analyze trace data, or debug a low score. Also use when the user says 'why did v003 fail' or 'what went wrong'."
+argument-hint: "[version]"
+allowed-tools: [Read, Bash, Glob, Grep]
+---
+# /harness-evolver:diagnose
+Deep analysis of a harness version's execution traces and scores.
+## Arguments
+- `version` — version to diagnose (e.g., `v003`). If not given, diagnose the worst or most recent regression.
+## Resolve Tool Path
+```bash
+TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
+```
+## What To Do
+### 1. Identify the Version
+If not specified, find the worst or most recent regression:
+```bash
+python3 $TOOLS/state.py show --base-dir .harness-evolver
+cat .harness-evolver/summary.json
+```
+### 2. Score Breakdown
+```bash
+cat .harness-evolver/harnesses/{version}/scores.json
+```
+Identify which tasks failed (`score: 0.0`) and which passed.
+### 3. Trace Analysis (failed tasks)
+For each failed task:
+```bash
+cat .harness-evolver/harnesses/{version}/traces/{task_id}/input.json
+cat .harness-evolver/harnesses/{version}/traces/{task_id}/output.json
+```
+Look for patterns: wrong format? wrong category? empty output? crash?
+### 4. Error Search
+```bash
+grep -r "error\|Error\|FAIL\|exception\|Traceback" .harness-evolver/harnesses/{version}/traces/
+cat .harness-evolver/harnesses/{version}/traces/stderr.log
+```
+### 5. Compare with Parent
+Read the proposal to find the parent version:
+```bash
+cat .harness-evolver/harnesses/{version}/proposal.md
+```
+Then diff:
+```bash
+diff .harness-evolver/harnesses/{parent}/harness.py .harness-evolver/harnesses/{version}/harness.py
+```
+### 6. LangSmith (if available)
+If `langsmith-cli` is installed and LangSmith is configured:
+```bash
+langsmith-cli --json runs list --project harness-evolver-{version} --failed --fields id,name,error,inputs
+langsmith-cli --json runs stats --project harness-evolver-{version}
+```
+### 7. Report
+```
+Diagnosis: v003 (score: 0.31) — REGRESSION from v001 (0.62)
+Root cause: Prompt template change broke JSON parsing
+  - 4/10 tasks returned malformed output
+  - stderr shows: json.JSONDecodeError on 4 tasks
+  - The change on line 42 removed the "Reply with ONLY..." instruction
+Affected tasks: task_002, task_005, task_007, task_010
+Unaffected tasks: task_001, task_003, task_004, task_006, task_008, task_009
+Recommendation: Revert the prompt change, keep the retry logic from v002
+```
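
Reviewer note (not part of the package): the error search in step 4 of `diagnose/SKILL.md` is a recursive `grep` over the traces directory. A rough Python equivalent, using the same alternation pattern, might look like the sketch below. `find_errors` is a hypothetical name, and the sketch assumes trace files are text that can be decoded as UTF-8 (undecodable bytes are replaced rather than raising).

```python
import re
from pathlib import Path

# Same alternatives as the grep pattern in the skill.
ERROR_PATTERN = re.compile(r"error|Error|FAIL|exception|Traceback")


def find_errors(traces_dir):
    """Return (relative_path, line_number, line) for every matching line."""
    hits = []
    root = Path(traces_dir)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if ERROR_PATTERN.search(line):
                hits.append((str(path.relative_to(root)), lineno, line.strip()))
    return hits
```

Grouping the hits by task directory is one way to arrive at the "4/10 tasks returned malformed output" style of summary shown in the report example.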

package/skills/evolve/SKILL.md ADDED

@@ -0,0 +1,94 @@
+---
+name: evolve
+description: "Use when the user wants to run the optimization loop, improve harness performance, evolve the harness, or iterate on harness quality. Requires .harness-evolver/ to exist (run harness-evolver:init first)."
+argument-hint: "[--iterations N]"
+allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
+---
+# /harness-evolve
+Run the autonomous propose-evaluate-iterate loop.
+## Prerequisites
+`.harness-evolver/summary.json` must exist. If not, tell user to run `harness-evolver:init`.
+## Resolve Tool Path
+```bash
+TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
+```
+## Parse Arguments
+- `--iterations N` (default: 10)
+- Read `config.json` for `evolution.stagnation_limit` (default: 3) and `evolution.target_score`
+## The Loop
+For each iteration:
+### 1. Get Next Version
+```bash
+python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
+```
+### 2. Propose
+Spawn the `harness-evolver-proposer` agent:
+> You are proposing iteration {i}. Create version {version} in `.harness-evolver/harnesses/{version}/`.
+> Working directory contains `.harness-evolver/` with all prior candidates and traces.
+The proposer creates: `harness.py`, `config.json`, `proposal.md`.
+### 3. Validate
+```bash
+python3 $TOOLS/evaluate.py validate \
+    --harness .harness-evolver/harnesses/{version}/harness.py \
+    --config .harness-evolver/harnesses/{version}/config.json
+```
+If fails: one retry via proposer. If still fails: score 0.0, continue.
+### 4. Evaluate
+```bash
+python3 $TOOLS/evaluate.py run \
+    --harness .harness-evolver/harnesses/{version}/harness.py \
+    --config .harness-evolver/harnesses/{version}/config.json \
+    --tasks-dir .harness-evolver/eval/tasks/ \
+    --eval .harness-evolver/eval/eval.py \
+    --traces-dir .harness-evolver/harnesses/{version}/traces/ \
+    --scores .harness-evolver/harnesses/{version}/scores.json \
+    --timeout 60
+```
+### 5. Update State
+```bash
+python3 $TOOLS/state.py update \
+    --base-dir .harness-evolver \
+    --version {version} \
+    --scores .harness-evolver/harnesses/{version}/scores.json \
+    --proposal .harness-evolver/harnesses/{version}/proposal.md
+```
+### 6. Report
+Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
+### 7. Check Stop Conditions
+- **Stagnation**: last 3 scores within 1% of each other → stop
+- **Target**: `combined_score >= target_score` → stop
+- **N reached**: done
+## When Loop Ends — Final Report
+- Best version and score
+- Improvement over baseline (absolute and %)
+- Total iterations run
+- Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."
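
Reviewer note (not part of the package): the version numbering and the three stop conditions in `evolve/SKILL.md` can be sketched as below. Two points are assumptions rather than anything the diff confirms: "within 1% of each other" is read as an absolute spread of at most 0.01 on a 0 to 1 score scale, and `scores` is taken to be the list of combined scores in iteration order. The names `next_version` and `stop_reason` are hypothetical.

```python
def next_version(iterations):
    """Mirror the skill's one-liner: 3 completed iterations -> 'v004'."""
    return f"v{iterations + 1:03d}"


def stop_reason(scores, max_iterations, target_score=None, window=3, tolerance=0.01):
    """Return 'stagnation', 'target', 'iterations', or None to keep going.

    Assumption: 'within 1%' means max - min <= 0.01 over the last `window` scores.
    """
    if not scores:
        return None
    if len(scores) >= window:
        recent = scores[-window:]
        if max(recent) - min(recent) <= tolerance:
            return "stagnation"
    if target_score is not None and scores[-1] >= target_score:
        return "target"
    if len(scores) >= max_iterations:
        return "iterations"
    return None
```

The `window` default of 3 matches both the hard-coded "last 3 scores" in step 7 and the documented default of `evolution.stagnation_limit`; whether the two are meant to be the same setting is not clear from the diff.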

package/skills/init/SKILL.md ADDED

@@ -0,0 +1,66 @@
+---
+name: init
+description: "Use when the user wants to set up harness optimization in their project, optimize an LLM agent, improve a harness, or mentions harness-evolver for the first time in a project without .harness-evolver/ directory."
+argument-hint: "[directory]"
+allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
+---
+# /harness-evolve-init
+Set up the Harness Evolver in a project. Scans the codebase, identifies the entry point, creates missing artifacts, runs baseline evaluation.
+## Resolve Tool Path
+```bash
+TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
+```
+Use `$TOOLS` prefix for all tool calls below.
+## Phase 1: Scan
+```bash
+find . -maxdepth 3 -type f -name "*.py" | head -30
+python3 $TOOLS/detect_stack.py .
+```
+Look for:
+- Entry points: files with `if __name__`, or named `main.py`, `app.py`, `agent.py`, `graph.py`, `pipeline.py`, `bot.py`
+- Existing eval: `eval.py`, `score.py`, `judge.py`
+- Existing tasks: directories with JSON files containing `id` + `input` fields
+- Config: `config.json`, `config.yaml`, `.env`
+## Phase 2: Create What's Missing
+Three artifacts needed. For each — use existing if found, create if not.
+**Harness** (`harness.py`): If user's entry point doesn't match our CLI interface (`--input`, `--output`, `--traces-dir`, `--config`), create a thin wrapper that imports their code. Read their entry point first to understand the I/O format. Ask if unsure.
+**Eval** (`eval.py`): Ask the user what "correct" means for their domain. Generate the simplest eval that gives signal. Even rough scoring works — the evolver iterates.
+**Tasks** (`tasks/`): If no test data exists, ask the user for 5-10 example input/output pairs. Each task is `{"id": "task_001", "input": "...", "expected": "...", "metadata": {}}`.
+## Phase 3: Run Init
+```bash
+python3 $TOOLS/init.py [directory] \
+    --harness harness.py --eval eval.py --tasks tasks/ \
+    --tools-dir $TOOLS
+```
+Add `--harness-config config.json` if a config exists.
+## After Init — Report
+- What was detected vs created
+- Stack + integrations (LangSmith, Context7)
+- Baseline score
+- Next: `harness-evolver:evolve` to start
+## Gotchas
+- The harness must write valid JSON to `--output`. If the user's code returns non-JSON, the wrapper must serialize it.
+- Tasks must have unique `id` fields. Duplicate IDs cause silent eval errors.
+- The `expected` field is never shown to the harness — only the eval script sees it.
+- If `.harness-evolver/` already exists, warn before overwriting.
+- If no Python files exist in CWD, the user is probably in the wrong directory.
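
Reviewer note (not part of the package): the "Gotchas" list in `init/SKILL.md` warns that duplicate task IDs cause silent eval errors. A small pre-flight check along the following lines could catch that before the baseline run. It is a sketch only: `validate_tasks` is a hypothetical helper, and treating `expected` as required (alongside `id` and `input`) is an assumption based on the example task shape in Phase 2.

```python
from collections import Counter

# Assumption: 'metadata' is optional, the other three fields are required.
REQUIRED_FIELDS = ("id", "input", "expected")


def validate_tasks(tasks):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for index, task in enumerate(tasks):
        missing = [field for field in REQUIRED_FIELDS if field not in task]
        if missing:
            problems.append(f"task #{index}: missing {', '.join(missing)}")

    # Duplicate IDs are reported once per ID, with the number of occurrences.
    counts = Counter(task["id"] for task in tasks if "id" in task)
    for task_id, count in sorted(counts.items()):
        if count > 1:
            problems.append(f"duplicate id {task_id!r} appears {count} times")
    return problems
```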

package/skills/status/SKILL.md ADDED

@@ -0,0 +1,34 @@
+---
+name: status
+description: "Use when the user asks about evolution progress, current scores, best harness version, how many iterations ran, or whether the loop is stagnating. Also use when the user says 'status', 'progress', or 'how is it going'."
+allowed-tools: [Read, Bash]
+---
+# /harness-evolve-status
+Show evolution progress.
+## Resolve Tool Path
+```bash
+TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
+```
+## What To Do
+If `.harness-evolver/` does not exist, tell user to run `harness-evolver:init` first.
+Otherwise:
+```bash
+python3 $TOOLS/state.py show --base-dir .harness-evolver
+```
+Then read and display `.harness-evolver/STATE.md` for the full history table.
+## If User Wants More Detail
+- Scores per task: `cat .harness-evolver/harnesses/{version}/scores.json`
+- What changed: `cat .harness-evolver/harnesses/{version}/proposal.md`
+- Compare two versions: `diff .harness-evolver/harnesses/{vA}/harness.py .harness-evolver/harnesses/{vB}/harness.py`
+- Full history: `cat .harness-evolver/PROPOSER_HISTORY.md`
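
Reviewer note (not part of the package): the "Resolve Tool Path" shell one-liner that now appears in the `diagnose`, `evolve`, `init`, and `status` skills prefers a project-local `.harness-evolver/tools` directory and falls back to the copy under `$HOME`. The same lookup, sketched in Python with a hypothetical `resolve_tools` helper:

```python
from pathlib import Path


def resolve_tools(project_dir=".", home=None):
    """Return the local tools directory if it exists, else the one under home.

    Like the shell one-liner, the fallback path is returned without
    checking that it exists.
    """
    local = Path(project_dir) / ".harness-evolver" / "tools"
    if local.is_dir():
        return local
    home_dir = Path(home) if home is not None else Path.home()
    return home_dir / ".harness-evolver" / "tools"
```

The 0.2.0 skills hard-coded `~/.harness-evolver/tools` and mentioned the local override only in prose, so this lookup order is the main behavioural change in how the tools are invoked.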

package/skills/harness-evolve/SKILL.md DELETED

@@ -1,93 +0,0 @@
----
-name: harness-evolve
-description: "Run the harness evolution loop. Autonomously proposes, evaluates, and iterates on harness designs using full execution traces as feedback."
-argument-hint: "[--iterations N] [--candidates-per-iter N]"
-allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
----
-# /harness-evolve
-Run the Meta-Harness optimization loop.
-## Arguments
-- `--iterations N` (default: 10) — number of evolution iterations
-- `--candidates-per-iter N` (default: 1) — harnesses per iteration
-## Prerequisites
-Run `/harness-evolve-init` first. The `.harness-evolver/` directory must exist with a valid `summary.json`.
-## The Loop
-For each iteration i from 1 to N:
-### 1. PROPOSE
-Determine the next version number by reading `summary.json`:
-```bash
-python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
-```
-Spawn the `harness-evolver-proposer` agent with this prompt:
-> You are proposing iteration {i}. Create version {version_number} in `.harness-evolver/harnesses/{version_number}/`.
-> Working directory contains `.harness-evolver/` with all prior candidates and traces.
-The proposer agent will create:
-- `.harness-evolver/harnesses/v{NNN}/harness.py`
-- `.harness-evolver/harnesses/v{NNN}/config.json`
-- `.harness-evolver/harnesses/v{NNN}/proposal.md`
-### 2. VALIDATE
-```bash
-python3 ~/.harness-evolver/tools/evaluate.py validate \
-    --harness .harness-evolver/harnesses/v{NNN}/harness.py \
-    --config .harness-evolver/harnesses/v{NNN}/config.json
-```
-If validation fails, ask the proposer to fix (1 retry). If it fails again, set score to 0.0 and continue.
-### 3. EVALUATE
-```bash
-python3 ~/.harness-evolver/tools/evaluate.py run \
-    --harness .harness-evolver/harnesses/v{NNN}/harness.py \
-    --config .harness-evolver/harnesses/v{NNN}/config.json \
-    --tasks-dir .harness-evolver/eval/tasks/ \
-    --eval .harness-evolver/eval/eval.py \
-    --traces-dir .harness-evolver/harnesses/v{NNN}/traces/ \
-    --scores .harness-evolver/harnesses/v{NNN}/scores.json \
-    --timeout 60
-```
-### 4. UPDATE STATE
-```bash
-python3 ~/.harness-evolver/tools/state.py update \
-    --base-dir .harness-evolver \
-    --version v{NNN} \
-    --scores .harness-evolver/harnesses/v{NNN}/scores.json \
-    --proposal .harness-evolver/harnesses/v{NNN}/proposal.md
-```
-### 5. REPORT
-Read the updated `summary.json` and report:
-- `Iteration {i}/{N}: v{NNN} scored {score} (best: v{best} at {best_score})`
-- If regression (score < parent score): warn
-- If new best: celebrate
-### Stop Conditions
-- All N iterations completed
-- **Stagnation**: 3 consecutive iterations without >1% improvement. Read `summary.json` history to check.
-- **Target reached**: if `config.json` has `target_score` set and achieved.
-When stopping, report final summary: best version, score, number of iterations, improvement over baseline.
-## Tool Path Resolution
-Check `.harness-evolver/tools/` first (local override), then `~/.harness-evolver/tools/` (global install).

package/skills/harness-evolve-init/SKILL.md DELETED

@@ -1,50 +0,0 @@
----
-name: harness-evolve-init
-description: "Initialize harness evolution in the current project. Auto-detects harness.py, eval.py, and tasks/ in the working directory."
-argument-hint: "[directory] [--harness <path>] [--eval <path>] [--tasks <path>]"
-allowed-tools: [Read, Write, Bash, Glob]
----
-# /harness-evolve-init
-Initialize the Harness Evolver for this project.
-## Usage
-```
-/harness-evolve-init                    # auto-detect everything in CWD
-/harness-evolve-init ./my-project       # auto-detect in a specific directory
-/harness-evolve-init --harness run.py   # override one path, auto-detect the rest
-```
-## How Auto-Detection Works
-The tool scans the directory for:
-1. **Exact names:** `harness.py`, `eval.py`, `tasks/`, `config.json`
-2. **Fuzzy fallback:** `*harness*`, `*agent*`, `*run*` for harness; `*eval*`, `*score*` for eval; any dir with JSON files containing `id`/`input` fields for tasks
-If all 3 are found, init proceeds immediately. If something is missing, it reports what's needed.
-## What To Do
-Run the init tool:
-```bash
-python3 ~/.harness-evolver/tools/init.py {directory if provided} \
-    --tools-dir ~/.harness-evolver/tools
-```
-Add explicit flags only if the user provided them:
-- `--harness PATH` — override harness auto-detection
-- `--eval PATH` — override eval auto-detection
-- `--tasks PATH` — override tasks auto-detection
-- `--harness-config PATH` — optional config for the harness
-If `~/.harness-evolver/tools/init.py` does not exist, check `.harness-evolver/tools/init.py` (local override).
-After init completes, report:
-- What was detected (harness, eval, tasks)
-- Baseline score
-- Number of tasks
-- Integrations detected (LangSmith, Context7, stack)
-- Next step: run `/harness-evolve` to start the optimization loop

package/skills/harness-evolve-status/SKILL.md DELETED

@@ -1,25 +0,0 @@
----
-name: harness-evolve-status
-description: "Show the current status of harness evolution: best score, iteration count, progress history."
-allowed-tools: [Read, Bash]
----
-# /harness-evolve-status
-Show the current evolution status.
-## What To Do
-```bash
-python3 ~/.harness-evolver/tools/state.py show --base-dir .harness-evolver
-```
-If that doesn't exist, try:
-```bash
-python3 .harness-evolver/tools/state.py show --base-dir .harness-evolver
-```
-Also read and display the contents of `.harness-evolver/STATE.md` for the full status table.
-If `.harness-evolver/` doesn't exist, tell the user to run `/harness-evolve-init` first.