npm - harness-evolver - Versions diffs - 1.1.0 → 1.3.0 - Mend

harness-evolver 1.1.0 → 1.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (7)

package/agents/harness-evolver-architect.md +18 -4
package/agents/harness-evolver-critic.md +19 -5
package/agents/harness-evolver-proposer.md +19 -4
package/package.json +1 -1
package/skills/architect/SKILL.md +44 -64
package/skills/critic/SKILL.md +37 -13
package/skills/evolve/SKILL.md +93 -12

package/agents/harness-evolver-architect.md CHANGED

@@ -1,12 +1,26 @@
 ---
 name: harness-evolver-architect
 description: |
-  Use this agent when the harness-evolver:architect skill needs to analyze a harness
-  and recommend the optimal multi-agent topology. Reads code analysis signals, traces,
-  and scores to produce a migration plan from current to recommended architecture.
-model: opus
+  Use this agent to analyze harness architecture and recommend optimal multi-agent topology.
+  Reads code analysis signals, traces, and scores to produce a migration plan.
+tools: Read, Write, Bash, Grep, Glob
 ---
+## Bootstrap
+If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
+every file listed there before performing any other actions.
+## Return Protocol
+When done, end your response with:
+## ARCHITECTURE ANALYSIS COMPLETE
+- **Current topology**: {topology}
+- **Recommended**: {topology}
+- **Confidence**: {low|medium|high}
+- **Migration steps**: {N}
 # Harness Evolver — Architect Agent
 You are the architect in a Meta-Harness optimization system. Your job is to analyze a harness's current agent topology, assess whether it matches the task complexity, and recommend the optimal topology with a concrete migration plan.
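
The Return Protocol added above gives the caller a fixed trailer to look for. As a rough sketch of how a wrapper script might read that trailer, assuming the agent's reply is available as plain text (the function name `parse_return_block` and the sample topology values are hypothetical, not part of the package):

```python
import re


def parse_return_block(response: str, marker: str) -> dict[str, str]:
    """Collect '- **Key**: value' fields that follow the last '## MARKER' heading."""
    heading = f"## {marker}"
    idx = response.rfind(heading)
    if idx == -1:
        raise ValueError(f"marker {marker!r} not found in response")
    fields: dict[str, str] = {}
    for line in response[idx + len(heading):].splitlines():
        match = re.match(r"-\s+\*\*(.+?)\*\*:\s*(.*)", line.strip())
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


# Placeholder reply; the topology names are invented for illustration only.
sample = """...analysis text...
## ARCHITECTURE ANALYSIS COMPLETE
- **Current topology**: single-agent
- **Recommended**: planner-executor
- **Confidence**: medium
- **Migration steps**: 3
"""
result = parse_return_block(sample, "ARCHITECTURE ANALYSIS COMPLETE")
```

The same helper would apply to the `## CRITIC REPORT COMPLETE` and `## PROPOSAL COMPLETE` trailers, since all three use the same bullet layout.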

package/agents/harness-evolver-critic.md CHANGED

@@ -1,13 +1,27 @@
 ---
 name: harness-evolver-critic
 description: |
-  Use this agent when scores converge suspiciously fast (>0.3 jump in one iteration
-  or 1.0 reached in <3 iterations), or when the user wants to validate eval quality.
-  Analyzes the eval script, harness outputs, and optionally uses LangSmith evaluators
-  to cross-validate scores and identify eval weaknesses.
-model: opus
+  Use this agent to assess eval quality, detect eval gaming, and propose stricter evaluation.
+  Triggered when scores converge suspiciously fast or on user request.
+tools: Read, Write, Bash, Grep, Glob
 ---
+## Bootstrap
+If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
+every file listed there before performing any other actions.
+## Return Protocol
+When done, end your response with:
+## CRITIC REPORT COMPLETE
+- **Eval quality**: {weak|moderate|strong}
+- **Gaming detected**: {yes|no}
+- **Weaknesses found**: {N}
+- **Improved eval written**: {yes|no}
+- **Score with improved eval**: {score or N/A}
 # Harness Evolver — Critic Agent
 You are the critic in the Harness Evolver loop. Your job is to assess whether the eval

package/agents/harness-evolver-proposer.md CHANGED

@@ -1,12 +1,27 @@
 ---
 name: harness-evolver-proposer
 description: |
-  Use this agent when the harness-evolve skill needs to propose a new harness candidate.
-  This agent navigates the .harness-evolver/ filesystem to diagnose failures in prior
-  candidates and propose an improved harness. It is the core of the Meta-Harness optimization loop.
-model: opus
+  Use this agent when the evolve skill needs to propose a new harness candidate.
+  Navigates the .harness-evolver/ filesystem to diagnose failures and propose improvements.
+tools: Read, Write, Edit, Bash, Glob, Grep
+permissionMode: acceptEdits
 ---
+## Bootstrap
+If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
+every file listed there before performing any other actions. These files are your context.
+## Return Protocol
+When done, end your response with:
+## PROPOSAL COMPLETE
+- **Version**: v{NNN}
+- **Parent**: v{PARENT}
+- **Change**: {one-sentence summary}
+- **Expected impact**: {score prediction}
 # Harness Evolver — Proposer Agent
 You are the proposer in a Meta-Harness optimization loop. Your job is to analyze all prior harness candidates — their code, execution traces, and scores — and propose a new harness that improves on them.
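
The Bootstrap section relies on a `<files_to_read>` block inside the prompt. The agent itself loads those files with its Read tool; the sketch below only illustrates how a pre-flight check outside the agent could list the same paths. It assumes one `- path` entry per line and treats a trailing `(if exists)` note, as used in the architect skill prompt in this diff, as marking an optional file. The function name is hypothetical.

```python
import re


def files_to_read(prompt: str) -> list[tuple[str, bool]]:
    """Return (path, optional) pairs listed inside a <files_to_read> block."""
    block = re.search(r"<files_to_read>(.*?)</files_to_read>", prompt, re.DOTALL)
    if block is None:
        return []
    entries: list[tuple[str, bool]] = []
    for line in block.group(1).splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue  # skip blank lines and anything that is not a list item
        item = line[2:].strip()
        optional = item.endswith("(if exists)")
        if optional:
            item = item[: -len("(if exists)")].strip()
        entries.append((item, optional))
    return entries


prompt = """<objective>
Propose harness version v004.
</objective>
<files_to_read>
- .harness-evolver/summary.json
- .harness-evolver/PROPOSER_HISTORY.md (if exists)
</files_to_read>
"""
entries = files_to_read(prompt)
```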

package/package.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "harness-evolver",
-  "version": "1.1.0",
+  "version": "1.3.0",
   "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
   "author": "Raphael Valdetaro",
   "license": "MIT",

package/skills/architect/SKILL.md CHANGED

@@ -28,80 +28,60 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo
 Use `$TOOLS` prefix for all tool calls below.
-## Step 1: Run Architecture Analysis
+## What To Do
-Build the command based on what exists:
+1. Check `.harness-evolver/` exists.
+2. Run architecture analysis tool:
 ```bash
-CMD="python3 $TOOLS/analyze_architecture.py --harness .harness-evolver/baseline/harness.py"
-# Add traces from best version if evolution has run
-if [ -f ".harness-evolver/summary.json" ]; then
-  BEST=$(python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(s.get('best',{}).get('version',''))")
-  if [ -n "$BEST" ] && [ -d ".harness-evolver/harnesses/$BEST/traces" ]; then
-    CMD="$CMD --traces-dir .harness-evolver/harnesses/$BEST/traces"
-  fi
-  CMD="$CMD --summary .harness-evolver/summary.json"
-fi
-CMD="$CMD -o .harness-evolver/architecture_signals.json"
-eval $CMD
+python3 $TOOLS/analyze_architecture.py \
+    --harness .harness-evolver/baseline/harness.py \
+    -o .harness-evolver/architecture_signals.json
 ```
-Check exit code. If it fails, report the error and stop.
-## Step 2: Spawn Architect Agent
-Spawn the `harness-evolver-architect` agent with:
-> Analyze the harness and recommend the optimal multi-agent topology.
-> Raw signals are at `.harness-evolver/architecture_signals.json`.
-> Write `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`.
-The architect agent will:
-1. Read the signals JSON
-2. Read the harness code and config
-3. Classify the current topology
-4. Assess if it matches task complexity
-5. Recommend the optimal topology with migration steps
-6. Write `architecture.json` and `architecture.md`
-## Step 3: Report
-After the architect agent completes, read the outputs and print a summary:
+If evolution has run, add trace and score data:
+```bash
+python3 $TOOLS/analyze_architecture.py \
+    --harness .harness-evolver/harnesses/{best}/harness.py \
+    --traces-dir .harness-evolver/harnesses/{best}/traces \
+    --summary .harness-evolver/summary.json \
+    -o .harness-evolver/architecture_signals.json
 ```
-Architecture Analysis Complete
-==============================
-Current topology:     {current_topology}
-Recommended topology: {recommended_topology}
-Confidence:           {confidence}
-Reasoning: {reasoning}
-Migration Path:
-  1. {step 1 description}
-  2. {step 2 description}
-  ...
-Risks:
-  - {risk 1}
-  - {risk 2}
-Next: Run /harness-evolver:evolve — the proposer will follow the migration path.
+3. Spawn the `harness-evolver-architect` agent:
+```xml
+<objective>
+Analyze the harness architecture and recommend the optimal multi-agent topology.
+{If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
+{If called by user: "The user requested an architecture analysis."}
+</objective>
+<files_to_read>
+- .harness-evolver/architecture_signals.json
+- .harness-evolver/config.json
+- .harness-evolver/baseline/harness.py
+- .harness-evolver/summary.json (if exists)
+- .harness-evolver/PROPOSER_HISTORY.md (if exists)
+</files_to_read>
+<output>
+Write:
+- .harness-evolver/architecture.json
+- .harness-evolver/architecture.md
+</output>
+<success_criteria>
+- Classifies current topology correctly
+- Recommendation includes migration path with concrete steps
+- Considers detected stack and API key availability
+- Confidence rating is honest (low/medium/high)
+</success_criteria>
 ```
-If the architect recommends no change (current = recommended), report:
-```
-Architecture Analysis Complete
-==============================
-Current topology: {topology} — looks optimal for these tasks.
-No architecture change recommended. Score: {score}
+4. Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
-The proposer can continue evolving within the current topology.
-```
+5. Print summary: current -> recommended, confidence, migration steps.
 ## Arguments
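
The hunk above replaces a shell script that assembled the command string and ran it through `eval` with two fixed invocations. A minimal sketch of the same branching, written as an argument list instead of a string, might look like the following. It assumes `summary.json` stores the best version under `best.version`, as the removed shell snippet did; the helper name `build_analysis_command` and the `v003` version are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def build_analysis_command(root: Path, tools: str) -> list[str]:
    """Return the analyze_architecture.py argument list for the current state."""
    script = f"{tools}/analyze_architecture.py"
    out = str(root / "architecture_signals.json")
    summary_path = root / "summary.json"

    best = ""
    if summary_path.is_file():
        summary = json.loads(summary_path.read_text())
        best = (summary.get("best") or {}).get("version", "")

    # Evolution has run: analyze the best candidate with its traces and scores.
    if best and (root / "harnesses" / best / "traces").is_dir():
        best_dir = root / "harnesses" / best
        return [
            "python3", script,
            "--harness", str(best_dir / "harness.py"),
            "--traces-dir", str(best_dir / "traces"),
            "--summary", str(summary_path),
            "-o", out,
        ]
    # No evolution data yet: analyze the baseline harness only.
    return ["python3", script, "--harness", str(root / "baseline" / "harness.py"), "-o", out]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / ".harness-evolver"
    root.mkdir()
    baseline_cmd = build_analysis_command(root, ".harness-evolver/tools")

    (root / "harnesses" / "v003" / "traces").mkdir(parents=True)
    (root / "summary.json").write_text(json.dumps({"best": {"version": "v003"}}))
    evolved_cmd = build_analysis_command(root, ".harness-evolver/tools")
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would avoid shell quoting issues and surface a non-zero exit code as an exception, which covers the "check exit code" step that the old text spelled out.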

package/skills/critic/SKILL.md CHANGED

@@ -20,18 +20,42 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo
 ## What To Do
-1. Read `summary.json` to check for suspicious patterns:
-   - Score jump >0.3 in a single iteration
-   - Score reached 1.0 in <3 iterations
-   - All tasks suddenly pass after failing
+1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).
 2. Spawn the `harness-evolver-critic` agent:
-   > Analyze the eval quality for this harness evolution project.
-   > Check if the eval at `.harness-evolver/eval/eval.py` is rigorous enough.
-   > The best version is {version} with score {score} achieved in {iterations} iterations.
-3. After the critic reports:
-   - Show the eval quality assessment
-   - If `eval_improved.py` was created, show the score comparison
-   - Ask user: "Adopt the improved eval? This will re-baseline all scores."
-   - If adopted: copy `eval_improved.py` to `eval/eval.py`, re-run baseline, update state
+```xml
+<objective>
+Analyze eval quality for this harness evolution project.
+The best version is {version} with score {score} achieved in {iterations} iteration(s).
+{Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
+</objective>
+<files_to_read>
+- .harness-evolver/eval/eval.py
+- .harness-evolver/summary.json
+- .harness-evolver/harnesses/{best_version}/scores.json
+- .harness-evolver/harnesses/{best_version}/harness.py
+- .harness-evolver/harnesses/{best_version}/proposal.md
+- .harness-evolver/config.json
+</files_to_read>
+<output>
+Write:
+- .harness-evolver/critic_report.md (human-readable analysis)
+- .harness-evolver/eval/eval_improved.py (if weaknesses found)
+</output>
+<success_criteria>
+- Identifies specific weaknesses in eval.py with examples
+- If gaming detected, shows exact tasks/outputs that expose the weakness
+- Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+- Re-scores the best version with improved eval to quantify the difference
+</success_criteria>
+```
+3. Wait for `## CRITIC REPORT COMPLETE`.
+4. Report findings to user. If `eval_improved.py` was written:
+   - Show score comparison (current eval vs improved eval)
+   - Ask: "Adopt the improved eval? This will affect future iterations."
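
Step 1 of the new text no longer states the thresholds, but the removed lines did: a jump of more than 0.3 in a single iteration, or a score of 1.0 in fewer than 3 iterations. The sketch below checks those two patterns, assuming scores are available as a list whose first entry is the baseline and whose later entries follow iteration order. That layout and the function name are assumptions, not the package's data format.

```python
def suspicious_patterns(
    scores: list[float], jump: float = 0.3, fast_iters: int = 3
) -> list[str]:
    """Flag score jumps and premature convergence; scores[0] is the baseline."""
    findings: list[str] = []
    # A gain larger than `jump` between consecutive entries is a score jump.
    for i in range(1, len(scores)):
        if scores[i] - scores[i - 1] > jump:
            findings.append(
                f"Score jumped from {scores[i - 1]} to {scores[i]} in one iteration"
            )
    # A perfect score before iteration `fast_iters` is premature convergence.
    for i in range(1, min(fast_iters, len(scores))):
        if scores[i] >= 1.0:
            findings.append(f"Perfect score in {i} iterations")
            break
    return findings


findings = suspicious_patterns([0.4, 0.45, 1.0])
```

Each returned string could be dropped into the `{Specific concern: ...}` slot of the critic prompt.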

package/skills/evolve/SKILL.md CHANGED

@@ -36,12 +36,38 @@ python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); pri
 ### 2. Propose
-Spawn the `harness-evolver-proposer` agent:
-> You are proposing iteration {i}. Create version {version} in `.harness-evolver/harnesses/{version}/`.
-> Working directory contains `.harness-evolver/` with all prior candidates and traces.
+Spawn the `harness-evolver-proposer` agent with a structured prompt:
+```xml
+<objective>
+Propose harness version {version} that improves on the current best score of {best_score}.
+</objective>
+<files_to_read>
+- .harness-evolver/summary.json
+- .harness-evolver/PROPOSER_HISTORY.md
+- .harness-evolver/config.json
+- .harness-evolver/baseline/harness.py
+- .harness-evolver/harnesses/{best_version}/harness.py
+- .harness-evolver/harnesses/{best_version}/scores.json
+- .harness-evolver/harnesses/{best_version}/proposal.md
+</files_to_read>
+<output>
+Create directory .harness-evolver/harnesses/{version}/ containing:
+- harness.py (the improved harness)
+- config.json (parameters, copy from parent if unchanged)
+- proposal.md (reasoning, must start with "Based on v{PARENT}")
+</output>
+<success_criteria>
+- harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
+- proposal.md documents evidence-based reasoning
+- Changes are motivated by trace analysis, not guesswork
+</success_criteria>
+```
-The proposer creates: `harness.py`, `config.json`, `proposal.md`.
+Wait for the agent to complete. Look for `## PROPOSAL COMPLETE` in the response.
 ### 3. Validate
@@ -97,20 +123,75 @@ If score is 1.0 and iterations < 3, STOP the loop and strongly recommend the cri
 > the eval is too easy, not that the harness is perfect. Run `/harness-evolver:critic`
 > before continuing.
-### 7. Check Stop Conditions
+### 7. Auto-trigger Architect (on stagnation or regression)
+Check if the architect should be auto-spawned. This happens when:
+- **Stagnation**: 3 consecutive iterations within 1% of each other
+- **Regression**: score dropped below parent score (even once)
+AND `.harness-evolver/architecture.json` does NOT already exist.
+If triggered:
+```bash
+python3 $TOOLS/analyze_architecture.py \
+    --harness .harness-evolver/harnesses/{best_version}/harness.py \
+    --traces-dir .harness-evolver/harnesses/{best_version}/traces \
+    --summary .harness-evolver/summary.json \
+    -o .harness-evolver/architecture_signals.json
+```
+Then spawn the `harness-evolver-architect` agent:
+```xml
+<objective>
+The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
+Analyze the harness architecture and recommend a topology change.
+</objective>
+<files_to_read>
+- .harness-evolver/architecture_signals.json
+- .harness-evolver/summary.json
+- .harness-evolver/PROPOSER_HISTORY.md
+- .harness-evolver/config.json
+- .harness-evolver/harnesses/{best_version}/harness.py
+- .harness-evolver/harnesses/{best_version}/scores.json
+</files_to_read>
+<output>
+Write:
+- .harness-evolver/architecture.json (structured recommendation)
+- .harness-evolver/architecture.md (human-readable analysis)
+</output>
+<success_criteria>
+- Recommendation includes concrete migration steps
+- Each step is implementable in one proposer iteration
+- Considers detected stack and available API keys
+</success_criteria>
+```
+Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
+After the architect completes, report:
+> Architect recommends: {current} → {recommended} ({confidence} confidence)
+> Migration path: {N} steps. Continuing evolution with architecture guidance.
+Then **continue the loop** — the proposer will read `architecture.json` in the next iteration.
+If `architecture.json` already exists (architect already ran), skip — don't re-run.
+### 8. Check Stop Conditions
-- **Stagnation**: last 3 scores within 1% of each other → stop
 - **Target**: `combined_score >= target_score` → stop
 - **N reached**: done
+- **Stagnation post-architect**: 3 more iterations without improvement AFTER architect ran → stop (architecture change didn't help)
 ## When Loop Ends — Final Report
 - Best version and score
 - Improvement over baseline (absolute and %)
 - Total iterations run
+- Whether architect was triggered and what it recommended
 - Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."
-If the loop stopped due to stagnation AND `.harness-evolver/architecture.json` does NOT exist:
-> The proposer may have hit an architectural ceiling. Run `/harness-evolver:architect`
-> to analyze whether a different agent topology could help.
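
The new step 7 combines three conditions: stagnation, regression, and the absence of `architecture.json`. A minimal sketch of that decision follows. It reads "within 1%" as an absolute spread of 0.01 on a 0 to 1 score scale, which is an assumption; the function name and parameters are hypothetical as well.

```python
from typing import Optional


def architect_trigger(
    scores: list[float],
    parent_score: Optional[float],
    architecture_exists: bool,
    tolerance: float = 0.01,
    window: int = 3,
) -> Optional[str]:
    """Return 'regressed', 'stagnated', or None if the architect should not run."""
    if architecture_exists:
        return None  # architect already ran; do not re-run
    # Regression: the latest score fell below its parent's score, even once.
    if scores and parent_score is not None and scores[-1] < parent_score:
        return "regressed"
    # Stagnation: the last `window` scores lie within `tolerance` of each other.
    recent = scores[-window:]
    if len(recent) == window and max(recent) - min(recent) <= tolerance:
        return "stagnated"
    return None


stagnated = architect_trigger([0.70, 0.705, 0.702], 0.70, False)
regressed = architect_trigger([0.6, 0.7, 0.65], 0.7, False)
skipped = architect_trigger([0.6, 0.7, 0.65], 0.7, True)
```

The returned label matches the `{stagnated/regressed}` slot in the architect prompt, so the same value could feed both the trigger and the prompt text.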