harness-evolver 1.4.0 → 1.6.0

@@ -4,6 +4,7 @@ description: |
  Use this agent to analyze harness architecture and recommend optimal multi-agent topology.
  Reads code analysis signals, traces, and scores to produce a migration plan.
  tools: Read, Write, Bash, Grep, Glob
+ color: blue
  ---
 
  ## Bootstrap
@@ -4,6 +4,7 @@ description: |
  Use this agent to assess eval quality, detect eval gaming, and propose stricter evaluation.
  Triggered when scores converge suspiciously fast or on user request.
  tools: Read, Write, Bash, Grep, Glob
+ color: red
  ---
 
  ## Bootstrap
@@ -4,6 +4,7 @@ description: |
  Use this agent when the evolve skill needs to propose a new harness candidate.
  Navigates the .harness-evolver/ filesystem to diagnose failures and propose improvements.
  tools: Read, Write, Edit, Bash, Glob, Grep
+ color: green
  permissionMode: acceptEdits
  ---
 
@@ -12,6 +13,24 @@ permissionMode: acceptEdits
  If your prompt contains a `<files_to_read>` block, you MUST use the Read tool to load
  every file listed there before performing any other actions. These files are your context.
 
+ ## Context7 — Enrich Your Knowledge
+
+ You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
+
+ **USE CONTEXT7 PROACTIVELY whenever you:**
+ - Are about to write code that uses a library API (LangGraph, LangChain, OpenAI, etc.)
+ - Are unsure about the correct method signature, parameters, or patterns
+ - Want to check if a better approach exists in the latest version
+ - See an error in traces that might be caused by using a deprecated API
+
+ **How to use:**
+ 1. `resolve-library-id` with the library name (e.g., "langchain", "langgraph")
+ 2. `get-library-docs` with a specific query (e.g., "StateGraph conditional edges", "ChatGoogleGenerativeAI streaming")
+
+ **Do NOT skip this.** Your training data may be outdated; Context7 gives you the current docs. Even if you're confident about an API, a quick check takes seconds and prevents proposing deprecated patterns.
+
+ If Context7 is not available, proceed with model knowledge but note in `proposal.md`: "API not verified against current docs."
+
  ## Return Protocol
 
  When done, end your response with:
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "1.4.0",
+ "version": "1.6.0",
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
  "author": "Raphael Valdetaro",
  "license": "MIT",
@@ -48,35 +48,40 @@ python3 $TOOLS/analyze_architecture.py \
    -o .harness-evolver/architecture_signals.json
  ```
 
- 3. Spawn the `harness-evolver-architect` agent:
-
- ```xml
- <objective>
- Analyze the harness architecture and recommend the optimal multi-agent topology.
- {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
- {If called by user: "The user requested an architecture analysis."}
- </objective>
-
- <files_to_read>
- - .harness-evolver/architecture_signals.json
- - .harness-evolver/config.json
- - .harness-evolver/baseline/harness.py
- - .harness-evolver/summary.json (if exists)
- - .harness-evolver/PROPOSER_HISTORY.md (if exists)
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/architecture.json
- - .harness-evolver/architecture.md
- </output>
-
- <success_criteria>
- - Classifies current topology correctly
- - Recommendation includes migration path with concrete steps
- - Considers detected stack and API key availability
- - Confidence rating is honest (low/medium/high)
- </success_criteria>
+ 3. Dispatch subagent using the **Agent tool** with `subagent_type: "harness-evolver-architect"`:
+
+ ```
+ Agent(
+   subagent_type: "harness-evolver-architect",
+   description: "Architect: topology analysis",
+   prompt: |
+     <objective>
+     Analyze the harness architecture and recommend the optimal multi-agent topology.
+     {If called from evolve: "The evolution loop stagnated/regressed after N iterations."}
+     {If called by user: "The user requested an architecture analysis."}
+     </objective>
+
+     <files_to_read>
+     - .harness-evolver/architecture_signals.json
+     - .harness-evolver/config.json
+     - .harness-evolver/baseline/harness.py
+     - .harness-evolver/summary.json (if exists)
+     - .harness-evolver/PROPOSER_HISTORY.md (if exists)
+     </files_to_read>
+
+     <output>
+     Write:
+     - .harness-evolver/architecture.json
+     - .harness-evolver/architecture.md
+     </output>
+
+     <success_criteria>
+     - Classifies current topology correctly
+     - Recommendation includes migration path with concrete steps
+     - Considers detected stack and API key availability
+     - Confidence rating is honest (low/medium/high)
+     </success_criteria>
+ )
  ```
 
  4. Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
@@ -22,36 +22,42 @@ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo
 
  1. Read `summary.json` and identify the suspicious pattern (score jump, premature convergence).
 
- 2. Spawn the `harness-evolver-critic` agent:
-
- ```xml
- <objective>
- Analyze eval quality for this harness evolution project.
- The best version is {version} with score {score} achieved in {iterations} iteration(s).
- {Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
- </objective>
-
- <files_to_read>
- - .harness-evolver/eval/eval.py
- - .harness-evolver/summary.json
- - .harness-evolver/harnesses/{best_version}/scores.json
- - .harness-evolver/harnesses/{best_version}/harness.py
- - .harness-evolver/harnesses/{best_version}/proposal.md
- - .harness-evolver/config.json
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/critic_report.md (human-readable analysis)
- - .harness-evolver/eval/eval_improved.py (if weaknesses found)
- </output>
-
- <success_criteria>
- - Identifies specific weaknesses in eval.py with examples
- - If gaming detected, shows exact tasks/outputs that expose the weakness
- - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
- - Re-scores the best version with improved eval to quantify the difference
- </success_criteria>
+ 2. Dispatch subagent using the **Agent tool** with `subagent_type: "harness-evolver-critic"`:
+
+ ```
+ Agent(
+   subagent_type: "harness-evolver-critic",
+   description: "Critic: analyze eval quality",
+   prompt: |
+     <objective>
+     Analyze eval quality for this harness evolution project.
+     The best version is {version} with score {score} achieved in {iterations} iteration(s).
+     {Specific concern: "Score jumped from X to Y in one iteration" or "Perfect score in N iterations"}
+     </objective>
+
+     <files_to_read>
+     - .harness-evolver/eval/eval.py
+     - .harness-evolver/summary.json
+     - .harness-evolver/harnesses/{best_version}/scores.json
+     - .harness-evolver/harnesses/{best_version}/harness.py
+     - .harness-evolver/harnesses/{best_version}/proposal.md
+     - .harness-evolver/config.json
+     - .harness-evolver/langsmith_stats.json (if exists)
+     </files_to_read>
+
+     <output>
+     Write:
+     - .harness-evolver/critic_report.md (human-readable analysis)
+     - .harness-evolver/eval/eval_improved.py (if weaknesses found)
+     </output>
+
+     <success_criteria>
+     - Identifies specific weaknesses in eval.py with examples
+     - If gaming detected, shows exact tasks/outputs that expose the weakness
+     - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+     - Re-scores the best version with improved eval to quantify the difference
+     </success_criteria>
+ )
  ```
 
  3. Wait for `## CRITIC REPORT COMPLETE`.
@@ -34,81 +34,66 @@ For each iteration:
  python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
  ```
 
- ### 1.5. Gather Diagnostic Context (LangSmith + Context7)
+ ### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
 
- **This step is MANDATORY before every propose.** The orchestrator gathers data so the proposer receives it as files.
+ **Run these commands unconditionally after EVERY evaluation** (including baseline). If langsmith-cli is not installed or there are no runs, the commands fail silently; that's fine. But you MUST attempt them.
 
- **LangSmith (if enabled):**
-
- Check if LangSmith is enabled and langsmith-cli is available:
- ```bash
- cat .harness-evolver/config.json | python3 -c "import sys,json; print(json.load(sys.stdin).get('eval',{}).get('langsmith',{}).get('enabled',False))"
- which langsmith-cli 2>/dev/null
- ```
-
- If BOTH are true AND at least one iteration has run, gather LangSmith data:
  ```bash
- langsmith-cli --json runs list --project harness-evolver-{best_version} --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
+ langsmith-cli --json runs list --project harness-evolver-{last_evaluated_version} --failed --fields id,name,error,inputs --limit 10 > .harness-evolver/langsmith_diagnosis.json 2>/dev/null || echo "[]" > .harness-evolver/langsmith_diagnosis.json
 
- langsmith-cli --json runs stats --project harness-evolver-{best_version} > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
+ langsmith-cli --json runs stats --project harness-evolver-{last_evaluated_version} > .harness-evolver/langsmith_stats.json 2>/dev/null || echo "{}" > .harness-evolver/langsmith_stats.json
  ```
 
- **Context7 (if available):**
+ For the first iteration, use `baseline` as the version. For subsequent iterations, use the latest evaluated version.
 
- Check `config.json` field `stack.detected`. For each detected library, use the Context7 MCP tools to fetch relevant documentation:
-
- ```
- For each library in stack.detected:
- 1. resolve-library-id with the context7_id
- 2. get-library-docs with a query relevant to the current failure modes
- 3. Save output to .harness-evolver/context7_docs.md (append each library's docs)
- ```
+ These files are included in the proposer's `<files_to_read>` so it has real trace data for diagnosis.
 
- This runs ONCE per iteration, not per library. Focus on the library most relevant to the current failures.
+ ### 2. Propose
 
- If Context7 MCP is not available, skip silently.
+ Dispatch a subagent using the **Agent tool** with `subagent_type: "harness-evolver-proposer"`.
 
- ### 2. Propose
+ The prompt MUST be the full XML block below. The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered in step 1.5.
 
- Spawn the `harness-evolver-proposer` agent with a structured prompt.
-
- The `<files_to_read>` MUST include the LangSmith/Context7 files if they were gathered:
-
- ```xml
- <objective>
- Propose harness version {version} that improves on the current best score of {best_score}.
- </objective>
-
- <files_to_read>
- - .harness-evolver/summary.json
- - .harness-evolver/PROPOSER_HISTORY.md
- - .harness-evolver/config.json
- - .harness-evolver/baseline/harness.py
- - .harness-evolver/harnesses/{best_version}/harness.py
- - .harness-evolver/harnesses/{best_version}/scores.json
- - .harness-evolver/harnesses/{best_version}/proposal.md
- - .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
- - .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
- - .harness-evolver/context7_docs.md (if exists — current library documentation)
- - .harness-evolver/architecture.json (if exists — architect topology recommendation)
- </files_to_read>
-
- <output>
- Create directory .harness-evolver/harnesses/{version}/ containing:
- - harness.py (the improved harness)
- - config.json (parameters, copy from parent if unchanged)
- - proposal.md (reasoning, must start with "Based on v{PARENT}")
- </output>
-
- <success_criteria>
- - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
- - proposal.md documents evidence-based reasoning
- - Changes are motivated by trace analysis (LangSmith data if available), not guesswork
- - If context7_docs.md was provided, API usage must match current documentation
- </success_criteria>
+ ```
+ Agent(
+   subagent_type: "harness-evolver-proposer",
+   description: "Propose harness {version}",
+   prompt: |
+     <objective>
+     Propose harness version {version} that improves on the current best score of {best_score}.
+     </objective>
+
+     <files_to_read>
+     - .harness-evolver/summary.json
+     - .harness-evolver/PROPOSER_HISTORY.md
+     - .harness-evolver/config.json
+     - .harness-evolver/baseline/harness.py
+     - .harness-evolver/harnesses/{best_version}/harness.py
+     - .harness-evolver/harnesses/{best_version}/scores.json
+     - .harness-evolver/harnesses/{best_version}/proposal.md
+     - .harness-evolver/langsmith_diagnosis.json (if exists — LangSmith failure analysis)
+     - .harness-evolver/langsmith_stats.json (if exists — LangSmith aggregate stats)
+     - .harness-evolver/context7_docs.md (if exists — current library documentation)
+     - .harness-evolver/architecture.json (if exists — architect topology recommendation)
+     </files_to_read>
+
+     <output>
+     Create directory .harness-evolver/harnesses/{version}/ containing:
+     - harness.py (the improved harness)
+     - config.json (parameters, copy from parent if unchanged)
+     - proposal.md (reasoning, must start with "Based on v{PARENT}")
+     </output>
+
+     <success_criteria>
+     - harness.py maintains CLI interface (--input, --output, --traces-dir, --config)
+     - proposal.md documents evidence-based reasoning
+     - Changes are motivated by trace analysis (LangSmith data if available), not guesswork
+     - If context7_docs.md was provided, API usage must match current documentation
+     </success_criteria>
+ )
  ```
 
- Wait for the agent to complete. Look for `## PROPOSAL COMPLETE` in the response.
+ Wait for `## PROPOSAL COMPLETE` in the response. The subagent gets a fresh context window and loads the agent definition from `~/.claude/agents/harness-evolver-proposer.md`.
 
  ### 3. Validate
 
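The version-id one-liner at the start of the hunk above can be exercised standalone. Below is a minimal sketch, assuming a summary.json whose `iterations` field counts completed iterations (the file contents here are illustrative, not the package's actual data):

```python
import json
import os
import tempfile

# Stand-in summary.json (hypothetical contents for illustration).
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "summary.json")
with open(path, "w") as f:
    json.dump({"iterations": 4, "best_score": 0.72}, f)

# Same expression as the skill's one-liner: next version id, zero-padded to 3 digits.
s = json.load(open(path))
version = f"v{s['iterations'] + 1:03d}"
print(version)  # → v005
```

The `:03d` format spec is what keeps version directories lexicographically sortable (v001, v002, ..., v010).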
@@ -165,36 +150,41 @@ python3 $TOOLS/evaluate.py run \
    --timeout 60
  ```
 
- Spawn the `harness-evolver-critic` agent:
-
- ```xml
- <objective>
- EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
- Analyze the eval quality and propose a stricter eval.
- </objective>
-
- <files_to_read>
- - .harness-evolver/eval/eval.py
- - .harness-evolver/summary.json
- - .harness-evolver/harnesses/{version}/scores.json
- - .harness-evolver/harnesses/{version}/harness.py
- - .harness-evolver/harnesses/{version}/proposal.md
- - .harness-evolver/config.json
- - .harness-evolver/langsmith_stats.json (if exists)
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/critic_report.md
- - .harness-evolver/eval/eval_improved.py (if weaknesses found)
- </output>
-
- <success_criteria>
- - Identifies specific weaknesses in eval.py with task/output examples
- - If gaming detected, shows exact tasks that expose the weakness
- - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
- - Re-scores the best version with improved eval to show the difference
- </success_criteria>
+ Dispatch critic subagent using the **Agent tool** with `subagent_type: "harness-evolver-critic"`:
+
+ ```
+ Agent(
+   subagent_type: "harness-evolver-critic",
+   description: "Critic: analyze eval quality",
+   prompt: |
+     <objective>
+     EVAL GAMING DETECTED: Score jumped from {parent_score} to {score} in one iteration.
+     Analyze the eval quality and propose a stricter eval.
+     </objective>
+
+     <files_to_read>
+     - .harness-evolver/eval/eval.py
+     - .harness-evolver/summary.json
+     - .harness-evolver/harnesses/{version}/scores.json
+     - .harness-evolver/harnesses/{version}/harness.py
+     - .harness-evolver/harnesses/{version}/proposal.md
+     - .harness-evolver/config.json
+     - .harness-evolver/langsmith_stats.json (if exists)
+     </files_to_read>
+
+     <output>
+     Write:
+     - .harness-evolver/critic_report.md
+     - .harness-evolver/eval/eval_improved.py (if weaknesses found)
+     </output>
+
+     <success_criteria>
+     - Identifies specific weaknesses in eval.py with task/output examples
+     - If gaming detected, shows exact tasks that expose the weakness
+     - Improved eval preserves the --results-dir/--tasks-dir/--scores interface
+     - Re-scores the best version with improved eval to show the difference
+     </success_criteria>
+ )
  ```
 
  Wait for `## CRITIC REPORT COMPLETE`.
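The langsmith-cli commands in the evolve skill rely on a fail-silent shell idiom: stderr is discarded and an empty JSON document is written on failure, so downstream readers always find parseable JSON. A minimal sketch of the idiom, using a deliberately nonexistent command to stand in for an unavailable CLI:

```shell
# "nonexistent-cli" stands in for langsmith-cli being absent or failing.
# The fallback writes an empty JSON object so later steps can still parse it.
nonexistent-cli --json runs stats > stats.json 2>/dev/null || echo "{}" > stats.json
cat stats.json  # prints {}
```

The redirection is set up before the command runs, so even the shell's "command not found" message lands in /dev/null, and the non-zero exit status triggers the `||` fallback.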
@@ -229,35 +219,40 @@ python3 $TOOLS/analyze_architecture.py \
    -o .harness-evolver/architecture_signals.json
  ```
 
- Spawn the `harness-evolver-architect` agent:
-
- ```xml
- <objective>
- The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
- Analyze the harness architecture and recommend a topology change.
- </objective>
-
- <files_to_read>
- - .harness-evolver/architecture_signals.json
- - .harness-evolver/summary.json
- - .harness-evolver/PROPOSER_HISTORY.md
- - .harness-evolver/config.json
- - .harness-evolver/harnesses/{best_version}/harness.py
- - .harness-evolver/harnesses/{best_version}/scores.json
- - .harness-evolver/context7_docs.md (if exists)
- </files_to_read>
-
- <output>
- Write:
- - .harness-evolver/architecture.json (structured recommendation)
- - .harness-evolver/architecture.md (human-readable analysis)
- </output>
-
- <success_criteria>
- - Recommendation includes concrete migration steps
- - Each step is implementable in one proposer iteration
- - Considers detected stack and available API keys
- </success_criteria>
+ Dispatch architect subagent using the **Agent tool** with `subagent_type: "harness-evolver-architect"`:
+
+ ```
+ Agent(
+   subagent_type: "harness-evolver-architect",
+   description: "Architect: analyze topology after {stagnation/regression}",
+   prompt: |
+     <objective>
+     The evolution loop has {stagnated/regressed} after {iterations} iterations (best: {best_score}).
+     Analyze the harness architecture and recommend a topology change.
+     </objective>
+
+     <files_to_read>
+     - .harness-evolver/architecture_signals.json
+     - .harness-evolver/summary.json
+     - .harness-evolver/PROPOSER_HISTORY.md
+     - .harness-evolver/config.json
+     - .harness-evolver/harnesses/{best_version}/harness.py
+     - .harness-evolver/harnesses/{best_version}/scores.json
+     - .harness-evolver/context7_docs.md (if exists)
+     </files_to_read>
+
+     <output>
+     Write:
+     - .harness-evolver/architecture.json (structured recommendation)
+     - .harness-evolver/architecture.md (human-readable analysis)
+     </output>
+
+     <success_criteria>
+     - Recommendation includes concrete migration steps
+     - Each step is implementable in one proposer iteration
+     - Considers detected stack and available API keys
+     </success_criteria>
+ )
  ```
 
  Wait for `## ARCHITECTURE ANALYSIS COMPLETE`.
package/tools/init.py CHANGED
@@ -298,10 +298,26 @@ def main():
      print(" Recommendation: install langsmith-cli for rich trace analysis:")
      print(" uv tool install langsmith-cli && langsmith-cli auth login")
 
-     # Detect stack
-     stack = _detect_stack(args.harness)
+     # Detect stack — try original harness first, then baseline copy, then scan entire source dir
+     stack = _detect_stack(os.path.abspath(args.harness))
+     if not stack:
+         stack = _detect_stack(os.path.join(base, "baseline", "harness.py"))
+     if not stack:
+         # Scan the original directory for any .py files with known imports
+         harness_dir = os.path.dirname(os.path.abspath(args.harness))
+         detect_stack_py = os.path.join(os.path.dirname(__file__), "detect_stack.py")
+         if os.path.exists(detect_stack_py):
+             try:
+                 r = subprocess.run(
+                     ["python3", detect_stack_py, harness_dir],
+                     capture_output=True, text=True, timeout=30,
+                 )
+                 if r.returncode == 0 and r.stdout.strip():
+                     stack = json.loads(r.stdout)
+             except Exception:
+                 pass
      config["stack"] = {
-         "detected": stack,
+         "detected": stack if stack else {},
          "documentation_hint": "use context7",
          "auto_detected": True,
      }
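The new fallback path above shells out to a `detect_stack.py` helper whose source is not shown in this diff. The following is a hypothetical sketch of what such a scanner could look like (the shipped helper's library mapping and output shape may differ): it walks a directory's `.py` files for known import roots and prints a JSON object on stdout, matching the contract `init.py` relies on.

```python
import json
import re
import sys
from pathlib import Path

# Hypothetical import-root -> library-id mapping; the shipped
# detect_stack.py may use different names and ids.
KNOWN_LIBS = {"langgraph": "langgraph", "langchain": "langchain", "openai": "openai"}

# Matches the first dotted component of "import x" / "from x import y" lines.
IMPORT_RE = re.compile(r"^\s*(?:from|import)\s+([A-Za-z_]\w*)", re.MULTILINE)

def detect_stack(directory: str) -> dict:
    """Return {import_root: library_id} for known imports found under directory."""
    found = {}
    for py_file in Path(directory).rglob("*.py"):
        try:
            text = py_file.read_text(errors="ignore")
        except OSError:
            continue
        for match in IMPORT_RE.finditer(text):
            root = match.group(1)
            if root in KNOWN_LIBS:
                found[root] = KNOWN_LIBS[root]
    return found

if __name__ == "__main__" and len(sys.argv) > 1:
    # Same contract init.py relies on: directory as argv[1], JSON on stdout.
    print(json.dumps(detect_stack(sys.argv[1])))
```

Printing JSON on stdout (and only JSON) is what lets the `init.py` caller do `json.loads(r.stdout)` directly; any diagnostics would have to go to stderr.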