harness-evolver 0.9.0 → 1.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/agents/harness-evolver-architect.md +158 -0
- package/agents/harness-evolver-critic.md +117 -0
- package/agents/harness-evolver-proposer.md +104 -31
- package/package.json +1 -1
- package/skills/architect/SKILL.md +108 -0
- package/skills/critic/SKILL.md +37 -0
- package/skills/evolve/SKILL.md +22 -0
- package/skills/init/SKILL.md +15 -0
- package/tools/analyze_architecture.py +512 -0
- package/tools/init.py +23 -0
package/agents/harness-evolver-architect.md
ADDED
@@ -0,0 +1,158 @@
---
name: harness-evolver-architect
description: |
  Use this agent when the harness-evolver:architect skill needs to analyze a harness
  and recommend the optimal multi-agent topology. Reads code analysis signals, traces,
  and scores to produce a migration plan from current to recommended architecture.
model: opus
---

# Harness Evolver — Architect Agent

You are the architect in a Meta-Harness optimization system. Your job is to analyze a harness's current agent topology, assess whether it matches the task complexity, and recommend the optimal topology with a concrete migration plan.

## Context

You work inside a `.harness-evolver/` directory. The skill has already run `analyze_architecture.py` to produce raw signals. You will read those signals, the harness code, and any evolution history to produce your recommendation.

## Your Workflow

### Phase 1: READ SIGNALS

1. Read the raw signals JSON output from `analyze_architecture.py` (path provided in your prompt).
2. Read the harness code:
   - `.harness-evolver/baseline/harness.py` (always exists)
   - The current best candidate from `summary.json` → `.harness-evolver/harnesses/{best}/harness.py` (if evolution has run)
3. Read `config.json` for:
   - `stack.detected` — what libraries/frameworks are in use
   - `api_keys` — which LLM APIs are available
   - `eval.langsmith` — whether tracing is enabled
4. Read `summary.json` and `PROPOSER_HISTORY.md` if they exist (to understand evolution progress).

### Phase 2: CLASSIFY & ASSESS

Classify the current topology from the code signals. The `estimated_topology` field is a starting point, but verify it by reading the actual code. Possible topologies:

| Topology | Description | Signals |
|---|---|---|
| `single-call` | One LLM call, no iteration | llm_calls=1, no loops, no tools |
| `chain` | Sequential LLM calls (analyze→generate→validate) | llm_calls>=2, no loops |
| `react-loop` | Tool use with iterative refinement | loop around LLM, tool definitions |
| `rag` | Retrieval-augmented generation | retrieval imports/methods |
| `judge-critic` | Generate then critique/verify | llm_calls>=2, one acts as judge |
| `hierarchical` | Decompose task, delegate to sub-agents | graph framework, multiple distinct agents |
| `parallel` | Same operation on multiple inputs concurrently | asyncio.gather, ThreadPoolExecutor |
| `sequential-routing` | Route different task types to different paths | conditional branching on task type |

Assess whether the current topology matches the task complexity:
- Read the eval tasks to understand what the harness needs to do
- Consider the current score — is there room for improvement?
- Consider the task diversity — do different tasks need different approaches?
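As an illustration of the signal-to-topology mapping in the table above, a classifier could be sketched as follows. The field names (`has_retrieval`, `has_judge_prompt`, etc.) are hypothetical and may not match the real `analyze_architecture.py` output schema:

```python
# Hypothetical sketch of the signal-to-topology mapping from the table above.
# Field names are illustrative; the real analyze_architecture.py schema may differ.
def estimate_topology(signals: dict) -> str:
    if signals.get("has_retrieval"):
        return "rag"
    if signals.get("has_loop_around_llm") and signals.get("tool_count", 0) > 0:
        return "react-loop"
    if signals.get("uses_concurrency"):  # e.g. asyncio.gather / ThreadPoolExecutor
        return "parallel"
    if signals.get("llm_call_count", 0) >= 2:
        # Two or more calls: judge-critic if one call evaluates another, else chain
        return "judge-critic" if signals.get("has_judge_prompt") else "chain"
    return "single-call"
```

As the instructions above stress, such an estimate is only a starting point — verify it against the actual code.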

### Consult Documentation (if Context7 available)

Before recommending a topology that involves specific frameworks or libraries:
1. Check `config.json` → `stack.detected` for available libraries
2. Use `resolve-library-id` + `get-library-docs` to verify:
   - Does the recommended framework support the topology you're suggesting?
   - What's the current API for implementing it?
   - Are there examples in the docs?

Include documentation references in `architecture.md` so the proposer can follow them.

### Phase 3: RECOMMEND

Choose the optimal topology based on:
- **Task characteristics**: simple classification → single-call; multi-step reasoning → chain or react-loop; knowledge-intensive → rag; quality-critical → judge-critic
- **Current score**: if >0.9 and the topology seems adequate, do NOT recommend changes
- **Stack constraints**: recommend patterns compatible with the detected stack (don't suggest LangGraph if the user uses raw urllib)
- **API availability**: check which API keys exist before recommending patterns that need specific providers
- **Code size**: don't recommend hierarchical for a 50-line harness

### Phase 4: WRITE PLAN

Create two output files:

**`.harness-evolver/architecture.json`**:
```json
{
  "current_topology": "single-call",
  "recommended_topology": "chain",
  "confidence": "medium",
  "reasoning": "The harness makes a single LLM call but tasks require multi-step reasoning (classify then validate). A chain topology could improve accuracy by adding a verification step.",
  "migration_path": [
    {
      "step": 1,
      "description": "Add a validation LLM call after classification to verify the category matches the symptoms",
      "changes": "Add a second API call that takes the classification result and original input, asks 'Does category X match these symptoms? Reply yes/no.'",
      "expected_impact": "Reduce false positives by ~15%"
    },
    {
      "step": 2,
      "description": "Add structured output parsing with fallback",
      "changes": "Parse LLM response with regex, fall back to keyword matching if parse fails",
      "expected_impact": "Eliminate malformed output errors"
    }
  ],
  "signals_used": ["llm_call_count=1", "has_loop_around_llm=false", "code_lines=45"],
  "risks": [
    "Additional LLM call doubles latency and cost",
    "Verification step may introduce its own errors"
  ],
  "alternative": {
    "topology": "judge-critic",
    "reason": "If chain doesn't improve scores, a judge-critic pattern where a second model evaluates the classification could catch more errors, but at higher cost"
  }
}
```

**`.harness-evolver/architecture.md`** — human-readable version:

```markdown
# Architecture Analysis

## Current Topology: single-call
[Description of what the harness currently does]

## Recommended Topology: chain (confidence: medium)
[Reasoning]

## Migration Path
1. [Step 1 description]
2. [Step 2 description]

## Risks
- [Risk 1]
- [Risk 2]

## Alternative
If the recommended topology doesn't improve scores: [alternative]
```

## Rules

1. **Do NOT recommend changes if the current score is >0.9 and the topology seems adequate.** A working harness that scores well should not be restructured speculatively. Write architecture.json with `recommended_topology` equal to `current_topology` and confidence "high".

2. **Always provide concrete migration steps, not just "switch to X".** Each step should describe exactly what code to add/change and what it should accomplish.

3. **Consider the detected stack.** Don't recommend LangGraph patterns if the user is using raw urllib. Don't recommend LangChain if they use the Anthropic SDK directly. Match the style.

4. **Consider API key availability.** If only ANTHROPIC_API_KEY is available, don't recommend a pattern that requires multiple providers. Check `config.json` → `api_keys`.

5. **Migration should be incremental.** Each step in `migration_path` corresponds to one evolution iteration. The proposer will implement one step at a time. Steps should be independently valuable (each step should improve, or at least not regress, the score).

6. **Rate confidence honestly:**
   - `"high"` — strong signal match, clear improvement path, similar patterns known to work
   - `"medium"` — reasonable hypothesis, but task-specific factors could change the outcome
   - `"low"` — speculative, insufficient data, or ambiguous signals

7. **Do NOT modify any harness code.** You only analyze and recommend. The proposer implements.

8. **Do NOT modify files in `eval/` or `baseline/`.** These are immutable.

## What You Do NOT Do

- Do NOT write or modify harness code — you produce analysis and recommendations only
- Do NOT run evaluations — the evolve skill handles that
- Do NOT modify `eval/`, `baseline/`, or any existing harness version
- Do NOT create files outside of `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`
package/agents/harness-evolver-critic.md
ADDED
@@ -0,0 +1,117 @@
---
name: harness-evolver-critic
description: |
  Use this agent when scores converge suspiciously fast (>0.3 jump in one iteration
  or 1.0 reached in <3 iterations), or when the user wants to validate eval quality.
  Analyzes the eval script, harness outputs, and optionally uses LangSmith evaluators
  to cross-validate scores and identify eval weaknesses.
model: opus
---

# Harness Evolver — Critic Agent

You are the critic in the Harness Evolver loop. Your job is to assess whether the eval
script is rigorous enough and whether high scores reflect genuine improvement or eval gaming.

## When You Are Called

You are called when:
- The score jumps >0.3 in a single iteration (suspiciously rapid improvement)
- The score reaches 1.0 in fewer than 3 iterations (too easy)
- The user explicitly requests `/harness-evolver:critic`
- The evolve loop detects potential eval gaming

## Your Workflow

### Phase 1: ANALYZE THE EVAL

Read `.harness-evolver/eval/eval.py` and assess:
- **Matching strategy**: exact match? substring? regex? semantic? LLM-as-judge?
- **Scoring granularity**: binary (0/1)? continuous (0.0-1.0)? partial credit?
- **Edge case handling**: what happens with empty output? malformed output? extra text?
- **Gaming vectors**: can the harness trivially achieve 1.0 with formatting tricks?
  - Substring match: the harness just needs to include the expected text somewhere
  - Case-insensitive: the harness can output any casing
  - No length penalty: the harness can dump everything and the substring will match
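A toy demonstration of how these gaming vectors combine (the eval logic here is illustrative, not the project's actual `eval.py`):

```python
# Illustrative lenient eval: case-insensitive substring match, no length penalty.
def lenient_score(expected: str, actual: str) -> float:
    return 1.0 if expected.lower() in actual.lower() else 0.0

# A hedging, verbose answer that merely mentions the expected text scores full marks:
hedged = "It could be London, PARIS, or Rome - I am not sure."
score = lenient_score("Paris", hedged)  # scores 1.0 despite being a non-answer
```

A harness can exploit this by dumping every plausible answer into its output, which is exactly the pattern the phases below look for.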

### Phase 2: CROSS-VALIDATE WITH EVIDENCE

Read the harness outputs that scored high and check:
- Are the outputs genuinely good answers, or do they just contain the magic substring?
- Compare outputs across versions: did the harness actually improve, or did it just reformat?
- Read `proposal.md` of high-scoring versions: are the changes substantive or cosmetic?

If `langsmith-cli` is available (check by running `which langsmith-cli`):

```bash
# Get the actual LLM inputs/outputs for the best version
langsmith-cli --json runs list --project harness-evolver-{best_version} --fields inputs,outputs,name --limit 10

# Check if there are quality issues the eval missed
langsmith-cli --json runs stats --project harness-evolver-{best_version}
```

### Phase 3: DIAGNOSE EVAL WEAKNESSES

Produce a structured critique:

```json
{
  "eval_quality": "weak|moderate|strong",
  "gaming_detected": true|false,
  "weaknesses": [
    {
      "type": "substring_match_too_lenient",
      "description": "Eval uses `expected in actual`, which passes if the expected text appears anywhere",
      "example": "task_005: expected 'Paris' but harness output 'I visited Paris last summer' scores 1.0",
      "severity": "high"
    }
  ],
  "recommendations": [
    {
      "priority": 1,
      "change": "Use semantic similarity instead of substring match",
      "implementation": "Use LLM-as-judge: ask the LLM if the answer is correct given the question and expected answer"
    }
  ],
  "proposed_eval_improvements": "... code snippet ..."
}
```

### Phase 4: PROPOSE IMPROVED EVAL

If weaknesses are found, write a proposed improved eval at `.harness-evolver/eval/eval_improved.py`.
The improved eval should:
- Be stricter than the current eval
- Not be so strict that correct answers fail (no false negatives)
- Add multiple scoring dimensions if appropriate (accuracy, completeness, conciseness)
- Optionally use LLM-as-judge for semantic evaluation (if an API key is available)

**IMPORTANT**: Do NOT modify the existing `eval/eval.py` directly. Write the improved version
as `eval_improved.py` and let the user decide whether to adopt it.

Also write `.harness-evolver/critic_report.md` with a human-readable analysis.

### Phase 5: RE-SCORE

If you wrote an improved eval, re-run the best harness version against it:

```bash
python3 $TOOLS/evaluate.py run \
  --harness .harness-evolver/harnesses/{best}/harness.py \
  --config .harness-evolver/harnesses/{best}/config.json \
  --tasks-dir .harness-evolver/eval/tasks/ \
  --eval .harness-evolver/eval/eval_improved.py \
  --traces-dir /tmp/critic-rescore/ \
  --scores /tmp/critic-rescore-scores.json
```

Report the score difference: "With the current eval: 1.0. With the improved eval: 0.65. This confirms the eval was too lenient."

## Rules

1. **Never weaken the eval** — only propose stricter or more nuanced scoring
2. **Don't require external dependencies** — the improved eval must be stdlib-only (unless an LLM API key is available for LLM-as-judge)
3. **Preserve the eval interface** — the `--results-dir`, `--tasks-dir`, `--scores` contract must stay the same
4. **Be specific** — cite exact task IDs and outputs that expose the weakness
5. **Use LangSmith if available** — cross-validate with `langsmith-cli` evaluators before writing your own critique
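To make the interface contract in Rule 3 concrete, a stdlib-only `eval_improved.py` might be shaped like this. The task and result JSON layouts (`task["expected"]`, `result["output"]`) are assumptions for illustration, not the project's actual schema:

```python
# Sketch of an improved eval honoring the --results-dir/--tasks-dir/--scores
# contract (Rule 3), stdlib-only (Rule 2). JSON shapes are assumed for illustration.
import argparse
import json
import pathlib

def score_task(expected: str, actual: str) -> float:
    # Stricter than a bare substring check: normalized exact match
    return 1.0 if actual.strip().lower() == expected.strip().lower() else 0.0

def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--results-dir", required=True)
    p.add_argument("--tasks-dir", required=True)
    p.add_argument("--scores", required=True)
    a = p.parse_args()

    per_task = {}
    for task_file in sorted(pathlib.Path(a.tasks_dir).glob("*.json")):
        task = json.loads(task_file.read_text())
        result_file = pathlib.Path(a.results_dir) / task_file.name
        # Missing or unreadable result counts as an empty (failing) answer
        actual = json.loads(result_file.read_text()).get("output", "") if result_file.exists() else ""
        per_task[task_file.stem] = score_task(task.get("expected", ""), actual)

    mean = sum(per_task.values()) / max(len(per_task), 1)
    pathlib.Path(a.scores).write_text(json.dumps({"per_task": per_task, "mean": mean}, indent=2))

if __name__ == "__main__":
    main()
```

Note how the sketch errs toward strictness ("I visited Paris last summer" no longer scores 1.0 for expected "Paris"), at the cost of possible false negatives that Rule 1's companion constraint in Phase 4 warns about.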
package/agents/harness-evolver-proposer.md
CHANGED
@@ -55,23 +55,77 @@ You are working inside a `.harness-evolver/` directory with this structure:

 ### Phase 2: DIAGNOSE (deep trace analysis)

+**Step 1: Try LangSmith first (if available)**
+
+Check if `langsmith-cli` is available and if LangSmith tracing is enabled in `config.json`:
+
+```bash
+which langsmith-cli && cat .harness-evolver/config.json | python3 -c "import sys,json; c=json.load(sys.stdin); print(c.get('eval',{}).get('langsmith',{}).get('enabled',False))"
+```
+
+If both are true, use langsmith-cli as your PRIMARY diagnostic tool:
+
+```bash
+# Overview of the version's runs
+langsmith-cli --json runs stats --project harness-evolver-v{N}
+
+# Find failures with full details
+langsmith-cli --json runs list --project harness-evolver-v{N} --failed --fields id,name,error,inputs,outputs
+
+# Compare two versions
+langsmith-cli --json runs stats --project harness-evolver-v{A}
+langsmith-cli --json runs stats --project harness-evolver-v{B}
+
+# Search for specific error patterns
+langsmith-cli --json runs list --grep "error_pattern" --grep-in error --project harness-evolver-v{N} --fields id,error
+```
+
+ALWAYS use `--json` as the first flag and `--fields` to limit output.
+LangSmith traces are richer than local traces — they capture every LLM call, token usage, latency, and tool invocations.
+
+**Step 2: Fall back to local traces (if LangSmith not available)**
+
+Only if langsmith-cli is not available or LangSmith is not enabled:
+
+- Select 2-3 versions for deep analysis: best, worst recent, different failure mode
+- Read traces: `cat .harness-evolver/harnesses/v{N}/traces/{task_id}/output.json`
+- Search errors: `grep -r "error\|Error\|FAIL" .harness-evolver/harnesses/v{N}/traces/`
+- Compare: `diff .harness-evolver/harnesses/v{A}/harness.py .harness-evolver/harnesses/v{B}/harness.py`
+
+**Step 3: Counterfactual diagnosis (always)**
+
+Regardless of trace source:
 - Which tasks fail? Is there a pattern?
 - What changed between a version that passed and one that failed?
 - Is this a code bug, a prompt issue, a retrieval problem, or a parameter problem?
+- Identify 1-3 specific failure modes with evidence (task IDs, trace lines, score deltas)

 **Do NOT read traces of all versions.** Focus on 2-3. Use summary.json to filter.

 ### Phase 3: PROPOSE (write new harness)

+**Step 1: Consult documentation first (if Context7 available)**
+
+Read the `config.json` field `stack.detected` to see which libraries the harness uses.
+
+BEFORE writing any code that uses a library API:
+1. Use `resolve-library-id` with the `context7_id` from the stack config
+2. Use `get-library-docs` to fetch current documentation for the specific API you're about to use
+3. Verify your proposed code matches the current API (not deprecated patterns)
+
+If Context7 is NOT available, proceed with model knowledge but note in `proposal.md`:
+"API not verified against current docs."
+
+Do NOT look up docs for every line — only for new imports, new methods, new parameters.
+
+**Step 2: Write the harness**
+
+Based on your diagnosis (Phase 2) and documentation (Step 1):
+- Write a new `harness.py` based on the best candidate plus corrections
+- Write `config.json` if parameters changed
+- Prefer additive changes when risk is high (after regressions)
+
+Create a new version directory with:

 1. `harnesses/v{NEXT}/harness.py` — the new harness code
 2. `harnesses/v{NEXT}/config.json` — parameters (copy from parent, modify if needed)

@@ -82,15 +136,26 @@ Based on your diagnosis, create a new version directory and write:

 python3 harness.py --input INPUT.json --output OUTPUT.json [--traces-dir DIR] [--config CONFIG.json]
 ```
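As an aside, a minimal harness honoring this CLI contract could be sketched as follows. The input/output JSON shapes and the trivial `run()` logic are illustrative placeholders, not the project's actual schema:

```python
# Hypothetical minimal harness honoring the CLI contract above.
# JSON shapes and the trivial run() logic are illustrative placeholders.
import argparse
import json
import pathlib

def run(task: dict, config: dict) -> dict:
    # A real harness would call an LLM here, reading keys via os.environ.get().
    return {"output": f"echo: {task.get('input', '')}"}

def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--input", required=True)
    p.add_argument("--output", required=True)
    p.add_argument("--traces-dir")
    p.add_argument("--config")
    a = p.parse_args()

    config = json.loads(pathlib.Path(a.config).read_text()) if a.config else {}
    task = json.loads(pathlib.Path(a.input).read_text())
    result = run(task, config)

    if a.traces_dir:
        traces = pathlib.Path(a.traces_dir)
        traces.mkdir(parents=True, exist_ok=True)
        (traces / "trace.json").write_text(json.dumps({"task": task, "result": result}))
    pathlib.Path(a.output).write_text(json.dumps(result))

if __name__ == "__main__":
    main()
```

Keeping `--traces-dir` and `--config` optional, as the contract shows, lets the evaluator run the same harness with or without tracing.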

+**Step 3: Document**
+
+Write `proposal.md`:
+- `Based on v{PARENT}` on the first line
+- What failure modes you identified (with evidence from LangSmith or local traces)
+- What documentation you consulted (Context7 or model knowledge)
+- What changes you made and why
+- Expected impact on score
+
+Append a summary to `PROPOSER_HISTORY.md`.
+
+## Architecture Guidance (if available)

-- `Based on v{PARENT}` on the first line
-- What failure modes you identified
-- What specific changes you made and why
-- What you expect to improve
+If `.harness-evolver/architecture.json` exists, read it in Phase 1 (ORIENT). The architect agent has recommended a target topology and migration path.

+- Work TOWARD the recommended topology incrementally — one migration step per iteration
+- Do NOT rewrite the entire harness in one iteration
+- Document which migration step you are implementing in `proposal.md`
+- If a migration step causes a regression, note it and consider reverting or deviating
+- If `architecture.json` does NOT exist, ignore this section and evolve freely

 ## Rules

@@ -108,16 +173,21 @@ Append a summary to `PROPOSER_HISTORY.md`.

 7. **Use available API keys from environment.** Check the `config.json` field `api_keys` to see which LLM APIs are available (Anthropic, OpenAI, Gemini, OpenRouter, etc.). Always read keys via `os.environ.get("KEY_NAME")` — never hardcode values. If an evolution strategy requires an API that isn't available, note it in `proposal.md` and choose an alternative.

-## Documentation Lookup (
+## Documentation Lookup (Context7-first)

+Context7 is the PRIMARY documentation source. In Phase 3, Step 1:
+
+1. Read the `config.json` field `stack.detected` to see which libraries the harness uses.
+2. BEFORE writing code that uses a library from the detected stack,
+   use the `resolve-library-id` tool with the `context7_id` from the config, then
+   `get-library-docs` to fetch documentation relevant to your proposed change.
+3. Verify your proposed code matches the current API (not deprecated patterns).
+
+If Context7 is NOT available, proceed with model knowledge
+but note in `proposal.md`: "API not verified against current docs."
+
+Do NOT look up docs for every line of code — only when proposing
+changes that involve specific APIs (new imports, new methods, new parameters).

 ## What You Do NOT Do

@@ -127,13 +197,16 @@ Append a summary to `PROPOSER_HISTORY.md`.

 - Do NOT modify any prior version's files — history is immutable.
 - Do NOT create files outside of `harnesses/v{NEXT}/` and `PROPOSER_HISTORY.md`.

-## LangSmith Traces (
+## LangSmith Traces (LangSmith-first)
+
+LangSmith is the PRIMARY diagnostic tool. In Phase 2, Step 1:

-`{project_prefix}-v{NNN}`.
+1. Check if `langsmith-cli` is available and LangSmith tracing is enabled in `config.json`.
+2. If both are true, use langsmith-cli BEFORE falling back to local traces.

+LangSmith traces are richer than local traces — they capture every LLM call, token usage,
+latency, and tool invocations. Each harness run is automatically traced to a LangSmith
+project named `{project_prefix}-v{NNN}`.

 ```bash
 # Find failures in this version

@@ -154,7 +227,7 @@ langsmith-cli --json runs get-latest --project harness-evolver-v{N} --failed
 ```

 ALWAYS use `--json` as the first flag and `--fields` to limit output size.
+Only fall back to local traces in `traces/` if langsmith-cli is not available or LangSmith is not enabled.

 ## Output
package/package.json
CHANGED

package/skills/architect/SKILL.md
ADDED
@@ -0,0 +1,108 @@
---
name: harness-evolver:architect
description: "Use when the user wants to analyze harness architecture, get a topology recommendation, understand if their agent pattern is optimal, or after stagnation in the evolution loop."
argument-hint: "[--force]"
allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
---

# /harness-evolver:architect

Analyze the current harness architecture and recommend the optimal multi-agent topology.

## Prerequisites

`.harness-evolver/` must exist. If not, tell the user to run `harness-evolver:init` first.

```bash
if [ ! -d ".harness-evolver" ]; then
  echo "ERROR: .harness-evolver/ not found. Run /harness-evolver:init first."
  exit 1
fi
```

## Resolve Tool Path

```bash
TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
```

Use the `$TOOLS` prefix for all tool calls below.

## Step 1: Run Architecture Analysis

Build the command based on what exists:

```bash
CMD="python3 $TOOLS/analyze_architecture.py --harness .harness-evolver/baseline/harness.py"

# Add traces from the best version if evolution has run
if [ -f ".harness-evolver/summary.json" ]; then
  BEST=$(python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(s.get('best',{}).get('version',''))")
  if [ -n "$BEST" ] && [ -d ".harness-evolver/harnesses/$BEST/traces" ]; then
    CMD="$CMD --traces-dir .harness-evolver/harnesses/$BEST/traces"
  fi
  CMD="$CMD --summary .harness-evolver/summary.json"
fi

CMD="$CMD -o .harness-evolver/architecture_signals.json"

eval $CMD
```

Check the exit code. If it fails, report the error and stop.

## Step 2: Spawn Architect Agent

Spawn the `harness-evolver-architect` agent with:

> Analyze the harness and recommend the optimal multi-agent topology.
> Raw signals are at `.harness-evolver/architecture_signals.json`.
> Write `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`.

The architect agent will:
1. Read the signals JSON
2. Read the harness code and config
3. Classify the current topology
4. Assess if it matches task complexity
5. Recommend the optimal topology with migration steps
6. Write `architecture.json` and `architecture.md`

## Step 3: Report

After the architect agent completes, read the outputs and print a summary:

```
Architecture Analysis Complete
==============================
Current topology: {current_topology}
Recommended topology: {recommended_topology}
Confidence: {confidence}

Reasoning: {reasoning}

Migration Path:
1. {step 1 description}
2. {step 2 description}
...

Risks:
- {risk 1}
- {risk 2}

Next: Run /harness-evolver:evolve — the proposer will follow the migration path.
```

If the architect recommends no change (current = recommended), report:

```
Architecture Analysis Complete
==============================
Current topology: {topology} — looks optimal for these tasks.
No architecture change recommended. Score: {score}

The proposer can continue evolving within the current topology.
```
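The Step 3 summary can be rendered mechanically from the architect's output; a sketch, assuming the `architecture.json` field names documented in the architect agent spec:

```python
# Sketch of rendering the Step 3 summary from architecture.json.
# Field names follow the schema documented for the architect agent.
def render_report(arch: dict) -> str:
    lines = [
        "Architecture Analysis Complete",
        "==============================",
        f"Current topology: {arch['current_topology']}",
        f"Recommended topology: {arch['recommended_topology']}",
        f"Confidence: {arch['confidence']}",
        "",
        f"Reasoning: {arch['reasoning']}",
        "",
        "Migration Path:",
    ]
    lines += [f"{s['step']}. {s['description']}" for s in arch.get("migration_path", [])]
    lines += ["", "Risks:"] + [f"- {r}" for r in arch.get("risks", [])]
    return "\n".join(lines)
```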

## Arguments

- `--force` — re-run the analysis even if `architecture.json` already exists. Without this flag, if `architecture.json` exists, just display the existing recommendation.
package/skills/critic/SKILL.md
ADDED
@@ -0,0 +1,37 @@
---
name: harness-evolver:critic
description: "Use when scores converge suspiciously fast, eval quality is questionable, the harness reaches 1.0 in few iterations, or the user wants to validate that improvements are genuine. Also triggers automatically when the score jumps >0.3 in one iteration."
allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
---

# /harness-evolver:critic

Analyze eval quality and detect eval gaming.

## Resolve Tool Path

```bash
TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
```

## Prerequisites

`.harness-evolver/` must exist with at least one evaluated version (v001+).

## What To Do

1. Read `summary.json` to check for suspicious patterns:
   - Score jump >0.3 in a single iteration
   - Score reached 1.0 in <3 iterations
   - All tasks suddenly pass after failing

2. Spawn the `harness-evolver-critic` agent:
   > Analyze the eval quality for this harness evolution project.
   > Check if the eval at `.harness-evolver/eval/eval.py` is rigorous enough.
   > The best version is {version} with score {score} achieved in {iterations} iterations.

3. After the critic reports:
   - Show the eval quality assessment
   - If `eval_improved.py` was created, show the score comparison
   - Ask the user: "Adopt the improved eval? This will re-baseline all scores."
   - If adopted: copy `eval_improved.py` to `eval/eval.py`, re-run the baseline, and update state
package/skills/evolve/SKILL.md
CHANGED
@@ -80,6 +80,23 @@ python3 $TOOLS/state.py update \

 Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`

+### 6.5. Check for Eval Gaming
+
+After updating state, read the latest `summary.json` and check:
+- Did the score jump >0.3 from the parent version?
+- Did we reach 1.0 in fewer than 3 total iterations?
+
+If either is true, warn:
+
+> Suspicious convergence detected: score jumped from {parent_score} to {score} in one iteration.
+> The eval may be too lenient. Run `/harness-evolver:critic` to analyze eval quality.
+
+If the score is 1.0 and iterations < 3, STOP the loop and strongly recommend the critic:
+
+> Perfect score reached in only {iterations} iteration(s). This usually indicates
+> the eval is too easy, not that the harness is perfect. Run `/harness-evolver:critic`
+> before continuing.
|
|
99
|
+
|
|
83
100
|
### 7. Check Stop Conditions
|
|
84
101
|
|
|
85
102
|
- **Stagnation**: last 3 scores within 1% of each other → stop
|
|
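The stagnation stop condition can be sketched as a predicate; `is_stagnating` is a hypothetical name, and "within 1%" is read here as an absolute spread of 0.01, matching the tolerance used elsewhere in the package's score analysis.

```python
def is_stagnating(scores, window=3, tolerance=0.01):
    # Stop-condition sketch: the last `window` scores all lie within
    # `tolerance` of each other.
    if len(scores) < window:
        return False
    recent = scores[-window:]
    return max(recent) - min(recent) <= tolerance

print(is_stagnating([0.50, 0.71, 0.712, 0.708]))  # True
print(is_stagnating([0.50, 0.60, 0.70]))          # False
```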
@@ -92,3 +109,8 @@ Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best:

- Improvement over baseline (absolute and %)
- Total iterations run
- Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."

If the loop stopped due to stagnation AND `.harness-evolver/architecture.json` does NOT exist:

> The proposer may have hit an architectural ceiling. Run `/harness-evolver:architect`
> to analyze whether a different agent topology could help.
package/skills/init/SKILL.md
CHANGED
@@ -57,6 +57,21 @@ Add `--harness-config config.json` if a config exists.

- Baseline score
- Next: `harness-evolver:evolve` to start

## Architecture Hint

After init completes, run a quick architecture analysis:

```bash
python3 $TOOLS/analyze_architecture.py --harness .harness-evolver/baseline/harness.py
```

If the analysis suggests the current topology may not be optimal for the task complexity, mention it:

> Architecture note: Current topology is "{topology}". For tasks with {characteristics},
> consider running `/harness-evolver:architect` for a detailed recommendation.

This is advisory only — do not spawn the architect agent.

## Gotchas

- The harness must write valid JSON to `--output`. If the user's code returns non-JSON, the wrapper must serialize it.
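The JSON gotcha can be handled with a small wrapper around the harness result. A minimal sketch, assuming the wrapper controls the output path; `write_output` is a hypothetical helper, not part of the package:

```python
import json

def write_output(result, path):
    # Hypothetical wrapper: serialize whatever the user's harness returned,
    # falling back to a string representation for non-JSON-serializable values.
    try:
        payload = json.dumps(result)
    except TypeError:
        payload = json.dumps({"output": str(result)})
    with open(path, "w") as f:
        f.write(payload)
```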
@@ -0,0 +1,512 @@
#!/usr/bin/env python3
"""Analyze harness architecture to detect current topology and produce signals.

Usage:
    analyze_architecture.py --harness PATH [--traces-dir PATH] [--summary PATH] [-o output.json]

Performs AST-based analysis of harness code, optional trace analysis, and optional
score analysis to classify the current agent topology and produce structured signals
for the architect agent.

Stdlib-only. No external dependencies.
"""

import argparse
import ast
import json
import os
import sys


# --- AST Analysis ---

LLM_API_DOMAINS = [
    "api.anthropic.com",
    "api.openai.com",
    "generativelanguage.googleapis.com",
]

LLM_SDK_MODULES = {"openai", "anthropic", "langchain_openai", "langchain_anthropic",
                   "langchain_core", "langchain_community", "langchain"}

RETRIEVAL_MODULES = {"chromadb", "pinecone", "qdrant_client", "weaviate"}

RETRIEVAL_METHOD_NAMES = {"similarity_search", "query"}

GRAPH_FRAMEWORK_CLASSES = {"StateGraph"}
GRAPH_FRAMEWORK_METHODS = {"add_node", "add_edge"}

PARALLEL_PATTERNS = {"gather"}  # asyncio.gather
PARALLEL_CLASSES = {"ThreadPoolExecutor", "ProcessPoolExecutor"}

TOOL_DICT_KEYS = {"name", "description", "parameters"}


def _get_all_imports(tree):
    """Extract all imported module root names."""
    imports = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imports.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                imports.add(node.module.split(".")[0])
    return imports


def _get_all_import_modules(tree):
    """Extract all imported module full names (including submodules)."""
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                modules.add(node.module)
    return modules


def _count_string_matches(tree, patterns):
    """Count AST string constants that contain any of the given patterns."""
    count = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            for pattern in patterns:
                if pattern in node.value:
                    count += 1
                    break
    return count


def _count_llm_calls(tree, imports, source_text):
    """Count LLM API calls: urllib requests to known domains + SDK client calls."""
    count = 0

    # Count urllib.request calls with LLM API domains in string constants
    count += _count_string_matches(tree, LLM_API_DOMAINS)

    # Collect imported LLM SDK roots (each SDK import implies at least 1 call site)
    full_modules = _get_all_import_modules(tree)
    sdk_found = set()
    for mod in full_modules:
        root = mod.split(".")[0]
        if root in LLM_SDK_MODULES:
            sdk_found.add(root)

    # For SDK users, look for actual call patterns like .create, .chat, .invoke, .run
    llm_call_methods = {"create", "chat", "invoke", "run", "generate", "predict",
                        "complete", "completions"}
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            if node.func.attr in llm_call_methods and sdk_found:
                count += 1

    # If we found SDK imports but no call sites or domain strings, count 1 per SDK
    if sdk_found and count == 0:
        count = len(sdk_found)

    return count


def _has_loop_around_llm(tree, source_text):
    """Check if any LLM call is inside a loop (for/while)."""
    for node in ast.walk(tree):
        if isinstance(node, (ast.For, ast.While)):
            # Walk the loop body looking for LLM call signals
            for child in ast.walk(node):
                # urllib.request.urlopen inside a loop
                if isinstance(child, ast.Attribute) and child.attr == "urlopen":
                    return True
                # SDK call methods inside a loop
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Attribute):
                    if child.func.attr in {"create", "chat", "invoke", "run",
                                           "generate", "predict", "complete"}:
                        return True
                # LLM API domain strings inside a loop
                if isinstance(child, ast.Constant) and isinstance(child.value, str):
                    for domain in LLM_API_DOMAINS:
                        if domain in child.value:
                            return True
    return False


def _has_tool_definitions(tree):
    """Check for tool definitions: dicts with name/description/parameters keys, or @tool decorators."""
    # Check for @tool decorator (sync and async functions)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for decorator in node.decorator_list:
                if isinstance(decorator, ast.Name) and decorator.id == "tool":
                    return True
                if isinstance(decorator, ast.Attribute) and decorator.attr == "tool":
                    return True

    # Check for dicts with tool-like keys
    for node in ast.walk(tree):
        if isinstance(node, ast.Dict):
            keys = set()
            for key in node.keys:
                if isinstance(key, ast.Constant) and isinstance(key.value, str):
                    keys.add(key.value)
            if TOOL_DICT_KEYS.issubset(keys):
                return True

    return False


def _has_retrieval(tree, imports):
    """Check for retrieval patterns: vector DB imports or .similarity_search/.query calls."""
    if imports & RETRIEVAL_MODULES:
        return True

    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute):
            if node.attr in RETRIEVAL_METHOD_NAMES:
                return True

    return False


def _has_graph_framework(tree, full_modules):
    """Check for graph framework usage (LangGraph StateGraph, add_node, add_edge)."""
    # Check if langgraph is imported
    for mod in full_modules:
        if "langgraph" in mod:
            return True

    # Check for StateGraph usage
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id in GRAPH_FRAMEWORK_CLASSES:
            return True
        if isinstance(node, ast.Attribute):
            if node.attr in GRAPH_FRAMEWORK_CLASSES or node.attr in GRAPH_FRAMEWORK_METHODS:
                return True

    return False


def _has_parallel_execution(tree, imports):
    """Check for asyncio.gather, concurrent.futures, ThreadPoolExecutor."""
    if "concurrent" in imports:
        return True

    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute):
            if node.attr in PARALLEL_PATTERNS:
                return True
            if node.attr in PARALLEL_CLASSES:
                return True
        if isinstance(node, ast.Name) and node.id in PARALLEL_CLASSES:
            return True

    return False


def _has_error_handling_around_llm(tree):
    """Check if LLM calls are wrapped in try/except."""
    for node in ast.walk(tree):
        if isinstance(node, ast.Try):
            # Walk the try body for LLM signals
            for child in ast.walk(node):
                if isinstance(child, ast.Attribute) and child.attr == "urlopen":
                    return True
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Attribute):
                    if child.func.attr in {"create", "chat", "invoke", "run",
                                           "generate", "predict", "complete"}:
                        return True
                if isinstance(child, ast.Constant) and isinstance(child.value, str):
                    for domain in LLM_API_DOMAINS:
                        if domain in child.value:
                            return True
    return False


def _count_functions(tree):
    """Count function definitions (top-level and nested)."""
    count = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            count += 1
    return count


def _count_classes(tree):
    """Count class definitions."""
    count = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            count += 1
    return count


def _estimate_topology(signals):
    """Classify the current topology based on code signals."""
    if signals["has_graph_framework"]:
        if signals["has_parallel_execution"]:
            return "parallel"
        return "hierarchical"

    if signals["has_retrieval"]:
        return "rag"

    if signals["has_loop_around_llm"]:
        return "react-loop"

    if signals["llm_call_count"] >= 3:
        if signals["has_tool_definitions"]:
            return "react-loop"
        return "chain"

    if signals["llm_call_count"] == 2:
        return "chain"

    return "single-call"

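A condensed, self-contained sketch of the decision order in `_estimate_topology` (simplified: it omits the tool-definition check applied when there are three or more LLM calls, and uses `dict.get` so missing signals default to falsy):

```python
def classify(signals):
    # Same priority order as the topology classifier above, condensed.
    if signals.get("has_graph_framework"):
        return "parallel" if signals.get("has_parallel_execution") else "hierarchical"
    if signals.get("has_retrieval"):
        return "rag"
    if signals.get("has_loop_around_llm"):
        return "react-loop"
    if signals.get("llm_call_count", 0) >= 2:
        return "chain"
    return "single-call"

print(classify({"has_retrieval": True}))   # rag
print(classify({"llm_call_count": 2}))     # chain
print(classify({}))                        # single-call
```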
def analyze_code(harness_path):
    """Analyze a harness Python file and return code signals."""
    with open(harness_path) as f:
        source = f.read()

    try:
        tree = ast.parse(source)
    except SyntaxError:
        return {
            "llm_call_count": 0,
            "has_loop_around_llm": False,
            "has_tool_definitions": False,
            "has_retrieval": False,
            "has_graph_framework": False,
            "has_parallel_execution": False,
            "has_error_handling": False,
            "estimated_topology": "unknown",
            "code_lines": len(source.splitlines()),
            "function_count": 0,
            "class_count": 0,
        }

    imports = _get_all_imports(tree)
    full_modules = _get_all_import_modules(tree)

    signals = {
        "llm_call_count": _count_llm_calls(tree, imports, source),
        "has_loop_around_llm": _has_loop_around_llm(tree, source),
        "has_tool_definitions": _has_tool_definitions(tree),
        "has_retrieval": _has_retrieval(tree, imports),
        "has_graph_framework": _has_graph_framework(tree, full_modules),
        "has_parallel_execution": _has_parallel_execution(tree, imports),
        "has_error_handling": _has_error_handling_around_llm(tree),
        "code_lines": len(source.splitlines()),
        "function_count": _count_functions(tree),
        "class_count": _count_classes(tree),
    }
    signals["estimated_topology"] = _estimate_topology(signals)

    return signals

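The SDK call detection above reduces to matching attribute names on `ast.Call` nodes. A minimal illustration with a hardcoded snippet:

```python
import ast

# A one-line example of an SDK-style LLM call.
src = "resp = client.chat.completions.create(model='m', messages=[])"
tree = ast.parse(src)
# Collect calls whose attribute name matches a known LLM call method.
calls = [
    node for node in ast.walk(tree)
    if isinstance(node, ast.Call)
    and isinstance(node.func, ast.Attribute)
    and node.func.attr == "create"
]
print(len(calls))  # 1
```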
# --- Trace Analysis ---

def analyze_traces(traces_dir):
    """Analyze execution traces for error patterns, timing, and failures."""
    if not os.path.isdir(traces_dir):
        return None

    result = {
        "error_patterns": [],
        "timing": None,
        "task_failures": [],
        "stderr_lines": 0,
    }

    # Read stderr.log
    stderr_path = os.path.join(traces_dir, "stderr.log")
    if os.path.isfile(stderr_path):
        try:
            with open(stderr_path) as f:
                stderr = f.read()
            lines = stderr.strip().splitlines()
            result["stderr_lines"] = len(lines)

            # Detect common error patterns
            error_counts = {}
            for line in lines:
                for pattern in ["Traceback", "Error", "Exception", "Timeout",
                                "ConnectionRefused", "HTTPError", "JSONDecodeError",
                                "KeyError", "TypeError", "ValueError"]:
                    if pattern in line:
                        error_counts[pattern] = error_counts.get(pattern, 0) + 1

            result["error_patterns"] = [
                {"pattern": p, "count": c}
                for p, c in sorted(error_counts.items(), key=lambda x: -x[1])
            ]
        except Exception:
            pass

    # Read timing.json
    timing_path = os.path.join(traces_dir, "timing.json")
    if os.path.isfile(timing_path):
        try:
            with open(timing_path) as f:
                result["timing"] = json.load(f)
        except Exception:
            pass

    # Scan per-task output directories for failures
    for entry in sorted(os.listdir(traces_dir)):
        task_dir = os.path.join(traces_dir, entry)
        if os.path.isdir(task_dir) and entry.startswith("task_"):
            output_path = os.path.join(task_dir, "output.json")
            if os.path.isfile(output_path):
                try:
                    with open(output_path) as f:
                        output = json.load(f)
                    # Check for empty or error outputs
                    out_value = output.get("output", "")
                    if not out_value or out_value in ("error", "unknown"):
                        result["task_failures"].append({
                            "task": entry,
                            "output": out_value,
                        })
                except Exception:
                    result["task_failures"].append({
                        "task": entry,
                        "output": "parse_error",
                    })

    return result

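The error-pattern counting above works like this minimal excerpt, run here on a hardcoded sample of stderr lines:

```python
lines = [
    "Traceback (most recent call last):",
    "KeyError: 'answer'",
    "KeyError: 'score'",
]
# Count occurrences of each known pattern across the lines.
error_counts = {}
for line in lines:
    for pattern in ("Traceback", "KeyError", "Timeout"):
        if pattern in line:
            error_counts[pattern] = error_counts.get(pattern, 0) + 1

# Rank patterns by descending count, as analyze_traces does.
ranked = sorted(error_counts.items(), key=lambda x: -x[1])
print(ranked)  # [('KeyError', 2), ('Traceback', 1)]
```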
# --- Score Analysis ---

def analyze_scores(summary_path):
    """Analyze summary.json for stagnation, oscillation, and score trend."""
    if not os.path.isfile(summary_path):
        return None

    try:
        with open(summary_path) as f:
            summary = json.load(f)
    except Exception:
        return None

    result = {
        "iterations": summary.get("iterations", 0),
        "best_score": 0.0,
        "baseline_score": 0.0,
        "recent_scores": [],
        "is_stagnating": False,
        "is_oscillating": False,
        "score_trend": "unknown",
    }

    # Extract best score
    best = summary.get("best", {})
    result["best_score"] = best.get("combined_score", 0.0)
    result["baseline_score"] = summary.get("baseline_score", 0.0)

    # Extract recent version scores (versions may be a list or a dict)
    versions = summary.get("versions", [])
    if isinstance(versions, list):
        result["recent_scores"] = [
            {"version": v.get("version", "?"), "score": v.get("combined_score", 0.0)}
            for v in versions[-5:]
        ]
    elif isinstance(versions, dict):
        result["recent_scores"] = [
            {"version": k, "score": v.get("combined_score", 0.0)}
            for k, v in sorted(versions.items())[-5:]
        ]

    # Detect stagnation (last 3+ scores within 0.01 of each other)
    scores = [s["score"] for s in result["recent_scores"]]
    if len(scores) >= 3:
        last_3 = scores[-3:]
        if max(last_3) - min(last_3) <= 0.01:
            result["is_stagnating"] = True

    # Detect oscillation (alternating up/down for last 4+ scores)
    if len(scores) >= 4:
        deltas = [scores[i + 1] - scores[i] for i in range(len(scores) - 1)]
        sign_changes = sum(
            1 for i in range(len(deltas) - 1)
            if (deltas[i] > 0 and deltas[i + 1] < 0) or (deltas[i] < 0 and deltas[i + 1] > 0)
        )
        if sign_changes >= len(deltas) - 1:
            result["is_oscillating"] = True

    # Score trend
    if len(scores) >= 2:
        if scores[-1] > scores[0]:
            result["score_trend"] = "improving"
        elif scores[-1] < scores[0]:
            result["score_trend"] = "declining"
        else:
            result["score_trend"] = "flat"

    return result

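The oscillation check above can be illustrated with a product test on a hardcoded score sequence; `deltas[i] * deltas[i + 1] < 0` is equivalent to the explicit sign comparison used in `analyze_scores`:

```python
scores = [0.50, 0.70, 0.55, 0.72]
# Consecutive score differences.
deltas = [scores[i + 1] - scores[i] for i in range(len(scores) - 1)]
# A negative product means the direction flipped between consecutive deltas.
sign_changes = sum(
    1 for i in range(len(deltas) - 1) if deltas[i] * deltas[i + 1] < 0
)
print(sign_changes >= len(deltas) - 1)  # True
```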
# --- Main ---

def main():
    parser = argparse.ArgumentParser(
        description="Analyze harness architecture and produce signals for the architect agent",
        usage="analyze_architecture.py --harness PATH [--traces-dir PATH] [--summary PATH] [-o output.json]",
    )
    parser.add_argument("--harness", required=True, help="Path to harness Python file")
    parser.add_argument("--traces-dir", default=None, help="Path to traces directory")
    parser.add_argument("--summary", default=None, help="Path to summary.json")
    parser.add_argument("-o", "--output", default=None, help="Output JSON path")
    args = parser.parse_args()

    if not os.path.isfile(args.harness):
        print(json.dumps({"error": f"Harness file not found: {args.harness}"}))
        sys.exit(1)

    result = {
        "code_signals": analyze_code(args.harness),
        "trace_signals": None,
        "score_signals": None,
    }

    if args.traces_dir:
        result["trace_signals"] = analyze_traces(args.traces_dir)

    if args.summary:
        result["score_signals"] = analyze_scores(args.summary)

    output = json.dumps(result, indent=2)

    if args.output:
        with open(args.output, "w") as f:
            f.write(output + "\n")
    else:
        print(output)


if __name__ == "__main__":
    main()
package/tools/init.py
CHANGED
@@ -317,6 +317,29 @@ def main():

    print("\nRecommendation: install Context7 MCP for up-to-date documentation:")
    print("  claude mcp add context7 -- npx -y @upstash/context7-mcp@latest")

    # Architecture analysis (quick, advisory)
    analyze_py = os.path.join(tools, "analyze_architecture.py")
    if os.path.exists(analyze_py):
        try:
            r = subprocess.run(
                ["python3", analyze_py, "--harness", args.harness],
                capture_output=True, text=True, timeout=30,
            )
            if r.returncode == 0 and r.stdout.strip():
                arch_signals = json.loads(r.stdout)
                config["architecture"] = {
                    "current_topology": arch_signals.get("code_signals", {}).get("estimated_topology", "unknown"),
                    "auto_analyzed": True,
                }
                # Re-write config with architecture
                with open(os.path.join(base, "config.json"), "w") as f:
                    json.dump(config, f, indent=2)
                topo = config["architecture"]["current_topology"]
                if topo != "unknown":
                    print(f"Architecture: {topo}")
        except Exception:
            pass
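The run-and-parse pattern above can be factored into a reusable helper; `run_json` is a hypothetical name, not part of `init.py`:

```python
import json
import subprocess
import sys

def run_json(cmd, timeout=30):
    # Hypothetical helper mirroring the advisory analysis call above: run a
    # command, parse its stdout as JSON, and return None on any failure.
    try:
        r = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        if r.returncode == 0 and r.stdout.strip():
            return json.loads(r.stdout)
    except (OSError, subprocess.TimeoutExpired, json.JSONDecodeError):
        pass
    return None

signals = run_json([sys.executable, "-c", "import json; print(json.dumps({'ok': True}))"])
print(signals)  # {'ok': True}
```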

    # 5. Validate baseline harness
    print("Validating baseline harness...")
    val_args = ["python3", evaluate_py, "validate",