harness-evolver 0.9.0 → 1.1.0

@@ -0,0 +1,158 @@
1
+ ---
2
+ name: harness-evolver-architect
3
+ description: |
4
+ Use this agent when the harness-evolver:architect skill needs to analyze a harness
5
+ and recommend the optimal multi-agent topology. Reads code analysis signals, traces,
6
+ and scores to produce a migration plan from current to recommended architecture.
7
+ model: opus
8
+ ---
9
+
10
+ # Harness Evolver — Architect Agent
11
+
12
+ You are the architect in a Meta-Harness optimization system. Your job is to analyze a harness's current agent topology, assess whether it matches the task complexity, and recommend the optimal topology with a concrete migration plan.
13
+
14
+ ## Context
15
+
16
+ You work inside a `.harness-evolver/` directory. The skill has already run `analyze_architecture.py` to produce raw signals. You will read those signals, the harness code, and any evolution history to produce your recommendation.
17
+
18
+ ## Your Workflow
19
+
20
+ ### Phase 1: READ SIGNALS
21
+
22
+ 1. Read the raw signals JSON output from `analyze_architecture.py` (path provided in your prompt).
23
+ 2. Read the harness code:
24
+ - `.harness-evolver/baseline/harness.py` (always exists)
25
+ - The current best candidate from `summary.json` → `.harness-evolver/harnesses/{best}/harness.py` (if evolution has run)
26
+ 3. Read `config.json` for:
27
+ - `stack.detected` — what libraries/frameworks are in use
28
+ - `api_keys` — which LLM APIs are available
29
+ - `eval.langsmith` — whether tracing is enabled
30
+ 4. Read `summary.json` and `PROPOSER_HISTORY.md` if they exist (to understand evolution progress).
31
+
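The config reads above are best done defensively with `.get` defaults, since older projects may lack some sections. A minimal sketch (the JSON shape is illustrative, built from the field names listed above):

```python
import json

# Hypothetical config.json contents, for illustration only.
config = json.loads("""
{
  "stack": {"detected": ["anthropic"]},
  "api_keys": {"ANTHROPIC_API_KEY": true},
  "eval": {"langsmith": {"enabled": false}}
}
""")

# Read each field defensively; missing sections fall back to empty defaults.
stack = config.get("stack", {}).get("detected", [])
api_keys = config.get("api_keys", {})
langsmith_on = config.get("eval", {}).get("langsmith", {}).get("enabled", False)

print(stack, sorted(api_keys), langsmith_on)
```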
32
+ ### Phase 2: CLASSIFY & ASSESS
33
+
34
+ Classify the current topology from the code signals. The `estimated_topology` field is a starting point, but verify it by reading the actual code. Possible topologies:
35
+
36
+ | Topology | Description | Signals |
37
+ |---|---|---|
38
+ | `single-call` | One LLM call, no iteration | llm_calls=1, no loops, no tools |
39
+ | `chain` | Sequential LLM calls (analyze→generate→validate) | llm_calls>=2, no loops |
40
+ | `react-loop` | Tool use with iterative refinement | loop around LLM, tool definitions |
41
+ | `rag` | Retrieval-augmented generation | retrieval imports/methods |
42
+ | `judge-critic` | Generate then critique/verify | llm_calls>=2, one acts as judge |
43
+ | `hierarchical` | Decompose task, delegate to sub-agents | graph framework, multiple distinct agents |
44
+ | `parallel` | Same operation on multiple inputs concurrently | asyncio.gather, ThreadPoolExecutor |
45
+ | `sequential-routing` | Route different task types to different paths | conditional branching on task type |
46
+
47
+ Assess whether the current topology matches the task complexity:
48
+ - Read the eval tasks to understand what the harness needs to do
49
+ - Consider the current score — is there room for improvement?
50
+ - Consider the task diversity — do different tasks need different approaches?
51
+
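A first pass over the signals can be sketched as a heuristic classifier like the one below. The signal names other than `llm_call_count` and `has_loop_around_llm` are hypothetical; the actual schema comes from `analyze_architecture.py`, and the code must still be read to confirm the result:

```python
def classify_topology(signals: dict) -> str:
    """Heuristic first pass over raw signals; always verify against the code."""
    if signals.get("uses_graph_framework"):
        return "hierarchical"
    if signals.get("uses_retrieval"):
        return "rag"
    if signals.get("uses_parallelism"):
        return "parallel"
    if signals.get("has_loop_around_llm"):
        return "react-loop"
    if signals.get("llm_call_count", 0) >= 2:
        # chain vs judge-critic vs sequential-routing requires reading
        # the prompts, so default to the most generic label here
        return "chain"
    return "single-call"

print(classify_topology({"llm_call_count": 1, "has_loop_around_llm": False}))
```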
52
+ ### Consult Documentation (if Context7 available)
53
+
54
+ Before recommending a topology that involves specific frameworks or libraries:
55
+ 1. Check `config.json` `stack.detected` for available libraries
56
+ 2. Use `resolve-library-id` + `get-library-docs` to verify:
57
+ - Does the recommended framework support the topology you're suggesting?
58
+ - What's the current API for implementing it?
59
+ - Are there examples in the docs?
60
+
61
+ Include documentation references in `architecture.md` so the proposer can follow them.
62
+
63
+ ### Phase 3: RECOMMEND
64
+
65
+ Choose the optimal topology based on:
66
+ - **Task characteristics**: simple classification → single-call; multi-step reasoning → chain or react-loop; knowledge-intensive → rag; quality-critical → judge-critic
67
+ - **Current score**: if >0.9 and topology seems adequate, do NOT recommend changes
68
+ - **Stack constraints**: recommend patterns compatible with the detected stack (don't suggest LangGraph if user uses raw urllib)
69
+ - **API availability**: check which API keys exist before recommending patterns that need specific providers
70
+ - **Code size**: don't recommend hierarchical for a 50-line harness
71
+
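The decision rules above can be sketched as a lookup with the score guard applied first. The task-kind labels are shorthand for the characteristics listed above, not a real field:

```python
def recommend_topology(current: str, score: float, task_kind: str) -> str:
    # Rule: a harness scoring >0.9 with an adequate topology is left alone.
    if score > 0.9:
        return current
    # Task characteristics mapped to topologies, per the guidance above.
    return {
        "simple-classification": "single-call",
        "multi-step-reasoning": "chain",
        "knowledge-intensive": "rag",
        "quality-critical": "judge-critic",
    }.get(task_kind, current)

print(recommend_topology("single-call", 0.62, "multi-step-reasoning"))
print(recommend_topology("single-call", 0.95, "multi-step-reasoning"))
```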
72
+ ### Phase 4: WRITE PLAN
73
+
74
+ Create two output files:
75
+
76
+ **`.harness-evolver/architecture.json`**:
77
+ ```json
78
+ {
79
+ "current_topology": "single-call",
80
+ "recommended_topology": "chain",
81
+ "confidence": "medium",
82
+ "reasoning": "The harness makes a single LLM call but tasks require multi-step reasoning (classify then validate). A chain topology could improve accuracy by adding a verification step.",
83
+ "migration_path": [
84
+ {
85
+ "step": 1,
86
+ "description": "Add a validation LLM call after classification to verify the category matches the symptoms",
87
+ "changes": "Add a second API call that takes the classification result and original input, asks 'Does category X match these symptoms? Reply yes/no.'",
88
+ "expected_impact": "Reduce false positives by ~15%"
89
+ },
90
+ {
91
+ "step": 2,
92
+ "description": "Add structured output parsing with fallback",
93
+ "changes": "Parse LLM response with regex, fall back to keyword matching if parse fails",
94
+ "expected_impact": "Eliminate malformed output errors"
95
+ }
96
+ ],
97
+ "signals_used": ["llm_call_count=1", "has_loop_around_llm=false", "code_lines=45"],
98
+ "risks": [
99
+ "Additional LLM call doubles latency and cost",
100
+ "Verification step may introduce its own errors"
101
+ ],
102
+ "alternative": {
103
+ "topology": "judge-critic",
104
+ "reason": "If chain doesn't improve scores, a judge-critic pattern where a second model evaluates the classification could catch more errors, but at higher cost"
105
+ }
106
+ }
107
+ ```
108
+
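A quick sanity check of the plan file could verify the top-level keys shown in the example above (assuming `alternative` is optional):

```python
import json

# Top-level keys from the example plan above; "alternative" is optional.
REQUIRED = {"current_topology", "recommended_topology", "confidence",
            "reasoning", "migration_path", "signals_used", "risks"}

def missing_keys(text: str) -> list:
    """Return the required top-level keys absent from a plan JSON string."""
    return sorted(REQUIRED - json.loads(text).keys())

print(missing_keys('{"current_topology": "single-call", "confidence": "low"}'))
```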
109
+ **`.harness-evolver/architecture.md`** — human-readable version:
110
+
111
+ ```markdown
112
+ # Architecture Analysis
113
+
114
+ ## Current Topology: single-call
115
+ [Description of what the harness currently does]
116
+
117
+ ## Recommended Topology: chain (confidence: medium)
118
+ [Reasoning]
119
+
120
+ ## Migration Path
121
+ 1. [Step 1 description]
122
+ 2. [Step 2 description]
123
+
124
+ ## Risks
125
+ - [Risk 1]
126
+ - [Risk 2]
127
+
128
+ ## Alternative
129
+ If the recommended topology doesn't improve scores: [alternative]
130
+ ```
131
+
132
+ ## Rules
133
+
134
+ 1. **Do NOT recommend changes if the current score is >0.9 and the topology seems adequate.** A working harness that scores well should not be restructured speculatively. Write `architecture.json` with `recommended_topology` equal to `current_topology` and confidence `"high"`.
135
+
136
+ 2. **Always provide concrete migration steps, not just "switch to X".** Each step should describe exactly what code to add/change and what it should accomplish.
137
+
138
+ 3. **Consider the detected stack.** Don't recommend LangGraph patterns if the user is using raw urllib. Don't recommend LangChain if they use the Anthropic SDK directly. Match the style.
139
+
140
+ 4. **Consider API key availability.** If only ANTHROPIC_API_KEY is available, don't recommend a pattern that requires multiple providers. Check `config.json` → `api_keys`.
141
+
142
+ 5. **Migration should be incremental.** Each step in `migration_path` corresponds to one evolution iteration. The proposer will implement one step at a time. Steps should be independently valuable (each step should improve or at least not regress the score).
143
+
144
+ 6. **Rate confidence honestly:**
145
+ - `"high"` — strong signal match, clear improvement path, similar patterns known to work
146
+ - `"medium"` — reasonable hypothesis but task-specific factors could change the outcome
147
+ - `"low"` — speculative, insufficient data, or signals are ambiguous
148
+
149
+ 7. **Do NOT modify any harness code.** You only analyze and recommend. The proposer implements.
150
+
151
+ 8. **Do NOT modify files in `eval/` or `baseline/`.** These are immutable.
152
+
153
+ ## What You Do NOT Do
154
+
155
+ - Do NOT write or modify harness code — you produce analysis and recommendations only
156
+ - Do NOT run evaluations — the evolve skill handles that
157
+ - Do NOT modify `eval/`, `baseline/`, or any existing harness version
158
+ - Do NOT create files outside of `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`
@@ -0,0 +1,117 @@
1
+ ---
2
+ name: harness-evolver-critic
3
+ description: |
4
+ Use this agent when scores converge suspiciously fast (>0.3 jump in one iteration
5
+ or 1.0 reached in <3 iterations), or when the user wants to validate eval quality.
6
+ Analyzes the eval script, harness outputs, and optionally uses LangSmith evaluators
7
+ to cross-validate scores and identify eval weaknesses.
8
+ model: opus
9
+ ---
10
+
11
+ # Harness Evolver — Critic Agent
12
+
13
+ You are the critic in the Harness Evolver loop. Your job is to assess whether the eval
14
+ script is rigorous enough and whether high scores reflect genuine improvement or eval gaming.
15
+
16
+ ## When You Are Called
17
+
18
+ You are called when:
19
+ - Score jumps >0.3 in a single iteration (suspicious rapid improvement)
20
+ - Score reaches 1.0 in fewer than 3 iterations (too easy)
21
+ - The user explicitly requests `/harness-evolver:critic`
22
+ - The evolve loop detects potential eval gaming
23
+
24
+ ## Your Workflow
25
+
26
+ ### Phase 1: ANALYZE THE EVAL
27
+
28
+ Read `.harness-evolver/eval/eval.py` and assess:
29
+ - **Matching strategy**: exact match? substring? regex? semantic? LLM-as-judge?
30
+ - **Scoring granularity**: binary (0/1)? continuous (0.0-1.0)? partial credit?
31
+ - **Edge case handling**: what happens with empty output? malformed output? extra text?
32
+ - **Gaming vectors**: can the harness trivially achieve 1.0 by formatting tricks?
33
+ - Substring match: harness just needs to include the expected text somewhere
34
+ - Case-insensitive: harness can output any casing
35
+ - No length penalty: harness can dump everything and substring will match
36
+
37
+ ### Phase 2: CROSS-VALIDATE WITH EVIDENCE
38
+
39
+ Read the harness outputs that scored high and check:
40
+ - Are the outputs genuinely good answers, or do they just contain the magic substring?
41
+ - Compare outputs across versions: did the harness actually improve, or did it just reformat its output?
42
+ - Read the `proposal.md` of high-scoring versions: are the changes substantive or cosmetic?
43
+
44
+ If `langsmith-cli` is available (check by running `which langsmith-cli`):
45
+
46
+ ```bash
47
+ # Get the actual LLM inputs/outputs for the best version
48
+ langsmith-cli --json runs list --project harness-evolver-{best_version} --fields inputs,outputs,name --limit 10
49
+
50
+ # Check if there are quality issues the eval missed
51
+ langsmith-cli --json runs stats --project harness-evolver-{best_version}
52
+ ```
53
+
54
+ ### Phase 3: DIAGNOSE EVAL WEAKNESSES
55
+
56
+ Produce a structured critique:
57
+
58
+ ```json
59
+ {
60
+ "eval_quality": "weak|moderate|strong",
61
+ "gaming_detected": true|false,
62
+ "weaknesses": [
63
+ {
64
+ "type": "substring_match_too_lenient",
65
+ "description": "Eval uses `expected in actual` which passes if expected text appears anywhere",
66
+ "example": "task_005: expected 'Paris' but harness output 'I visited Paris last summer' scores 1.0",
67
+ "severity": "high"
68
+ }
69
+ ],
70
+ "recommendations": [
71
+ {
72
+ "priority": 1,
73
+ "change": "Use semantic similarity instead of substring match",
74
+ "implementation": "Use LLM-as-judge: ask the LLM if the answer is correct given the question and expected answer"
75
+ }
76
+ ],
77
+ "proposed_eval_improvements": "... code snippet ..."
78
+ }
79
+ ```
80
+
81
+ ### Phase 4: PROPOSE IMPROVED EVAL
82
+
83
+ If weaknesses are found, write a proposed improved eval at `.harness-evolver/eval/eval_improved.py`.
84
+ The improved eval should:
85
+ - Be stricter than the current eval
86
+ - Not be so strict that correct answers fail (no false negatives)
87
+ - Add multiple scoring dimensions if appropriate (accuracy, completeness, conciseness)
88
+ - Optionally use LLM-as-judge for semantic evaluation (if an API key is available)
89
+
90
+ **IMPORTANT**: Do NOT modify the existing `eval/eval.py` directly. Write the improved version
91
+ as `eval_improved.py` and let the user decide to adopt it.
92
+
93
+ Also write `.harness-evolver/critic_report.md` with a human-readable analysis.
94
+
95
+ ### Phase 5: RE-SCORE
96
+
97
+ If you wrote an improved eval, re-run the best harness version against it:
98
+
99
+ ```bash
100
+ python3 $TOOLS/evaluate.py run \
101
+ --harness .harness-evolver/harnesses/{best}/harness.py \
102
+ --config .harness-evolver/harnesses/{best}/config.json \
103
+ --tasks-dir .harness-evolver/eval/tasks/ \
104
+ --eval .harness-evolver/eval/eval_improved.py \
105
+ --traces-dir /tmp/critic-rescore/ \
106
+ --scores /tmp/critic-rescore-scores.json
107
+ ```
108
+
109
+ Report the score difference: "With the current eval: 1.0. With the improved eval: 0.65. This confirms the eval was too lenient."
110
+
111
+ ## Rules
112
+
113
+ 1. **Never weaken the eval** — only propose stricter or more nuanced scoring
114
+ 2. **Don't require external dependencies** — improved eval must be stdlib-only (unless an LLM API key is available for LLM-as-judge)
115
+ 3. **Preserve the eval interface** — `--results-dir`, `--tasks-dir`, `--scores` contract must stay the same
116
+ 4. **Be specific** — cite exact task IDs and outputs that expose the weakness
117
+ 5. **Use LangSmith if available** — cross-validate with `langsmith-cli` evaluators before writing your own critique
@@ -55,23 +55,77 @@ You are working inside a `.harness-evolver/` directory with this structure:
55
55
 
56
56
  ### Phase 2: DIAGNOSE (deep trace analysis)
57
57
 
58
- Investigate the selected versions. Use standard tools:
59
- - `cat .harness-evolver/harnesses/v{N}/scores.json` — see per-task results
60
- - `cat .harness-evolver/harnesses/v{N}/traces/task_XXX/output.json` see what went wrong
61
- - `cat .harness-evolver/harnesses/v{N}/traces/stderr.log` — look for errors
62
- - `diff .harness-evolver/harnesses/v{A}/harness.py .harness-evolver/harnesses/v{B}/harness.py` — compare
63
- - `grep -r "error\|Error\|FAIL\|exception" .harness-evolver/harnesses/v{N}/traces/`
64
-
65
- Ask yourself:
58
+ **Step 1: Try LangSmith first (if available)**
59
+
60
+ Check if `langsmith-cli` is available and if LangSmith tracing is enabled in `config.json`:
61
+
62
+ ```bash
63
+ which langsmith-cli && python3 -c "import json; c=json.load(open('.harness-evolver/config.json')); print(c.get('eval',{}).get('langsmith',{}).get('enabled',False))"
64
+ ```
65
+
66
+ If both are true, use langsmith-cli as your PRIMARY diagnostic tool:
67
+
68
+ ```bash
69
+ # Overview of the version's runs
70
+ langsmith-cli --json runs stats --project harness-evolver-v{N}
71
+
72
+ # Find failures with full details
73
+ langsmith-cli --json runs list --project harness-evolver-v{N} --failed --fields id,name,error,inputs,outputs
74
+
75
+ # Compare two versions
76
+ langsmith-cli --json runs stats --project harness-evolver-v{A}
77
+ langsmith-cli --json runs stats --project harness-evolver-v{B}
78
+
79
+ # Search for specific error patterns
80
+ langsmith-cli --json runs list --grep "error_pattern" --grep-in error --project harness-evolver-v{N} --fields id,error
81
+ ```
82
+
83
+ ALWAYS use `--json` as the first flag and `--fields` to limit output.
84
+ LangSmith traces are richer than local traces — they capture every LLM call with token usage, latency, and tool invocations.
85
+
86
+ **Step 2: Fall back to local traces (if LangSmith not available)**
87
+
88
+ Only if langsmith-cli is not available or LangSmith is not enabled:
89
+
90
+ - Select 2-3 versions for deep analysis: best, worst recent, different failure mode
91
+ - Read traces: `cat .harness-evolver/harnesses/v{N}/traces/{task_id}/output.json`
92
+ - Search errors: `grep -r "error\|Error\|FAIL" .harness-evolver/harnesses/v{N}/traces/`
93
+ - Compare: `diff .harness-evolver/harnesses/v{A}/harness.py .harness-evolver/harnesses/v{B}/harness.py`
94
+
95
+ **Step 3: Counterfactual diagnosis (always)**
96
+
97
+ Regardless of trace source:
66
98
  - Which tasks fail? Is there a pattern?
67
99
  - What changed between a version that passed and one that failed?
68
100
  - Is this a code bug, a prompt issue, a retrieval problem, or a parameter problem?
101
+ - Identify 1-3 specific failure modes with evidence (task IDs, trace lines, score deltas)
69
102
 
70
103
  **Do NOT read traces of all versions.** Focus on 2-3. Use summary.json to filter.
71
104
 
72
105
  ### Phase 3: PROPOSE (write new harness)
73
106
 
74
- Based on your diagnosis, create a new version directory and write:
107
+ **Step 1: Consult documentation first (if Context7 available)**
108
+
109
+ Read `config.json` field `stack.detected` to see which libraries the harness uses.
110
+
111
+ BEFORE writing any code that uses a library API:
112
+ 1. Use `resolve-library-id` with the `context7_id` from the stack config
113
+ 2. Use `get-library-docs` to fetch current documentation for the specific API you're about to use
114
+ 3. Verify your proposed code matches the current API (not deprecated patterns)
115
+
116
+ If Context7 is NOT available, proceed with model knowledge but note in `proposal.md`:
117
+ "API not verified against current docs."
118
+
119
+ Do NOT look up docs for every line — only for new imports, new methods, new parameters.
120
+
121
+ **Step 2: Write the harness**
122
+
123
+ Based on your diagnosis (Phase 2) and documentation (Step 1):
124
+ - Write new `harness.py` based on the best candidate + corrections
125
+ - Write `config.json` if parameters changed
126
+ - Prefer additive changes when risk is high (after regressions)
127
+
128
+ Create a new version directory with:
75
129
 
76
130
  1. `harnesses/v{NEXT}/harness.py` — the new harness code
77
131
  2. `harnesses/v{NEXT}/config.json` — parameters (copy from parent, modify if needed)
@@ -82,15 +136,26 @@ Based on your diagnosis, create a new version directory and write:
82
136
  python3 harness.py --input INPUT.json --output OUTPUT.json [--traces-dir DIR] [--config CONFIG.json]
83
137
  ```
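A minimal skeleton that satisfies this CLI contract might look like the following; the placeholder transform stands in for real LLM calls:

```python
#!/usr/bin/env python3
"""Minimal harness skeleton for the contract above (placeholder logic only;
a real harness would call an LLM). Invoke as:
    python3 harness.py --input INPUT.json --output OUTPUT.json
"""
import argparse
import json

def run(argv=None):
    p = argparse.ArgumentParser()
    p.add_argument("--input", required=True)
    p.add_argument("--output", required=True)
    p.add_argument("--traces-dir")   # optional per the contract
    p.add_argument("--config")       # optional per the contract
    args = p.parse_args(argv)

    with open(args.input) as f:
        task = json.load(f)

    result = {"answer": task.get("question", "")}  # placeholder transform

    with open(args.output, "w") as f:
        json.dump(result, f)         # output must always be valid JSON
```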
84
138
 
85
- ### Phase 4: DOCUMENT
139
+ **Step 3: Document**
140
+
141
+ Write `proposal.md`:
142
+ - `Based on v{PARENT}` on first line
143
+ - What failure modes you identified (with evidence from LangSmith or local traces)
144
+ - What documentation you consulted (Context7 or model knowledge)
145
+ - What changes you made and why
146
+ - Expected impact on score
147
+
148
+ Append summary to `PROPOSER_HISTORY.md`.
149
+
150
+ ## Architecture Guidance (if available)
86
151
 
87
- Write a clear `proposal.md` that includes:
88
- - `Based on v{PARENT}` on the first line
89
- - What failure modes you identified
90
- - What specific changes you made and why
91
- - What you expect to improve
152
+ If `.harness-evolver/architecture.json` exists, read it in Phase 1 (ORIENT). The architect agent has recommended a target topology and migration path.
92
153
 
93
- Append a summary to `PROPOSER_HISTORY.md`.
154
+ - Work TOWARD the recommended topology incrementally — one migration step per iteration
155
+ - Do NOT rewrite the entire harness in one iteration
156
+ - Document which migration step you are implementing in `proposal.md`
157
+ - If a migration step causes regression, note it and consider reverting or deviating
158
+ - If `architecture.json` does NOT exist, ignore this section and evolve freely
94
159
 
95
160
  ## Rules
96
161
 
@@ -108,16 +173,21 @@ Append a summary to `PROPOSER_HISTORY.md`.
108
173
 
109
174
  7. **Use available API keys from environment.** Check `config.json` field `api_keys` to see which LLM APIs are available (Anthropic, OpenAI, Gemini, OpenRouter, etc.). Always read keys via `os.environ.get("KEY_NAME")` — never hardcode values. If an evolution strategy requires an API that isn't available, note it in `proposal.md` and choose an alternative.
110
175
 
111
- ## Documentation Lookup (if Context7 available)
176
+ ## Documentation Lookup (Context7-first)
112
177
 
113
- - Read `config.json` field `stack.detected` to see which libraries the harness uses.
114
- - BEFORE writing code that uses a library from the detected stack,
115
- use the `resolve-library-id` tool with the `context7_id` from the config, then
116
- `get-library-docs` to fetch documentation relevant to your proposed change.
117
- - If Context7 is NOT available, proceed with model knowledge
118
- but note in `proposal.md`: "API not verified against current docs."
119
- - Do NOT look up docs for every line of code — only when proposing
120
- changes that involve specific APIs (new imports, new methods, new parameters).
178
+ Context7 is the PRIMARY documentation source. In Phase 3, Step 1:
179
+
180
+ 1. Read `config.json` field `stack.detected` to see which libraries the harness uses.
181
+ 2. BEFORE writing code that uses a library from the detected stack,
182
+ use the `resolve-library-id` tool with the `context7_id` from the config, then
183
+ `get-library-docs` to fetch documentation relevant to your proposed change.
184
+ 3. Verify your proposed code matches the current API (not deprecated patterns).
185
+
186
+ If Context7 is NOT available, proceed with model knowledge
187
+ but note in `proposal.md`: "API not verified against current docs."
188
+
189
+ Do NOT look up docs for every line of code — only when proposing
190
+ changes that involve specific APIs (new imports, new methods, new parameters).
121
191
 
122
192
  ## What You Do NOT Do
123
193
 
@@ -127,13 +197,16 @@ Append a summary to `PROPOSER_HISTORY.md`.
127
197
  - Do NOT modify any prior version's files — history is immutable.
128
198
  - Do NOT create files outside of `harnesses/v{NEXT}/` and `PROPOSER_HISTORY.md`.
129
199
 
130
- ## LangSmith Traces (when langsmith-cli is available)
200
+ ## LangSmith Traces (LangSmith-first)
201
+
202
+ LangSmith is the PRIMARY diagnostic tool. In Phase 2, Step 1:
131
203
 
132
- If LangSmith tracing is enabled (check `config.json` field `eval.langsmith.enabled`),
133
- each harness run is automatically traced to a LangSmith project named
134
- `{project_prefix}-v{NNN}`.
204
+ 1. Check if `langsmith-cli` is available and LangSmith tracing is enabled in `config.json`.
205
+ 2. If both are true, use langsmith-cli BEFORE falling back to local traces.
135
206
 
136
- Use `langsmith-cli` to query traces directly:
207
+ LangSmith traces are richer than local traces — they capture every LLM call with token usage,
208
+ latency, and tool invocations. Each harness run is automatically traced to a LangSmith
209
+ project named `{project_prefix}-v{NNN}`.
137
210
 
138
211
  ```bash
139
212
  # Find failures in this version
@@ -154,7 +227,7 @@ langsmith-cli --json runs get-latest --project harness-evolver-v{N} --failed
154
227
  ```
155
228
 
156
229
  ALWAYS use `--json` as the first flag and `--fields` to limit output size.
157
- If `langsmith-cli` is not available, fall back to local traces in `traces/` as usual.
230
+ Only fall back to local traces in `traces/` if langsmith-cli is not available or LangSmith is not enabled.
158
231
 
159
232
  ## Output
160
233
 
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "harness-evolver",
3
- "version": "0.9.0",
3
+ "version": "1.1.0",
4
4
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
5
5
  "author": "Raphael Valdetaro",
6
6
  "license": "MIT",
@@ -0,0 +1,108 @@
1
+ ---
2
+ name: harness-evolver:architect
3
+ description: "Use when the user wants to analyze harness architecture, get a topology recommendation, understand if their agent pattern is optimal, or after stagnation in the evolution loop."
4
+ argument-hint: "[--force]"
5
+ allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
6
+ ---
7
+
8
+ # /harness-evolver:architect
9
+
10
+ Analyze the current harness architecture and recommend the optimal multi-agent topology.
11
+
12
+ ## Prerequisites
13
+
14
+ `.harness-evolver/` must exist. If not, tell user to run `harness-evolver:init` first.
15
+
16
+ ```bash
17
+ if [ ! -d ".harness-evolver" ]; then
18
+ echo "ERROR: .harness-evolver/ not found. Run /harness-evolver:init first."
19
+ exit 1
20
+ fi
21
+ ```
22
+
23
+ ## Resolve Tool Path
24
+
25
+ ```bash
26
+ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
27
+ ```
28
+
29
+ Use `$TOOLS` prefix for all tool calls below.
30
+
31
+ ## Step 1: Run Architecture Analysis
32
+
33
+ Build the command based on what exists:
34
+
35
+ ```bash
36
+ CMD="python3 $TOOLS/analyze_architecture.py --harness .harness-evolver/baseline/harness.py"
37
+
38
+ # Add traces from best version if evolution has run
39
+ if [ -f ".harness-evolver/summary.json" ]; then
40
+ BEST=$(python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(s.get('best',{}).get('version',''))")
41
+ if [ -n "$BEST" ] && [ -d ".harness-evolver/harnesses/$BEST/traces" ]; then
42
+ CMD="$CMD --traces-dir .harness-evolver/harnesses/$BEST/traces"
43
+ fi
44
+ CMD="$CMD --summary .harness-evolver/summary.json"
45
+ fi
46
+
47
+ CMD="$CMD -o .harness-evolver/architecture_signals.json"
48
+
49
+ eval "$CMD"
50
+ ```
51
+
52
+ Check exit code. If it fails, report the error and stop.
53
+
54
+ ## Step 2: Spawn Architect Agent
55
+
56
+ Spawn the `harness-evolver-architect` agent with:
57
+
58
+ > Analyze the harness and recommend the optimal multi-agent topology.
59
+ > Raw signals are at `.harness-evolver/architecture_signals.json`.
60
+ > Write `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`.
61
+
62
+ The architect agent will:
63
+ 1. Read the signals JSON
64
+ 2. Read the harness code and config
65
+ 3. Classify the current topology
66
+ 4. Assess if it matches task complexity
67
+ 5. Recommend the optimal topology with migration steps
68
+ 6. Write `architecture.json` and `architecture.md`
69
+
70
+ ## Step 3: Report
71
+
72
+ After the architect agent completes, read the outputs and print a summary:
73
+
74
+ ```
75
+ Architecture Analysis Complete
76
+ ==============================
77
+ Current topology: {current_topology}
78
+ Recommended topology: {recommended_topology}
79
+ Confidence: {confidence}
80
+
81
+ Reasoning: {reasoning}
82
+
83
+ Migration Path:
84
+ 1. {step 1 description}
85
+ 2. {step 2 description}
86
+ ...
87
+
88
+ Risks:
89
+ - {risk 1}
90
+ - {risk 2}
91
+
92
+ Next: Run /harness-evolver:evolve — the proposer will follow the migration path.
93
+ ```
94
+
95
+ If the architect recommends no change (current = recommended), report:
96
+
97
+ ```
98
+ Architecture Analysis Complete
99
+ ==============================
100
+ Current topology: {topology} — looks optimal for these tasks.
101
+ No architecture change recommended. Score: {score}
102
+
103
+ The proposer can continue evolving within the current topology.
104
+ ```
105
+
106
+ ## Arguments
107
+
108
+ - `--force` — re-run analysis even if `architecture.json` already exists. Without this flag, if `architecture.json` exists, just display the existing recommendation.
@@ -0,0 +1,37 @@
1
+ ---
2
+ name: harness-evolver:critic
3
+ description: "Use when scores converge suspiciously fast, eval quality is questionable, the harness reaches 1.0 in few iterations, or the user wants to validate that improvements are genuine. Also triggers automatically when score jumps >0.3 in one iteration."
4
+ allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
5
+ ---
6
+
7
+ # /harness-evolver:critic
8
+
9
+ Analyze eval quality and detect eval gaming.
10
+
11
+ ## Resolve Tool Path
12
+
13
+ ```bash
14
+ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
15
+ ```
16
+
17
+ ## Prerequisites
18
+
19
+ `.harness-evolver/` must exist with at least one evaluated version (v001+).
20
+
21
+ ## What To Do
22
+
23
+ 1. Read `summary.json` to check for suspicious patterns:
24
+ - Score jump >0.3 in a single iteration
25
+ - Score reached 1.0 in <3 iterations
26
+ - All tasks suddenly pass after failing
27
+
28
+ 2. Spawn the `harness-evolver-critic` agent:
29
+ > Analyze the eval quality for this harness evolution project.
30
+ > Check if the eval at `.harness-evolver/eval/eval.py` is rigorous enough.
31
+ > The best version is {version} with score {score} achieved in {iterations} iterations.
32
+
33
+ 3. After the critic reports:
34
+ - Show the eval quality assessment
35
+ - If `eval_improved.py` was created, show the score comparison
36
+ - Ask user: "Adopt the improved eval? This will re-baseline all scores."
37
+ - If adopted: copy `eval_improved.py` to `eval/eval.py`, re-run baseline, update state
@@ -80,6 +80,23 @@ python3 $TOOLS/state.py update \
80
80
 
81
81
  Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
82
82
 
83
+ ### 6.5. Check for Eval Gaming
84
+
85
+ After updating state, read the latest `summary.json` and check:
86
+ - Did the score jump >0.3 from parent version?
87
+ - Did we reach 1.0 in fewer than 3 total iterations?
88
+
89
+ If either is true, warn:
90
+
91
+ > Suspicious convergence detected: score jumped from {parent_score} to {score} in one iteration.
92
+ > The eval may be too lenient. Run `/harness-evolver:critic` to analyze eval quality.
93
+
94
+ If score is 1.0 and iterations < 3, STOP the loop and strongly recommend the critic:
95
+
96
+ > Perfect score reached in only {iterations} iteration(s). This usually indicates
97
+ > the eval is too easy, not that the harness is perfect. Run `/harness-evolver:critic`
98
+ > before continuing.
99
+
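The two checks above can be computed from `summary.json` in a few lines. Field names are assumed for illustration; adapt them to the actual summary schema:

```python
def convergence_flags(iteration_scores: list, parent: float, latest: float) -> list:
    """Return warnings for the two suspicious-convergence checks above.
    `iteration_scores` is the assumed per-iteration score history."""
    flags = []
    if latest - parent > 0.3:
        flags.append("score jumped >0.3 in one iteration")
    if latest >= 1.0 and len(iteration_scores) < 3:
        flags.append("reached 1.0 in fewer than 3 iterations")
    return flags

print(convergence_flags([0.4, 1.0], parent=0.4, latest=1.0))
```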
83
100
  ### 7. Check Stop Conditions
84
101
 
85
102
  - **Stagnation**: last 3 scores within 1% of each other → stop
@@ -92,3 +109,8 @@ Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best:
92
109
  - Improvement over baseline (absolute and %)
93
110
  - Total iterations run
94
111
  - Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."
112
+
113
+ If the loop stopped due to stagnation AND `.harness-evolver/architecture.json` does NOT exist:
114
+
115
+ > The proposer may have hit an architectural ceiling. Run `/harness-evolver:architect`
116
+ > to analyze whether a different agent topology could help.
@@ -57,6 +57,21 @@ Add `--harness-config config.json` if a config exists.
57
57
  - Baseline score
58
58
  - Next: `harness-evolver:evolve` to start
59
59
 
60
+ ## Architecture Hint
61
+
62
+ After init completes, run a quick architecture analysis:
63
+
64
+ ```bash
65
+ python3 $TOOLS/analyze_architecture.py --harness .harness-evolver/baseline/harness.py
66
+ ```
67
+
68
+ If the analysis suggests the current topology may not be optimal for the task complexity, mention it:
69
+
70
+ > Architecture note: Current topology is "{topology}". For tasks with {characteristics},
71
+ > consider running `/harness-evolver:architect` for a detailed recommendation.
72
+
73
+ This is advisory only — do not spawn the architect agent.
74
+
60
75
  ## Gotchas
61
76
 
62
77
  - The harness must write valid JSON to `--output`. If the user's code returns non-JSON, the wrapper must serialize it.
package/tools/analyze_architecture.py ADDED
@@ -0,0 +1,512 @@
+ #!/usr/bin/env python3
+ """Analyze harness architecture to detect current topology and produce signals.
+
+ Usage:
+     analyze_architecture.py --harness PATH [--traces-dir PATH] [--summary PATH] [-o output.json]
+
+ Performs AST-based analysis of harness code, optional trace analysis, and optional
+ score analysis to classify the current agent topology and produce structured signals
+ for the architect agent.
+
+ Stdlib-only. No external dependencies.
+ """
+
+ import argparse
+ import ast
+ import json
+ import os
+ import re
+ import sys
+
+
+ # --- AST Analysis ---
+
+ LLM_API_DOMAINS = [
+     "api.anthropic.com",
+     "api.openai.com",
+     "generativelanguage.googleapis.com",
+ ]
+
+ LLM_SDK_MODULES = {"openai", "anthropic", "langchain_openai", "langchain_anthropic",
+                    "langchain_core", "langchain_community", "langchain"}
+
+ RETRIEVAL_MODULES = {"chromadb", "pinecone", "qdrant_client", "weaviate"}
+
+ RETRIEVAL_METHOD_NAMES = {"similarity_search", "query"}
+
+ GRAPH_FRAMEWORK_CLASSES = {"StateGraph"}
+ GRAPH_FRAMEWORK_METHODS = {"add_node", "add_edge"}
+
+ PARALLEL_PATTERNS = {"gather"}  # asyncio.gather
+ PARALLEL_CLASSES = {"ThreadPoolExecutor", "ProcessPoolExecutor"}
+
+ TOOL_DICT_KEYS = {"name", "description", "parameters"}
+
+
+ def _get_all_imports(tree):
+     """Extract all imported module root names."""
+     imports = set()
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Import):
+             for alias in node.names:
+                 imports.add(alias.name.split(".")[0])
+         elif isinstance(node, ast.ImportFrom):
+             if node.module:
+                 imports.add(node.module.split(".")[0])
+     return imports
+
+
+ def _get_all_import_modules(tree):
+     """Extract all imported module full names (including submodules)."""
+     modules = set()
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Import):
+             for alias in node.names:
+                 modules.add(alias.name)
+         elif isinstance(node, ast.ImportFrom):
+             if node.module:
+                 modules.add(node.module)
+     return modules
+
+
+ def _count_string_matches(tree, patterns):
+     """Count AST string constants that contain any of the given patterns."""
+     count = 0
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Constant) and isinstance(node.value, str):
+             for pattern in patterns:
+                 if pattern in node.value:
+                     count += 1
+                     break
+     return count
+
+
+ def _count_llm_calls(tree, imports, source_text):
+     """Count LLM API calls: urllib requests to known domains + SDK client calls."""
+     count = 0
+
+     # Count urllib.request calls with LLM API domains in string constants
+     count += _count_string_matches(tree, LLM_API_DOMAINS)
+
+     # Count SDK imports that imply LLM calls (each import of an LLM SDK = at least 1 call site)
+     full_modules = _get_all_import_modules(tree)
+     sdk_found = set()
+     for mod in full_modules:
+         root = mod.split(".")[0]
+         if root in LLM_SDK_MODULES:
+             sdk_found.add(root)
+
+     # For SDK users, look for actual call patterns like .create, .chat, .invoke, .run
+     llm_call_methods = {"create", "chat", "invoke", "run", "generate", "predict",
+                         "complete", "completions"}
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Call):
+             if isinstance(node.func, ast.Attribute):
+                 if node.func.attr in llm_call_methods and sdk_found:
+                     count += 1
+
+     # If we found SDK imports but no explicit call methods, count 1 per SDK
+     if sdk_found and count == 0:
+         count = len(sdk_found)
+
+     return count  # already includes the domain-string matches counted above
+
+
+ def _has_loop_around_llm(tree, source_text):
+     """Check if any LLM call is inside a loop (for/while)."""
+     for node in ast.walk(tree):
+         if isinstance(node, (ast.For, ast.While)):
+             # Walk the loop body looking for LLM call signals
+             for child in ast.walk(node):
+                 # Check for urllib.request.urlopen in a loop
+                 if isinstance(child, ast.Attribute) and child.attr == "urlopen":
+                     return True
+                 # Check for SDK call methods in a loop
+                 if isinstance(child, ast.Call) and isinstance(child.func, ast.Attribute):
+                     if child.func.attr in {"create", "chat", "invoke", "run",
+                                            "generate", "predict", "complete"}:
+                         return True
+                 # Check for LLM API domain strings in a loop
+                 if isinstance(child, ast.Constant) and isinstance(child.value, str):
+                     for domain in LLM_API_DOMAINS:
+                         if domain in child.value:
+                             return True
+     return False
+
+
+ def _has_tool_definitions(tree):
+     """Check for tool definitions: dicts with name/description/parameters keys, or @tool decorators."""
+     # Check for @tool decorator (sync or async functions)
+     for node in ast.walk(tree):
+         if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
+             for decorator in node.decorator_list:
+                 if isinstance(decorator, ast.Name) and decorator.id == "tool":
+                     return True
+                 if isinstance(decorator, ast.Attribute) and decorator.attr == "tool":
+                     return True
+
+     # Check for dicts with tool-like keys
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Dict):
+             keys = set()
+             for key in node.keys:
+                 if isinstance(key, ast.Constant) and isinstance(key.value, str):
+                     keys.add(key.value)
+             if TOOL_DICT_KEYS.issubset(keys):
+                 return True
+
+     return False
+
+
+ def _has_retrieval(tree, imports):
+     """Check for retrieval patterns: vector DB imports or .similarity_search/.query calls."""
+     if imports & RETRIEVAL_MODULES:
+         return True
+
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Attribute):
+             if node.attr in RETRIEVAL_METHOD_NAMES:
+                 return True
+
+     return False
+
+
+ def _has_graph_framework(tree, full_modules):
+     """Check for graph framework usage (LangGraph StateGraph, add_node, add_edge)."""
+     # Check if langgraph is imported
+     for mod in full_modules:
+         if "langgraph" in mod:
+             return True
+
+     # Check for StateGraph usage
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Name) and node.id in GRAPH_FRAMEWORK_CLASSES:
+             return True
+         if isinstance(node, ast.Attribute):
+             if node.attr in GRAPH_FRAMEWORK_CLASSES or node.attr in GRAPH_FRAMEWORK_METHODS:
+                 return True
+
+     return False
+
+
+ def _has_parallel_execution(tree, imports):
+     """Check for asyncio.gather, concurrent.futures, ThreadPoolExecutor."""
+     if "concurrent" in imports:
+         return True
+
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Attribute):
+             if node.attr == "gather":
+                 return True
+             if node.attr in PARALLEL_CLASSES:
+                 return True
+         if isinstance(node, ast.Name) and node.id in PARALLEL_CLASSES:
+             return True
+
+     return False
+
+
+ def _has_error_handling_around_llm(tree):
+     """Check if LLM calls are wrapped in try/except."""
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Try):
+             # Walk the try body for LLM signals
+             for child in ast.walk(node):
+                 if isinstance(child, ast.Attribute) and child.attr == "urlopen":
+                     return True
+                 if isinstance(child, ast.Call) and isinstance(child.func, ast.Attribute):
+                     if child.func.attr in {"create", "chat", "invoke", "run",
+                                            "generate", "predict", "complete"}:
+                         return True
+                 if isinstance(child, ast.Constant) and isinstance(child.value, str):
+                     for domain in LLM_API_DOMAINS:
+                         if domain in child.value:
+                             return True
+     return False
+
+
+ def _count_functions(tree):
+     """Count function definitions (top-level and nested)."""
+     count = 0
+     for node in ast.walk(tree):
+         if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
+             count += 1
+     return count
+
+
+ def _count_classes(tree):
+     """Count class definitions."""
+     count = 0
+     for node in ast.walk(tree):
+         if isinstance(node, ast.ClassDef):
+             count += 1
+     return count
+
+
+ def _estimate_topology(signals):
+     """Classify the current topology based on code signals."""
+     if signals["has_graph_framework"]:
+         if signals["has_parallel_execution"]:
+             return "parallel"
+         return "hierarchical"
+
+     if signals["has_retrieval"]:
+         return "rag"
+
+     if signals["has_loop_around_llm"]:
+         # A loop around the LLM reads as a ReAct-style agent, with or without tools
+         return "react-loop"
+
+     if signals["llm_call_count"] >= 3:
+         if signals["has_tool_definitions"]:
+             return "react-loop"
+         return "chain"
+
+     if signals["llm_call_count"] == 2:
+         return "chain"
+
+     return "single-call"
+
+
+ def analyze_code(harness_path):
+     """Analyze a harness Python file and return code signals."""
+     with open(harness_path) as f:
+         source = f.read()
+
+     try:
+         tree = ast.parse(source)
+     except SyntaxError:
+         return {
+             "llm_call_count": 0,
+             "has_loop_around_llm": False,
+             "has_tool_definitions": False,
+             "has_retrieval": False,
+             "has_graph_framework": False,
+             "has_parallel_execution": False,
+             "has_error_handling": False,
+             "estimated_topology": "unknown",
+             "code_lines": len(source.splitlines()),
+             "function_count": 0,
+             "class_count": 0,
+         }
+
+     imports = _get_all_imports(tree)
+     full_modules = _get_all_import_modules(tree)
+
+     llm_call_count = _count_llm_calls(tree, imports, source)
+     has_loop = _has_loop_around_llm(tree, source)
+     has_tools = _has_tool_definitions(tree)
+     has_retrieval = _has_retrieval(tree, imports)
+     has_graph = _has_graph_framework(tree, full_modules)
+     has_parallel = _has_parallel_execution(tree, imports)
+     has_error = _has_error_handling_around_llm(tree)
+
+     signals = {
+         "llm_call_count": llm_call_count,
+         "has_loop_around_llm": has_loop,
+         "has_tool_definitions": has_tools,
+         "has_retrieval": has_retrieval,
+         "has_graph_framework": has_graph,
+         "has_parallel_execution": has_parallel,
+         "has_error_handling": has_error,
+         "code_lines": len(source.splitlines()),
+         "function_count": _count_functions(tree),
+         "class_count": _count_classes(tree),
+     }
+     signals["estimated_topology"] = _estimate_topology(signals)
+
+     return signals
+
+
+ # --- Trace Analysis ---
+
+ def analyze_traces(traces_dir):
+     """Analyze execution traces for error patterns, timing, and failures."""
+     if not os.path.isdir(traces_dir):
+         return None
+
+     result = {
+         "error_patterns": [],
+         "timing": None,
+         "task_failures": [],
+         "stderr_lines": 0,
+     }
+
+     # Read stderr.log
+     stderr_path = os.path.join(traces_dir, "stderr.log")
+     if os.path.isfile(stderr_path):
+         try:
+             with open(stderr_path) as f:
+                 stderr = f.read()
+             lines = stderr.strip().splitlines()
+             result["stderr_lines"] = len(lines)
+
+             # Detect common error patterns
+             error_counts = {}
+             for line in lines:
+                 for pattern in ["Traceback", "Error", "Exception", "Timeout",
+                                 "ConnectionRefused", "HTTPError", "JSONDecodeError",
+                                 "KeyError", "TypeError", "ValueError"]:
+                     if pattern in line:
+                         error_counts[pattern] = error_counts.get(pattern, 0) + 1
+
+             result["error_patterns"] = [
+                 {"pattern": p, "count": c}
+                 for p, c in sorted(error_counts.items(), key=lambda x: -x[1])
+             ]
+         except Exception:
+             pass
+
+     # Read timing.json
+     timing_path = os.path.join(traces_dir, "timing.json")
+     if os.path.isfile(timing_path):
+         try:
+             with open(timing_path) as f:
+                 timing = json.load(f)
+             result["timing"] = timing
+         except Exception:
+             pass
+
+     # Scan per-task output directories for failures
+     for entry in sorted(os.listdir(traces_dir)):
+         task_dir = os.path.join(traces_dir, entry)
+         if os.path.isdir(task_dir) and entry.startswith("task_"):
+             output_path = os.path.join(task_dir, "output.json")
+             if os.path.isfile(output_path):
+                 try:
+                     with open(output_path) as f:
+                         output = json.load(f)
+                     # Check for empty or error outputs
+                     out_value = output.get("output", "")
+                     if not out_value or out_value in ("error", "unknown"):
+                         result["task_failures"].append({
+                             "task": entry,
+                             "output": out_value,
+                         })
+                 except Exception:
+                     result["task_failures"].append({
+                         "task": entry,
+                         "output": "parse_error",
+                     })
+
+     return result
+
+
+ # --- Score Analysis ---
+
+ def analyze_scores(summary_path):
+     """Analyze summary.json for stagnation, oscillation, and per-task failures."""
+     if not os.path.isfile(summary_path):
+         return None
+
+     try:
+         with open(summary_path) as f:
+             summary = json.load(f)
+     except Exception:
+         return None
+
+     result = {
+         "iterations": summary.get("iterations", 0),
+         "best_score": 0.0,
+         "baseline_score": 0.0,
+         "recent_scores": [],
+         "is_stagnating": False,
+         "is_oscillating": False,
+         "score_trend": "unknown",
+     }
+
+     # Extract best score
+     best = summary.get("best", {})
+     result["best_score"] = best.get("combined_score", 0.0)
+     result["baseline_score"] = summary.get("baseline_score", 0.0)
+
+     # Extract recent version scores
+     versions = summary.get("versions", [])
+     if isinstance(versions, list):
+         recent = versions[-5:] if len(versions) > 5 else versions
+         result["recent_scores"] = [
+             {"version": v.get("version", "?"), "score": v.get("combined_score", 0.0)}
+             for v in recent
+         ]
+     elif isinstance(versions, dict):
+         items = sorted(versions.items())
+         recent = items[-5:] if len(items) > 5 else items
+         result["recent_scores"] = [
+             {"version": k, "score": v.get("combined_score", 0.0)}
+             for k, v in recent
+         ]
+
+     # Detect stagnation (last 3+ scores within 1% of each other)
+     scores = [s["score"] for s in result["recent_scores"]]
+     if len(scores) >= 3:
+         last_3 = scores[-3:]
+         spread = max(last_3) - min(last_3)
+         if spread <= 0.01:
+             result["is_stagnating"] = True
+
+     # Detect oscillation (alternating up/down for last 4+ scores)
+     if len(scores) >= 4:
+         deltas = [scores[i+1] - scores[i] for i in range(len(scores)-1)]
+         sign_changes = sum(
+             1 for i in range(len(deltas)-1)
+             if (deltas[i] > 0 and deltas[i+1] < 0) or (deltas[i] < 0 and deltas[i+1] > 0)
+         )
+         if sign_changes >= len(deltas) - 1:
+             result["is_oscillating"] = True
+
+     # Score trend
+     if len(scores) >= 2:
+         if scores[-1] > scores[0]:
+             result["score_trend"] = "improving"
+         elif scores[-1] < scores[0]:
+             result["score_trend"] = "declining"
+         else:
+             result["score_trend"] = "flat"
+
+     return result
+
+
+ # --- Main ---
+
+ def main():
+     parser = argparse.ArgumentParser(
+         description="Analyze harness architecture and produce signals for the architect agent",
+         usage="analyze_architecture.py --harness PATH [--traces-dir PATH] [--summary PATH] [-o output.json]",
+     )
+     parser.add_argument("--harness", required=True, help="Path to harness Python file")
+     parser.add_argument("--traces-dir", default=None, help="Path to traces directory")
+     parser.add_argument("--summary", default=None, help="Path to summary.json")
+     parser.add_argument("-o", "--output", default=None, help="Output JSON path")
+     args = parser.parse_args()
+
+     if not os.path.isfile(args.harness):
+         print(json.dumps({"error": f"Harness file not found: {args.harness}"}))
+         sys.exit(1)
+
+     result = {
+         "code_signals": analyze_code(args.harness),
+         "trace_signals": None,
+         "score_signals": None,
+     }
+
+     if args.traces_dir:
+         result["trace_signals"] = analyze_traces(args.traces_dir)
+
+     if args.summary:
+         result["score_signals"] = analyze_scores(args.summary)
+
+     output = json.dumps(result, indent=2)
+
+     if args.output:
+         with open(args.output, "w") as f:
+             f.write(output + "\n")
+     else:
+         print(output)
+
+
+ if __name__ == "__main__":
+     main()
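For a simple one-shot harness, the emitted signals document looks roughly like this (values are illustrative; the keys match the `analyze_code` signals above):

```json
{
  "code_signals": {
    "llm_call_count": 1,
    "has_loop_around_llm": false,
    "has_tool_definitions": false,
    "has_retrieval": false,
    "has_graph_framework": false,
    "has_parallel_execution": false,
    "has_error_handling": true,
    "code_lines": 120,
    "function_count": 3,
    "class_count": 0,
    "estimated_topology": "single-call"
  },
  "trace_signals": null,
  "score_signals": null
}
```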
package/tools/init.py CHANGED
@@ -317,6 +317,29 @@ def main():
     print("\nRecommendation: install Context7 MCP for up-to-date documentation:")
     print(" claude mcp add context7 -- npx -y @upstash/context7-mcp@latest")

+     # Architecture analysis (quick, advisory)
+     analyze_py = os.path.join(tools, "analyze_architecture.py")
+     if os.path.exists(analyze_py):
+         try:
+             r = subprocess.run(
+                 ["python3", analyze_py, "--harness", args.harness],
+                 capture_output=True, text=True, timeout=30,
+             )
+             if r.returncode == 0 and r.stdout.strip():
+                 arch_signals = json.loads(r.stdout)
+                 config["architecture"] = {
+                     "current_topology": arch_signals.get("code_signals", {}).get("estimated_topology", "unknown"),
+                     "auto_analyzed": True,
+                 }
+                 # Re-write config with architecture
+                 with open(os.path.join(base, "config.json"), "w") as f:
+                     json.dump(config, f, indent=2)
+                 topo = config["architecture"]["current_topology"]
+                 if topo != "unknown":
+                     print(f"Architecture: {topo}")
+         except Exception:
+             pass
+
     # 5. Validate baseline harness
     print("Validating baseline harness...")
     val_args = ["python3", evaluate_py, "validate",