harness-evolver 0.8.0 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -4,46 +4,102 @@ End-to-end optimization of LLM agent harnesses, inspired by [Meta-Harness](https
 
  **The harness is the 80% factor.** Changing just the scaffolding around a fixed LLM can produce a [6x performance gap](https://arxiv.org/abs/2603.28052) on the same benchmark. Harness Evolver automates the search for better harnesses using an autonomous propose-evaluate-iterate loop with full execution traces as feedback.
 
- ## Why
+ ## Install
 
- Manual harness engineering is slow and doesn't scale. Existing optimizers work in prompt-space (OPRO, TextGrad, GEPA) or use compressed summaries. Meta-Harness showed that **code-space search with full diagnostic context** (10M+ tokens of traces) outperforms all of them by 10+ points.
+ ```bash
+ npx harness-evolver@latest
+ ```
 
- Harness Evolver brings that approach to any domain as a Claude Code plugin.
+ Select your runtime (Claude Code, Cursor, Codex, Windsurf) and scope (global/local). Then **restart your AI coding agent** for the skills to appear.
 
- ## Install
+ ## Prerequisites
+
+ ### API Keys (set in your shell before launching Claude Code)
+
+ The harness you're evolving may call LLM APIs. Set the keys your harness needs:
 
  ```bash
- # Via npx (recommended)
- npx harness-evolver@latest
+ # Required: at least one LLM provider
+ export ANTHROPIC_API_KEY="sk-ant-..." # For Claude-based harnesses
+ export OPENAI_API_KEY="sk-..." # For OpenAI-based harnesses
+ export GEMINI_API_KEY="AIza..." # For Gemini-based harnesses
+ export OPENROUTER_API_KEY="sk-or-..." # For OpenRouter (multi-model)
+
+ # Optional: enhanced tracing
+ export LANGSMITH_API_KEY="lsv2_pt_..." # Auto-enables LangSmith tracing
+ ```
+
+ The plugin auto-detects which keys are available during `/harness-evolver:init` and shows them, so the proposer agent knows which APIs it can use.
+
+ **No API key needed for the example** — the classifier example uses keyword matching (mock mode), no LLM calls.
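The detection amounts to checking which well-known key names are set and non-empty in the environment. A minimal sketch of the idea (key names taken from the export list above; the plugin's actual logic may differ):

```python
import os

# Provider keys the plugin looks for (names from the export list above)
KNOWN_KEYS = [
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
    "GEMINI_API_KEY",
    "OPENROUTER_API_KEY",
    "LANGSMITH_API_KEY",
]

def detect_api_keys(env=None):
    """Return the known keys that are set and non-empty."""
    env = os.environ if env is None else env
    return [k for k in KNOWN_KEYS if env.get(k)]
```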
+
+ ### Optional: Enhanced Integrations
+
+ ```bash
+ # LangSmith — rich trace analysis for the proposer
+ uv tool install langsmith-cli && langsmith-cli auth login
+
+ # Context7 — up-to-date library documentation for the proposer
+ claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
 
- # Or as a Claude Code plugin
- /plugin install harness-evolver
+ # LangChain Docs — LangChain/LangGraph-specific documentation
+ claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
  ```
 
  ## Quick Start
 
+ ### Try the Example (no API key needed)
+
  ```bash
- # 1. Copy the example into a working directory
+ # 1. Copy the example
  cp -r ~/.harness-evolver/examples/classifier ./my-classifier
  cd my-classifier
 
- # 2. Initialize (validates harness, evaluates baseline)
- /harness-evolve-init --harness harness.py --eval eval.py --tasks tasks/
+ # 2. Open Claude Code
+ claude
+
+ # 3. Initialize — auto-detects harness.py, eval.py, tasks/
+ /harness-evolver:init
+
+ # 4. Run the evolution loop
+ /harness-evolver:evolve --iterations 3
 
- # 3. Run the evolution loop
- /harness-evolve --iterations 5
+ # 5. Check progress
+ /harness-evolver:status
+ ```
+
+ ### Use with Your Own Project
+
+ ```bash
+ cd my-llm-project
+ claude
 
- # 4. Check progress anytime
- /harness-evolve-status
+ # Init scans your project, identifies the entry point,
+ # and helps create harness wrapper + eval + tasks if missing
+ /harness-evolver:init
+
+ # Run optimization
+ /harness-evolver:evolve --iterations 10
  ```
 
- The classifier example runs in mock mode (no API key needed) and demonstrates the full loop in under 2 minutes.
+ The init skill adapts to your project: if you have `graph.py` instead of `harness.py`, it creates a thin wrapper. If you don't have an eval script, it helps you write one.
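Such a wrapper is just a translation layer between the CLI contract and your existing code. A hypothetical sketch, assuming your `graph.py` exposes a `run(text) -> str` entry point (names are illustrative, not what init generates verbatim):

```python
#!/usr/bin/env python3
"""Illustrative thin wrapper: adapts an existing entry point to the harness CLI."""
import argparse
import json

try:
    from graph import run  # hypothetical: your project's existing entry point
except ImportError:        # fallback so this sketch runs standalone
    def run(text):
        return text.upper()

def main():
    p = argparse.ArgumentParser()
    p.add_argument("--input", required=True)
    p.add_argument("--output", required=True)
    p.add_argument("--traces-dir")  # accepted but unused in this sketch
    p.add_argument("--config")
    args = p.parse_args()

    with open(args.input) as f:
        task = json.load(f)  # {id, input, metadata}

    result = run(task["input"])  # delegate to the existing code

    with open(args.output, "w") as f:
        json.dump({"id": task["id"], "output": result}, f)

if __name__ == "__main__":
    main()
```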
+
+ ## Available Commands
+
+ | Command | What it does |
+ |---|---|
+ | `/harness-evolver:init` | Scan project, create harness/eval/tasks, run baseline |
+ | `/harness-evolver:evolve` | Run the autonomous optimization loop |
+ | `/harness-evolver:status` | Show progress (scores, iterations, stagnation) |
+ | `/harness-evolver:compare` | Diff two versions with per-task analysis |
+ | `/harness-evolver:diagnose` | Deep trace analysis of a specific version |
+ | `/harness-evolver:deploy` | Copy the best harness back to your project |
 
  ## How It Works
 
  ```
  ┌─────────────────────────────┐
- /harness-evolve
+ /harness-evolver:evolve
  │ (orchestrator skill) │
  └──────────┬──────────────────┘
 
@@ -63,10 +119,10 @@ The classifier example runs in mock mode (no API key needed) and demonstrates th
  scores.json
  ```
 
- 1. **Propose** — A proposer agent (Claude Code subagent) reads all prior candidates' code, execution traces, and scores. It diagnoses failure modes via counterfactual analysis and writes a new harness.
- 2. **Evaluate** — The harness runs against every task. Traces are captured per-task (input, output, stdout, stderr, timing). The user's eval script scores the results.
+ 1. **Propose** — A proposer agent reads all prior candidates' code, execution traces, and scores. It diagnoses failure modes via counterfactual analysis and writes a new harness.
+ 2. **Evaluate** — The harness runs against every task. Traces are captured per-task (input, output, stdout, stderr, timing). Your eval script scores the results.
  3. **Update** — State files are updated with the new score, parent lineage, and regression detection.
- 4. **Repeat** — The loop continues until N iterations, stagnation (3 rounds without >1% improvement), or a target score is reached.
+ 4. **Repeat** — The loop runs until N iterations, stagnation (3 rounds without >1% improvement), or the target score is reached.
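The stopping logic can be sketched in a few lines. This is control flow only, not the plugin's implementation; `propose` and `evaluate` stand in for the proposer agent and the evaluation tool:

```python
def evolve(propose, evaluate, max_iterations, target=None,
           stagnation_limit=3, min_gain=0.01):
    """Sketch: propose-evaluate-update loop with stagnation and target stops."""
    best, stale, history = 0.0, 0, []
    for _ in range(max_iterations):
        candidate = propose(history)        # proposer writes a new harness
        score = evaluate(candidate)         # run all tasks, capture traces, score
        history.append((candidate, score))  # state update: lineage, regressions
        if score > best * (1 + min_gain):   # >1% relative improvement
            best, stale = score, 0
        else:
            stale += 1
        if target is not None and best >= target:
            break                           # target score reached
        if stale >= stagnation_limit:
            break                           # e.g. 3 rounds without >1% improvement
    return best
```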
 
  ## The Harness Contract
 
@@ -78,8 +134,8 @@ python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--
 
  - `--input`: JSON with `{id, input, metadata}` (never sees expected answers)
  - `--output`: JSON with `{id, output}`
- - `--traces-dir`: optional directory for the harness to write rich traces
- - `--config`: optional JSON with evolvable parameters (model, temperature, etc.)
+ - `--traces-dir`: optional directory for rich traces
+ - `--config`: optional JSON with evolvable parameters
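A minimal conforming harness fits in a screenful. This sketch uses mock keyword logic (in the spirit of the bundled classifier example) rather than an LLM call:

```python
#!/usr/bin/env python3
"""Minimal harness satisfying the CLI contract above (mock logic, no LLM)."""
import argparse
import json

def solve(text, config):
    # Stand-in for the real logic (an LLM call, a chain, a graph, ...)
    return "urgent" if "chest pain" in text.lower() else "routine"

def main():
    p = argparse.ArgumentParser()
    p.add_argument("--input", required=True)
    p.add_argument("--output", required=True)
    p.add_argument("--traces-dir")  # optional: write rich traces here
    p.add_argument("--config")      # optional: evolvable parameters
    args = p.parse_args()

    with open(args.input) as f:
        task = json.load(f)  # {id, input, metadata} — never the expected answer

    config = {}
    if args.config:
        with open(args.config) as f:
            config = json.load(f)

    with open(args.output, "w") as f:
        json.dump({"id": task["id"], "output": solve(task["input"], config)}, f)

if __name__ == "__main__":
    main()
```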
 
  The eval script is also any executable:
 
@@ -87,165 +143,104 @@ The eval script is also any executable:
  python3 eval.py --results-dir results/ --tasks-dir tasks/ --scores scores.json
  ```
 
- This means Harness Evolver works with **any language, any framework, any domain**.
+ Works with **any language, any framework, any domain**.
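A matching eval script is just as small. This exact-match sketch assumes each task file carries an `expected` field and each result is written as `{task_id}.json` in the results directory — both assumptions about layout, since the real contract leaves scoring and file naming to your own scripts:

```python
#!/usr/bin/env python3
"""Exact-match eval script sketch for the contract above (assumed file layout)."""
import argparse
import json
import os

def score(results_dir, tasks_dir):
    """Score each task 1.0 on exact match, 0.0 otherwise; return overall mean."""
    per_task = {}
    for name in sorted(os.listdir(tasks_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(tasks_dir, name)) as f:
            task = json.load(f)
        result_path = os.path.join(results_dir, f"{task['id']}.json")  # assumed layout
        try:
            with open(result_path) as f:
                result = json.load(f)
            per_task[task["id"]] = 1.0 if result["output"] == task["expected"] else 0.0
        except (FileNotFoundError, KeyError, json.JSONDecodeError):
            per_task[task["id"]] = 0.0  # missing or malformed result scores zero
    overall = sum(per_task.values()) / len(per_task) if per_task else 0.0
    return {"overall": overall, "per_task": per_task}

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--results-dir", required=True)
    p.add_argument("--tasks-dir", required=True)
    p.add_argument("--scores", required=True)
    args = p.parse_args()
    with open(args.scores, "w") as f:
        json.dump(score(args.results_dir, args.tasks_dir), f, indent=2)
```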
 
- ## Project Structure
+ ## Project Structure (after init)
 
  ```
- .harness-evolver/ # Created in your project by /harness-evolve-init
- ├── config.json # Project config (harness cmd, eval cmd, evolution params)
+ .harness-evolver/ # Created by /harness-evolver:init
+ ├── config.json # Project config (harness cmd, eval, API keys detected)
  ├── summary.json # Source of truth (versions, scores, parents)
- ├── STATE.md # Human-readable status (generated)
+ ├── STATE.md # Human-readable status
  ├── PROPOSER_HISTORY.md # Log of all proposals and outcomes
- ├── baseline/ # Original harness (read-only reference)
- ├── harness.py
- │ └── config.json
+ ├── baseline/ # Original harness (read-only)
+ │ └── harness.py
  ├── eval/
- │ ├── eval.py # Scoring script
- │ └── tasks/ # Test cases (JSON files)
+ │ ├── eval.py # Your scoring script
+ │ └── tasks/ # Test cases
  └── harnesses/
  └── v001/
- ├── harness.py # Candidate code
- ├── config.json # Evolvable parameters
- ├── proposal.md # Proposer's reasoning
- ├── scores.json # Evaluation results
+ ├── harness.py # Evolved candidate
+ ├── proposal.md # Why this version was created
+ ├── scores.json # How it scored
  └── traces/ # Full execution traces
  ├── stdout.log
  ├── stderr.log
  ├── timing.json
  └── task_001/
- ├── input.json # What the harness received
- └── output.json # What the harness returned
+ ├── input.json
+ └── output.json
  ```
 
- ## Plugin Architecture
+ ## The Proposer
 
- Three-layer design inspired by [GSD](https://github.com/gsd-build/get-shit-done):
+ The proposer is the core of the system. It follows a 4-phase workflow from the Meta-Harness paper:
 
- ```
- Layer 1: Skills + Agents (markdown) → AI orchestration
- Layer 2: Tools (Python stdlib-only) → Deterministic operations
- Layer 3: Installer (Node.js) → Distribution via npx
- ```
+ | Phase | What it does |
+ |---|---|
+ | **Orient** | Read `summary.json` + `PROPOSER_HISTORY.md`. Pick 2-3 versions to investigate. |
+ | **Diagnose** | Deep trace analysis. grep for errors, diff versions, counterfactual diagnosis. |
+ | **Propose** | Write new harness. Prefer additive changes after regressions. |
+ | **Document** | Write `proposal.md` with evidence. Update history. |
 
- | Component | Files | Purpose |
- |---|---|---|
- | **Skills** | `skills/harness-evolve-init/`, `skills/harness-evolve/`, `skills/harness-evolve-status/` | Slash commands that orchestrate the loop |
- | **Agent** | `agents/harness-evolver-proposer.md` | The proposer — 4-phase workflow (orient, diagnose, propose, document) with 6 rules |
- | **Tools** | `tools/evaluate.py`, `tools/state.py`, `tools/init.py`, `tools/detect_stack.py`, `tools/trace_logger.py` | CLI tools called via subprocess — zero LLM tokens spent on deterministic work |
- | **Installer** | `bin/install.js`, `package.json` | Copies skills/agents/tools to the right locations |
- | **Example** | `examples/classifier/` | 10-task medical classifier with mock mode |
+ **7 rules:** evidence-based changes, conservative after regression, don't repeat mistakes, one hypothesis at a time, maintain interface, prefer readability, use available API keys from environment.
 
  ## Integrations
 
- ### LangSmith (optional)
-
- If `LANGSMITH_API_KEY` is set, the plugin automatically:
- - Enables `LANGCHAIN_TRACING_V2` for auto-tracing of LangChain/LangGraph harnesses
- - Detects [langsmith-cli](https://github.com/gigaverse-app/langsmith-cli) for the proposer to query traces directly
+ ### LangSmith (optional, recommended for LangChain/LangGraph harnesses)
 
  ```bash
- # Setup
  export LANGSMITH_API_KEY=lsv2_...
  uv tool install langsmith-cli && langsmith-cli auth login
-
- # The proposer can then do:
- langsmith-cli --json runs list --project harness-evolver-v003 --failed --fields id,name,error
- langsmith-cli --json runs stats --project harness-evolver-v003
  ```
 
- No custom API client — the proposer uses `langsmith-cli` like it uses `grep` and `diff`.
-
- ### Context7 (optional)
-
- The plugin detects the harness's technology stack via AST analysis (17 libraries supported) and instructs the proposer to consult current documentation before proposing API changes.
+ When `LANGSMITH_API_KEY` is detected, the plugin:
+ - Sets `LANGCHAIN_TRACING_V2=true` automatically — all LLM calls are traced
+ - Lets the proposer query traces directly via `langsmith-cli`:
 
  ```bash
- # Setup
- claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
-
- # The proposer automatically:
- # 1. Reads config.json → stack.detected (e.g., LangChain, ChromaDB)
- # 2. Queries Context7 for current docs before writing code
- # 3. Annotates proposal.md with "API verified via Context7"
+ langsmith-cli --json runs list --project harness-evolver-v003 --failed --fields id,name,error
+ langsmith-cli --json runs stats --project harness-evolver-v003
  ```
 
- Without Context7, the proposer uses model knowledge and annotates "API not verified against current docs."
-
- ### LangChain Docs MCP (optional)
+ ### Context7 (optional, recommended for any library-heavy harness)
 
  ```bash
- claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
+ claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
  ```
 
- Complements Context7 with LangChain/LangGraph/LangSmith-specific documentation search.
-
- ## The Proposer
-
- The proposer agent is the core of the system. It follows a 4-phase workflow derived from the Meta-Harness paper:
-
- | Phase | Context % | What it does |
- |---|---|---|
- | **Orient** | ~6% | Read `summary.json` and `PROPOSER_HISTORY.md`. Decide which 2-3 versions to investigate. |
- | **Diagnose** | ~80% | Deep trace analysis on selected versions. grep for errors, diff between good/bad versions, counterfactual diagnosis. |
- | **Propose** | ~10% | Write new `harness.py` + `config.json`. Prefer additive changes after regressions. |
- | **Document** | ~4% | Write `proposal.md` with evidence. Append to `PROPOSER_HISTORY.md`. |
-
- **6 rules:**
- 1. Every change motivated by evidence (cite task ID, trace line, or score delta)
- 2. After regression, prefer additive changes
- 3. Don't repeat past mistakes (read PROPOSER_HISTORY.md)
- 4. One hypothesis at a time when possible
- 5. Maintain the CLI interface
- 6. Prefer readable harnesses over defensive ones
-
- ## Supported Libraries (Stack Detection)
-
- The AST-based stack detector recognizes 17 libraries:
-
- | Category | Libraries |
- |---|---|
- | **AI Frameworks** | LangChain, LangGraph, LlamaIndex, OpenAI, Anthropic, DSPy, CrewAI, AutoGen |
- | **Vector Stores** | ChromaDB, Pinecone, Qdrant, Weaviate |
- | **Web** | FastAPI, Flask, Pydantic |
- | **Data** | Pandas, NumPy |
+ The plugin detects your stack via AST analysis (17 libraries: LangChain, LangGraph, OpenAI, Anthropic, ChromaDB, FastAPI, etc.) and instructs the proposer to consult current docs before proposing API changes.
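Import-based AST detection is straightforward; a condensed sketch of the idea (the module-to-library map here is a small illustrative subset, not the plugin's full 17-entry table):

```python
import ast

# Illustrative subset of the module → library map
LIBRARY_MAP = {
    "langchain": "LangChain",
    "langgraph": "LangGraph",
    "openai": "OpenAI",
    "anthropic": "Anthropic",
    "chromadb": "ChromaDB",
    "fastapi": "FastAPI",
}

def detect_stack(source):
    """Return library names whose modules are imported in the given source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for mod in modules:
            root = mod.split(".")[0]  # "langchain.chains" → "langchain"
            if root in LIBRARY_MAP:
                found.add(LIBRARY_MAP[root])
    return sorted(found)
```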
 
  ## Development
 
  ```bash
- # Run all tests (41 tests, stdlib-only, no pip install needed)
+ # Run all tests (41 tests, stdlib-only)
  python3 -m unittest discover -s tests -v
 
- # Test the example manually
- cd examples/classifier
- python3 harness.py --input tasks/task_001.json --output /tmp/result.json --config config.json
- cat /tmp/result.json
+ # Test example manually
+ python3 examples/classifier/harness.py --input examples/classifier/tasks/task_001.json --output /tmp/result.json --config examples/classifier/config.json
 
- # Run the installer locally
+ # Install locally for development
  node bin/install.js
  ```
 
- ## Comparison with Related Work
+ ## Comparison
 
- | | Meta-Harness (paper) | A-Evolve | ECC /evolve | **Harness Evolver** |
+ | | Meta-Harness | A-Evolve | ECC | **Harness Evolver** |
  |---|---|---|---|---|
  | **Format** | Paper artifact | Framework (Docker) | Plugin (passive) | **Plugin (active)** |
- | **Search space** | Code-space | Code-space | Prompt-space | **Code-space** |
- | **Context/iter** | 10M tokens | Variable | N/A | **Full filesystem** |
+ | **Search** | Code-space | Code-space | Prompt-space | **Code-space** |
  | **Domain** | TerminalBench-2 | Coding benchmarks | Dev workflow | **Any domain** |
- | **Install** | Manual Python | Docker CLI | `/plugin install` | **`npx` or `/plugin install`** |
- | **LangSmith** | No | No | No | **Yes (langsmith-cli)** |
- | **Context7** | No | No | No | **Yes (MCP)** |
+ | **Install** | Manual Python | Docker CLI | `/plugin install` | **`npx`** |
+ | **LangSmith** | No | No | No | **Yes** |
+ | **Context7** | No | No | No | **Yes** |
 
  ## References
 
- - [Meta-Harness: End-to-End Optimization of Model Harnesses](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
- - [GSD (Get Shit Done)](https://github.com/gsd-build/get-shit-done) — CLI architecture inspiration
- - [LangSmith CLI](https://github.com/gigaverse-app/langsmith-cli) — Trace analysis for the proposer
- - [Context7](https://github.com/upstash/context7) — Documentation lookup via MCP
+ - [Meta-Harness paper (arxiv 2603.28052)](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
  - [Design Spec](docs/specs/2026-03-31-harness-evolver-design.md)
- - [LangSmith Integration Spec](docs/specs/2026-03-31-langsmith-integration.md)
- - [Context7 Integration Spec](docs/specs/2026-03-31-context7-integration.md)
+ - [LangSmith Integration](docs/specs/2026-03-31-langsmith-integration.md)
+ - [Context7 Integration](docs/specs/2026-03-31-context7-integration.md)
 
  ## License
 
@@ -0,0 +1,147 @@
+ ---
+ name: harness-evolver-architect
+ description: |
+   Use this agent when the harness-evolver:architect skill needs to analyze a harness
+   and recommend the optimal multi-agent topology. Reads code analysis signals, traces,
+   and scores to produce a migration plan from current to recommended architecture.
+ model: opus
+ ---
+
+ # Harness Evolver — Architect Agent
+
+ You are the architect in a Meta-Harness optimization system. Your job is to analyze a harness's current agent topology, assess whether it matches the task complexity, and recommend the optimal topology with a concrete migration plan.
+
+ ## Context
+
+ You work inside a `.harness-evolver/` directory. The skill has already run `analyze_architecture.py` to produce raw signals. You will read those signals, the harness code, and any evolution history to produce your recommendation.
+
+ ## Your Workflow
+
+ ### Phase 1: READ SIGNALS
+
+ 1. Read the raw signals JSON output from `analyze_architecture.py` (path provided in your prompt).
+ 2. Read the harness code:
+    - `.harness-evolver/baseline/harness.py` (always exists)
+    - The current best candidate from `summary.json` → `.harness-evolver/harnesses/{best}/harness.py` (if evolution has run)
+ 3. Read `config.json` for:
+    - `stack.detected` — what libraries/frameworks are in use
+    - `api_keys` — which LLM APIs are available
+    - `eval.langsmith` — whether tracing is enabled
+ 4. Read `summary.json` and `PROPOSER_HISTORY.md` if they exist (to understand evolution progress).
+
+ ### Phase 2: CLASSIFY & ASSESS
+
+ Classify the current topology from the code signals. The `estimated_topology` field is a starting point, but verify it by reading the actual code. Possible topologies:
+
+ | Topology | Description | Signals |
+ |---|---|---|
+ | `single-call` | One LLM call, no iteration | llm_calls=1, no loops, no tools |
+ | `chain` | Sequential LLM calls (analyze→generate→validate) | llm_calls>=2, no loops |
+ | `react-loop` | Tool use with iterative refinement | loop around LLM, tool definitions |
+ | `rag` | Retrieval-augmented generation | retrieval imports/methods |
+ | `judge-critic` | Generate then critique/verify | llm_calls>=2, one acts as judge |
+ | `hierarchical` | Decompose task, delegate to sub-agents | graph framework, multiple distinct agents |
+ | `parallel` | Same operation on multiple inputs concurrently | asyncio.gather, ThreadPoolExecutor |
+ | `sequential-routing` | Route different task types to different paths | conditional branching on task type |
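Read as code, the Signals column is a decision ladder. A simplified sketch over a signals dict (`llm_call_count` and `has_loop_around_llm` follow the signal names used elsewhere in this agent; the other field names are illustrative, and the real `analyze_architecture.py` heuristics are richer):

```python
def estimate_topology(signals):
    """Simplified decision ladder from raw code signals to a topology label."""
    if signals.get("has_retrieval"):
        return "rag"
    if signals.get("has_loop_around_llm") and signals.get("has_tools"):
        return "react-loop"
    if signals.get("has_parallel_exec"):  # asyncio.gather / ThreadPoolExecutor
        return "parallel"
    if signals.get("llm_call_count", 0) >= 2:
        return "judge-critic" if signals.get("has_judge") else "chain"
    return "single-call"
```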
+
+ Assess whether the current topology matches the task complexity:
+ - Read the eval tasks to understand what the harness needs to do
+ - Consider the current score — is there room for improvement?
+ - Consider the task diversity — do different tasks need different approaches?
+
+ ### Phase 3: RECOMMEND
+
+ Choose the optimal topology based on:
+ - **Task characteristics**: simple classification → single-call; multi-step reasoning → chain or react-loop; knowledge-intensive → rag; quality-critical → judge-critic
+ - **Current score**: if >0.9 and topology seems adequate, do NOT recommend changes
+ - **Stack constraints**: recommend patterns compatible with the detected stack (don't suggest LangGraph if the user uses raw urllib)
+ - **API availability**: check which API keys exist before recommending patterns that need specific providers
+ - **Code size**: don't recommend hierarchical for a 50-line harness
+
+ ### Phase 4: WRITE PLAN
+
+ Create two output files:
+
+ **`.harness-evolver/architecture.json`**:
+ ```json
+ {
+   "current_topology": "single-call",
+   "recommended_topology": "chain",
+   "confidence": "medium",
+   "reasoning": "The harness makes a single LLM call but tasks require multi-step reasoning (classify then validate). A chain topology could improve accuracy by adding a verification step.",
+   "migration_path": [
+     {
+       "step": 1,
+       "description": "Add a validation LLM call after classification to verify the category matches the symptoms",
+       "changes": "Add a second API call that takes the classification result and original input, asks 'Does category X match these symptoms? Reply yes/no.'",
+       "expected_impact": "Reduce false positives by ~15%"
+     },
+     {
+       "step": 2,
+       "description": "Add structured output parsing with fallback",
+       "changes": "Parse LLM response with regex, fall back to keyword matching if parse fails",
+       "expected_impact": "Eliminate malformed output errors"
+     }
+   ],
+   "signals_used": ["llm_call_count=1", "has_loop_around_llm=false", "code_lines=45"],
+   "risks": [
+     "Additional LLM call doubles latency and cost",
+     "Verification step may introduce its own errors"
+   ],
+   "alternative": {
+     "topology": "judge-critic",
+     "reason": "If chain doesn't improve scores, a judge-critic pattern where a second model evaluates the classification could catch more errors, but at higher cost"
+   }
+ }
+ ```
+
+ **`.harness-evolver/architecture.md`** — human-readable version:
+
+ ```markdown
+ # Architecture Analysis
+
+ ## Current Topology: single-call
+ [Description of what the harness currently does]
+
+ ## Recommended Topology: chain (confidence: medium)
+ [Reasoning]
+
+ ## Migration Path
+ 1. [Step 1 description]
+ 2. [Step 2 description]
+
+ ## Risks
+ - [Risk 1]
+ - [Risk 2]
+
+ ## Alternative
+ If the recommended topology doesn't improve scores: [alternative]
+ ```
+
+ ## Rules
+
+ 1. **Do NOT recommend changes if the current score is >0.9 and the topology seems adequate.** A working harness that scores well should not be restructured speculatively. Write architecture.json with `recommended_topology` equal to `current_topology` and confidence "high".
+
+ 2. **Always provide concrete migration steps, not just "switch to X".** Each step should describe exactly what code to add/change and what it should accomplish.
+
+ 3. **Consider the detected stack.** Don't recommend LangGraph patterns if the user is using raw urllib. Don't recommend LangChain if they use the Anthropic SDK directly. Match the style.
+
+ 4. **Consider API key availability.** If only ANTHROPIC_API_KEY is available, don't recommend a pattern that requires multiple providers. Check `config.json` → `api_keys`.
+
+ 5. **Migration should be incremental.** Each step in `migration_path` corresponds to one evolution iteration. The proposer will implement one step at a time. Steps should be independently valuable (each step should improve or at least not regress the score).
+
+ 6. **Rate confidence honestly:**
+    - `"high"` — strong signal match, clear improvement path, similar patterns known to work
+    - `"medium"` — reasonable hypothesis but task-specific factors could change the outcome
+    - `"low"` — speculative, insufficient data, or signals are ambiguous
+
+ 7. **Do NOT modify any harness code.** You only analyze and recommend. The proposer implements.
+
+ 8. **Do NOT modify files in `eval/` or `baseline/`.** These are immutable.
+
+ ## What You Do NOT Do
+
+ - Do NOT write or modify harness code — you produce analysis and recommendations only
+ - Do NOT run evaluations — the evolve skill handles that
+ - Do NOT modify `eval/`, `baseline/`, or any existing harness version
+ - Do NOT create files outside of `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`
@@ -92,6 +92,16 @@ Write a clear `proposal.md` that includes:
 
  Append a summary to `PROPOSER_HISTORY.md`.
 
+ ## Architecture Guidance (if available)
+
+ If `.harness-evolver/architecture.json` exists, read it in Phase 1 (ORIENT). The architect agent has recommended a target topology and migration path.
+
+ - Work TOWARD the recommended topology incrementally — one migration step per iteration
+ - Do NOT rewrite the entire harness in one iteration
+ - Document which migration step you are implementing in `proposal.md`
+ - If a migration step causes regression, note it and consider reverting or deviating
+ - If `architecture.json` does NOT exist, ignore this section and evolve freely
+
  ## Rules
 
  1. **Every change motivated by evidence.** Cite the task ID, trace line, or score delta that justifies the change. Never change code "to see what happens."
package/bin/install.js CHANGED
@@ -70,6 +70,15 @@ function checkPython() {
  }
  }
 
+ function checkCommand(cmd) {
+   try {
+     execSync(cmd, { stdio: "pipe" });
+     return true;
+   } catch {
+     return false;
+   }
+ }
+
  function installForRuntime(runtimeDir, scope) {
  const baseDir = scope === "local"
  ? path.join(process.cwd(), runtimeDir)
@@ -223,7 +232,90 @@ async function main() {
  fs.writeFileSync(versionPath, VERSION);
  console.log(` ${GREEN}✓${RESET} VERSION ${VERSION}`);
 
- console.log(`\n ${GREEN}Done!${RESET} Run ${BRIGHT_MAGENTA}/reload-plugins${RESET} in Claude Code, then ${BRIGHT_MAGENTA}/harness-evolver:init${RESET}`);
+ console.log(`\n ${GREEN}Done!${RESET} Restart Claude Code, then run ${BRIGHT_MAGENTA}/harness-evolver:init${RESET}\n`);
+
+ // Optional integrations
+ console.log(` ${YELLOW}Install optional integrations?${RESET}\n`);
+ console.log(` These enhance the proposer with rich traces and up-to-date documentation.\n`);
+
+ // LangSmith CLI
+ const hasLangsmithCli = checkCommand("langsmith-cli --version");
+ if (hasLangsmithCli) {
+   console.log(` ${GREEN}✓${RESET} langsmith-cli already installed`);
+ } else {
+   console.log(` ${BOLD}LangSmith CLI${RESET} — rich trace analysis (error rates, latency, token usage)`);
+   console.log(` ${DIM}uv tool install langsmith-cli && langsmith-cli auth login${RESET}`);
+   const lsAnswer = await ask(rl, `\n ${YELLOW}Install langsmith-cli? [y/N]:${RESET} `);
+   if (lsAnswer.trim().toLowerCase() === "y") {
+     console.log(`\n Installing langsmith-cli...`);
+     try {
+       execSync("uv tool install langsmith-cli", { stdio: "inherit" });
+       console.log(`\n ${GREEN}✓${RESET} langsmith-cli installed`);
+       console.log(` ${YELLOW}Run ${BOLD}langsmith-cli auth login${RESET}${YELLOW} to authenticate with your LangSmith API key.${RESET}\n`);
+     } catch {
+       console.log(`\n ${RED}Failed.${RESET} Install manually: uv tool install langsmith-cli\n`);
+     }
+   }
+ }
+
+ // Context7 MCP
+ const hasContext7 = (() => {
+   try {
+     for (const p of [path.join(HOME, ".claude", "settings.json"), path.join(HOME, ".claude.json")]) {
+       if (fs.existsSync(p)) {
+         const s = JSON.parse(fs.readFileSync(p, "utf8"));
+         if (s.mcpServers && (s.mcpServers.context7 || s.mcpServers.Context7)) return true;
+       }
+     }
+   } catch {}
+   return false;
+ })();
+ if (hasContext7) {
+   console.log(` ${GREEN}✓${RESET} Context7 MCP already configured`);
+ } else {
+   console.log(`\n ${BOLD}Context7 MCP${RESET} — up-to-date library documentation (LangChain, OpenAI, etc.)`);
+   console.log(` ${DIM}claude mcp add context7 -- npx -y @upstash/context7-mcp@latest${RESET}`);
+   const c7Answer = await ask(rl, `\n ${YELLOW}Install Context7 MCP? [y/N]:${RESET} `);
+   if (c7Answer.trim().toLowerCase() === "y") {
+     console.log(`\n Installing Context7 MCP...`);
+     try {
+       execSync("claude mcp add context7 -- npx -y @upstash/context7-mcp@latest", { stdio: "inherit" });
+       console.log(`\n ${GREEN}✓${RESET} Context7 MCP configured`);
+     } catch {
+       console.log(`\n ${RED}Failed.${RESET} Install manually: claude mcp add context7 -- npx -y @upstash/context7-mcp@latest\n`);
+     }
+   }
+ }
+
+ // LangChain Docs MCP
+ const hasLcDocs = (() => {
+   try {
+     for (const p of [path.join(HOME, ".claude", "settings.json"), path.join(HOME, ".claude.json")]) {
+       if (fs.existsSync(p)) {
+         const s = JSON.parse(fs.readFileSync(p, "utf8"));
+         if (s.mcpServers && (s.mcpServers["docs-langchain"] || s.mcpServers["LangChain Docs"])) return true;
+       }
+     }
+   } catch {}
+   return false;
+ })();
+ if (hasLcDocs) {
+   console.log(` ${GREEN}✓${RESET} LangChain Docs MCP already configured`);
+ } else {
+   console.log(`\n ${BOLD}LangChain Docs MCP${RESET} — LangChain/LangGraph/LangSmith documentation search`);
+   console.log(` ${DIM}claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp${RESET}`);
+   const lcAnswer = await ask(rl, `\n ${YELLOW}Install LangChain Docs MCP? [y/N]:${RESET} `);
+   if (lcAnswer.trim().toLowerCase() === "y") {
+     console.log(`\n Installing LangChain Docs MCP...`);
+     try {
+       execSync("claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp", { stdio: "inherit" });
+       console.log(`\n ${GREEN}✓${RESET} LangChain Docs MCP configured`);
+     } catch {
+       console.log(`\n ${RED}Failed.${RESET} Install manually: claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp\n`);
+     }
+   }
+ }
+
  console.log(`\n ${DIM}Quick start with example:${RESET}`);
  console.log(` cp -r ~/.harness-evolver/examples/classifier ./my-project`);
  console.log(` cd my-project && claude`);
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "0.8.0",
+ "version": "1.0.0",
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
  "author": "Raphael Valdetaro",
  "license": "MIT",
@@ -0,0 +1,108 @@
+ ---
+ name: harness-evolver:architect
+ description: "Use when the user wants to analyze harness architecture, get a topology recommendation, understand if their agent pattern is optimal, or after stagnation in the evolution loop."
+ argument-hint: "[--force]"
+ allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
+ ---
+
+ # /harness-evolver:architect
+
+ Analyze the current harness architecture and recommend the optimal multi-agent topology.
+
+ ## Prerequisites
+
+ `.harness-evolver/` must exist. If not, tell the user to run `harness-evolver:init` first.
+
+ ```bash
+ if [ ! -d ".harness-evolver" ]; then
+   echo "ERROR: .harness-evolver/ not found. Run /harness-evolver:init first."
+   exit 1
+ fi
+ ```
+
+ ## Resolve Tool Path
+
+ ```bash
+ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
+ ```
+
+ Use the `$TOOLS` prefix for all tool calls below.
+
+ ## Step 1: Run Architecture Analysis
+
+ Build the command based on what exists:
+
+ ```bash
+ CMD="python3 $TOOLS/analyze_architecture.py --harness .harness-evolver/baseline/harness.py"
+
+ # Add traces from best version if evolution has run
+ if [ -f ".harness-evolver/summary.json" ]; then
+   BEST=$(python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(s.get('best',{}).get('version',''))")
+   if [ -n "$BEST" ] && [ -d ".harness-evolver/harnesses/$BEST/traces" ]; then
+     CMD="$CMD --traces-dir .harness-evolver/harnesses/$BEST/traces"
+   fi
+   CMD="$CMD --summary .harness-evolver/summary.json"
+ fi
+
+ CMD="$CMD -o .harness-evolver/architecture_signals.json"
+
+ eval "$CMD"
+ ```
+
+ Check the exit code. If it fails, report the error and stop.
+
+ ## Step 2: Spawn Architect Agent
+
+ Spawn the `harness-evolver-architect` agent with:
+
+ > Analyze the harness and recommend the optimal multi-agent topology.
+ > Raw signals are at `.harness-evolver/architecture_signals.json`.
+ > Write `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`.
+
+ The architect agent will:
+ 1. Read the signals JSON
+ 2. Read the harness code and config
+ 3. Classify the current topology
+ 4. Assess if it matches task complexity
+ 5. Recommend the optimal topology with migration steps
+ 6. Write `architecture.json` and `architecture.md`
+
+ ## Step 3: Report
+
+ After the architect agent completes, read the outputs and print a summary:
+
+ ```
+ Architecture Analysis Complete
+ ==============================
+ Current topology: {current_topology}
+ Recommended topology: {recommended_topology}
+ Confidence: {confidence}
+
+ Reasoning: {reasoning}
+
+ Migration Path:
+ 1. {step 1 description}
+ 2. {step 2 description}
+ ...
+
+ Risks:
+ - {risk 1}
+ - {risk 2}
+
+ Next: Run /harness-evolver:evolve — the proposer will follow the migration path.
+ ```
+
+ If the architect recommends no change (current = recommended), report:
+
+ ```
+ Architecture Analysis Complete
+ ==============================
+ Current topology: {topology} — looks optimal for these tasks.
+ No architecture change recommended. Score: {score}
+
+ The proposer can continue evolving within the current topology.
+ ```
+
+ ## Arguments
+
+ - `--force` — re-run analysis even if `architecture.json` already exists. Without this flag, if `architecture.json` exists, just display the existing recommendation.
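Step 1 extracts the best version with an inline `python3 -c` one-liner; the same lookup as a standalone sketch, assuming the `{"best": {"version": ...}}` schema of `summary.json` used throughout this skill (the function name `best_version` is illustrative):

```python
import json

def best_version(summary_path):
    """Return the best version id recorded in summary.json, or "" if absent.

    Equivalent to the inline `python3 -c` one-liner in Step 1, but tolerant
    of a missing or malformed summary file instead of raising.
    """
    try:
        with open(summary_path) as f:
            summary = json.load(f)
    except (OSError, json.JSONDecodeError):
        return ""
    return summary.get("best", {}).get("version", "")
```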
@@ -92,3 +92,8 @@ Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best:
  - Improvement over baseline (absolute and %)
  - Total iterations run
  - Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."
+
+ If the loop stopped due to stagnation AND `.harness-evolver/architecture.json` does NOT exist:
+
+ > The proposer may have hit an architectural ceiling. Run `/harness-evolver:architect`
+ > to analyze whether a different agent topology could help.
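The stagnation condition referenced here is computed by `analyze_architecture.py` as the last three scores sitting within 0.01 of each other. A minimal standalone sketch of that heuristic (the function name is illustrative):

```python
def is_stagnating(scores, window=3, tolerance=0.01):
    """True if the last `window` scores sit within `tolerance` of each other.

    Mirrors the check in analyze_architecture.py; with fewer than `window`
    scores there is not enough history to call stagnation.
    """
    if len(scores) < window:
        return False
    recent = scores[-window:]
    return max(recent) - min(recent) <= tolerance
```

On the tool's 0-1 score scale, a 0.01 tolerance corresponds to the "within 1%" wording in the code comments.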
@@ -57,6 +57,21 @@ Add `--harness-config config.json` if a config exists.
  - Baseline score
  - Next: `harness-evolver:evolve` to start
 
+ ## Architecture Hint
+
+ After init completes, run a quick architecture analysis:
+
+ ```bash
+ python3 $TOOLS/analyze_architecture.py --harness .harness-evolver/baseline/harness.py
+ ```
+
+ If the analysis suggests the current topology may not be optimal for the task complexity, mention it:
+
+ > Architecture note: Current topology is "{topology}". For tasks with {characteristics},
+ > consider running `/harness-evolver:architect` for a detailed recommendation.
+
+ This is advisory only — do not spawn the architect agent.
+
 
  ## Gotchas
 
  - The harness must write valid JSON to `--output`. If the user's code returns non-JSON, the wrapper must serialize it.
@@ -0,0 +1,512 @@
+ #!/usr/bin/env python3
+ """Analyze harness architecture to detect current topology and produce signals.
+
+ Usage:
+     analyze_architecture.py --harness PATH [--traces-dir PATH] [--summary PATH] [-o output.json]
+
+ Performs AST-based analysis of harness code, optional trace analysis, and optional
+ score analysis to classify the current agent topology and produce structured signals
+ for the architect agent.
+
+ Stdlib-only. No external dependencies.
+ """
+
+ import argparse
+ import ast
+ import json
+ import os
+ import sys
+
+
+ # --- AST Analysis ---
+
+ LLM_API_DOMAINS = [
+     "api.anthropic.com",
+     "api.openai.com",
+     "generativelanguage.googleapis.com",
+ ]
+
+ LLM_SDK_MODULES = {"openai", "anthropic", "langchain_openai", "langchain_anthropic",
+                    "langchain_core", "langchain_community", "langchain"}
+
+ RETRIEVAL_MODULES = {"chromadb", "pinecone", "qdrant_client", "weaviate"}
+
+ RETRIEVAL_METHOD_NAMES = {"similarity_search", "query"}
+
+ GRAPH_FRAMEWORK_CLASSES = {"StateGraph"}
+ GRAPH_FRAMEWORK_METHODS = {"add_node", "add_edge"}
+
+ PARALLEL_PATTERNS = {"gather"}  # asyncio.gather
+ PARALLEL_CLASSES = {"ThreadPoolExecutor", "ProcessPoolExecutor"}
+
+ TOOL_DICT_KEYS = {"name", "description", "parameters"}
+
+
+ def _get_all_imports(tree):
+     """Extract all imported module root names."""
+     imports = set()
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Import):
+             for alias in node.names:
+                 imports.add(alias.name.split(".")[0])
+         elif isinstance(node, ast.ImportFrom):
+             if node.module:
+                 imports.add(node.module.split(".")[0])
+     return imports
+
+
+ def _get_all_import_modules(tree):
+     """Extract all imported module full names (including submodules)."""
+     modules = set()
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Import):
+             for alias in node.names:
+                 modules.add(alias.name)
+         elif isinstance(node, ast.ImportFrom):
+             if node.module:
+                 modules.add(node.module)
+     return modules
+
+
+ def _count_string_matches(tree, patterns):
+     """Count AST string constants that contain any of the given patterns."""
+     count = 0
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Constant) and isinstance(node.value, str):
+             for pattern in patterns:
+                 if pattern in node.value:
+                     count += 1
+                     break
+     return count
+
+
+ def _count_llm_calls(tree, imports, source_text):
+     """Count LLM API calls: urllib requests to known domains + SDK client calls."""
+     count = 0
+
+     # Count urllib.request calls with LLM API domains in string constants
+     count += _count_string_matches(tree, LLM_API_DOMAINS)
+
+     # Count SDK imports that imply LLM calls (each import of an LLM SDK = at least 1 call site)
+     full_modules = _get_all_import_modules(tree)
+     sdk_found = set()
+     for mod in full_modules:
+         root = mod.split(".")[0]
+         if root in LLM_SDK_MODULES:
+             sdk_found.add(root)
+
+     # For SDK users, look for actual call patterns like .create, .chat, .invoke, .run
+     llm_call_methods = {"create", "chat", "invoke", "run", "generate", "predict",
+                         "complete", "completions"}
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Call):
+             if isinstance(node.func, ast.Attribute):
+                 if node.func.attr in llm_call_methods and sdk_found:
+                     count += 1
+
+     # If we found SDK imports but no explicit call methods, count 1 per SDK
+     if sdk_found and count == 0:
+         count = len(sdk_found)
+
+     return max(count, _count_string_matches(tree, LLM_API_DOMAINS))
+
+
+ def _has_loop_around_llm(tree, source_text):
+     """Check if any LLM call is inside a loop (for/while)."""
+     for node in ast.walk(tree):
+         if isinstance(node, (ast.For, ast.While)):
+             # Walk the loop body looking for LLM call signals
+             for child in ast.walk(node):
+                 # Check for urllib.request.urlopen in a loop
+                 if isinstance(child, ast.Attribute) and child.attr == "urlopen":
+                     return True
+                 # Check for SDK call methods in a loop
+                 if isinstance(child, ast.Call) and isinstance(child.func, ast.Attribute):
+                     if child.func.attr in {"create", "chat", "invoke", "run",
+                                            "generate", "predict", "complete"}:
+                         return True
+                 # Check for LLM API domain strings in a loop
+                 if isinstance(child, ast.Constant) and isinstance(child.value, str):
+                     for domain in LLM_API_DOMAINS:
+                         if domain in child.value:
+                             return True
+     return False
+
+
+ def _has_tool_definitions(tree):
+     """Check for tool definitions: dicts with name/description/parameters keys, or @tool decorators."""
+     # Check for @tool decorator
+     for node in ast.walk(tree):
+         if isinstance(node, ast.FunctionDef):
+             for decorator in node.decorator_list:
+                 if isinstance(decorator, ast.Name) and decorator.id == "tool":
+                     return True
+                 if isinstance(decorator, ast.Attribute) and decorator.attr == "tool":
+                     return True
+
+     # Check for dicts with tool-like keys
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Dict):
+             keys = set()
+             for key in node.keys:
+                 if isinstance(key, ast.Constant) and isinstance(key.value, str):
+                     keys.add(key.value)
+             if TOOL_DICT_KEYS.issubset(keys):
+                 return True
+
+     return False
+
+
+ def _has_retrieval(tree, imports):
+     """Check for retrieval patterns: vector DB imports or .similarity_search/.query calls."""
+     if imports & RETRIEVAL_MODULES:
+         return True
+
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Attribute):
+             if node.attr in RETRIEVAL_METHOD_NAMES:
+                 return True
+
+     return False
+
+
+ def _has_graph_framework(tree, full_modules):
+     """Check for graph framework usage (LangGraph StateGraph, add_node, add_edge)."""
+     # Check if langgraph is imported
+     for mod in full_modules:
+         if "langgraph" in mod:
+             return True
+
+     # Check for StateGraph usage
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Name) and node.id in GRAPH_FRAMEWORK_CLASSES:
+             return True
+         if isinstance(node, ast.Attribute):
+             if node.attr in GRAPH_FRAMEWORK_CLASSES or node.attr in GRAPH_FRAMEWORK_METHODS:
+                 return True
+
+     return False
+
+
+ def _has_parallel_execution(tree, imports):
+     """Check for asyncio.gather, concurrent.futures, ThreadPoolExecutor."""
+     if "concurrent" in imports:
+         return True
+
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Attribute):
+             if node.attr == "gather":
+                 return True
+             if node.attr in PARALLEL_CLASSES:
+                 return True
+         if isinstance(node, ast.Name) and node.id in PARALLEL_CLASSES:
+             return True
+
+     return False
+
+
+ def _has_error_handling_around_llm(tree):
+     """Check if LLM calls are wrapped in try/except."""
+     for node in ast.walk(tree):
+         if isinstance(node, ast.Try):
+             # Walk the try body for LLM signals
+             for child in ast.walk(node):
+                 if isinstance(child, ast.Attribute) and child.attr == "urlopen":
+                     return True
+                 if isinstance(child, ast.Call) and isinstance(child.func, ast.Attribute):
+                     if child.func.attr in {"create", "chat", "invoke", "run",
+                                            "generate", "predict", "complete"}:
+                         return True
+                 if isinstance(child, ast.Constant) and isinstance(child.value, str):
+                     for domain in LLM_API_DOMAINS:
+                         if domain in child.value:
+                             return True
+     return False
+
+
+ def _count_functions(tree):
+     """Count function definitions (top-level and nested)."""
+     count = 0
+     for node in ast.walk(tree):
+         if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
+             count += 1
+     return count
+
+
+ def _count_classes(tree):
+     """Count class definitions."""
+     count = 0
+     for node in ast.walk(tree):
+         if isinstance(node, ast.ClassDef):
+             count += 1
+     return count
+
+
+ def _estimate_topology(signals):
+     """Classify the current topology based on code signals."""
+     if signals["has_graph_framework"]:
+         if signals["has_parallel_execution"]:
+             return "parallel"
+         return "hierarchical"
+
+     if signals["has_retrieval"]:
+         return "rag"
+
+     if signals["has_loop_around_llm"]:
+         return "react-loop"
+
+     if signals["llm_call_count"] >= 3:
+         if signals["has_tool_definitions"]:
+             return "react-loop"
+         return "chain"
+
+     if signals["llm_call_count"] == 2:
+         return "chain"
+
+     return "single-call"
+
+
+ def analyze_code(harness_path):
+     """Analyze a harness Python file and return code signals."""
+     with open(harness_path) as f:
+         source = f.read()
+
+     try:
+         tree = ast.parse(source)
+     except SyntaxError:
+         return {
+             "llm_call_count": 0,
+             "has_loop_around_llm": False,
+             "has_tool_definitions": False,
+             "has_retrieval": False,
+             "has_graph_framework": False,
+             "has_parallel_execution": False,
+             "has_error_handling": False,
+             "estimated_topology": "unknown",
+             "code_lines": len(source.splitlines()),
+             "function_count": 0,
+             "class_count": 0,
+         }
+
+     imports = _get_all_imports(tree)
+     full_modules = _get_all_import_modules(tree)
+
+     llm_call_count = _count_llm_calls(tree, imports, source)
+     has_loop = _has_loop_around_llm(tree, source)
+     has_tools = _has_tool_definitions(tree)
+     has_retrieval = _has_retrieval(tree, imports)
+     has_graph = _has_graph_framework(tree, full_modules)
+     has_parallel = _has_parallel_execution(tree, imports)
+     has_error = _has_error_handling_around_llm(tree)
+
+     signals = {
+         "llm_call_count": llm_call_count,
+         "has_loop_around_llm": has_loop,
+         "has_tool_definitions": has_tools,
+         "has_retrieval": has_retrieval,
+         "has_graph_framework": has_graph,
+         "has_parallel_execution": has_parallel,
+         "has_error_handling": has_error,
+         "code_lines": len(source.splitlines()),
+         "function_count": _count_functions(tree),
+         "class_count": _count_classes(tree),
+     }
+     signals["estimated_topology"] = _estimate_topology(signals)
+
+     return signals
+
+
+ # --- Trace Analysis ---
+
+ def analyze_traces(traces_dir):
+     """Analyze execution traces for error patterns, timing, and failures."""
+     if not os.path.isdir(traces_dir):
+         return None
+
+     result = {
+         "error_patterns": [],
+         "timing": None,
+         "task_failures": [],
+         "stderr_lines": 0,
+     }
+
+     # Read stderr.log
+     stderr_path = os.path.join(traces_dir, "stderr.log")
+     if os.path.isfile(stderr_path):
+         try:
+             with open(stderr_path) as f:
+                 stderr = f.read()
+             lines = stderr.strip().splitlines()
+             result["stderr_lines"] = len(lines)
+
+             # Detect common error patterns
+             error_counts = {}
+             for line in lines:
+                 for pattern in ["Traceback", "Error", "Exception", "Timeout",
+                                 "ConnectionRefused", "HTTPError", "JSONDecodeError",
+                                 "KeyError", "TypeError", "ValueError"]:
+                     if pattern in line:
+                         error_counts[pattern] = error_counts.get(pattern, 0) + 1
+
+             result["error_patterns"] = [
+                 {"pattern": p, "count": c}
+                 for p, c in sorted(error_counts.items(), key=lambda x: -x[1])
+             ]
+         except Exception:
+             pass
+
+     # Read timing.json
+     timing_path = os.path.join(traces_dir, "timing.json")
+     if os.path.isfile(timing_path):
+         try:
+             with open(timing_path) as f:
+                 timing = json.load(f)
+             result["timing"] = timing
+         except Exception:
+             pass
+
+     # Scan per-task output directories for failures
+     for entry in sorted(os.listdir(traces_dir)):
+         task_dir = os.path.join(traces_dir, entry)
+         if os.path.isdir(task_dir) and entry.startswith("task_"):
+             output_path = os.path.join(task_dir, "output.json")
+             if os.path.isfile(output_path):
+                 try:
+                     with open(output_path) as f:
+                         output = json.load(f)
+                     # Check for empty or error outputs
+                     out_value = output.get("output", "")
+                     if not out_value or out_value in ("error", "unknown", ""):
+                         result["task_failures"].append({
+                             "task": entry,
+                             "output": out_value,
+                         })
+                 except Exception:
+                     result["task_failures"].append({
+                         "task": entry,
+                         "output": "parse_error",
+                     })
+
+     return result
+
+
+ # --- Score Analysis ---
+
+ def analyze_scores(summary_path):
+     """Analyze summary.json for stagnation, oscillation, and per-task failures."""
+     if not os.path.isfile(summary_path):
+         return None
+
+     try:
+         with open(summary_path) as f:
+             summary = json.load(f)
+     except Exception:
+         return None
+
+     result = {
+         "iterations": summary.get("iterations", 0),
+         "best_score": 0.0,
+         "baseline_score": 0.0,
+         "recent_scores": [],
+         "is_stagnating": False,
+         "is_oscillating": False,
+         "score_trend": "unknown",
+     }
+
+     # Extract best score
+     best = summary.get("best", {})
+     result["best_score"] = best.get("combined_score", 0.0)
+     result["baseline_score"] = summary.get("baseline_score", 0.0)
+
+     # Extract recent version scores
+     versions = summary.get("versions", [])
+     if isinstance(versions, list):
+         recent = versions[-5:] if len(versions) > 5 else versions
+         result["recent_scores"] = [
+             {"version": v.get("version", "?"), "score": v.get("combined_score", 0.0)}
+             for v in recent
+         ]
+     elif isinstance(versions, dict):
+         items = sorted(versions.items())
+         recent = items[-5:] if len(items) > 5 else items
+         result["recent_scores"] = [
+             {"version": k, "score": v.get("combined_score", 0.0)}
+             for k, v in recent
+         ]
+
+     # Detect stagnation (last 3+ scores within 1% of each other)
+     scores = [s["score"] for s in result["recent_scores"]]
+     if len(scores) >= 3:
+         last_3 = scores[-3:]
+         spread = max(last_3) - min(last_3)
+         if spread <= 0.01:
+             result["is_stagnating"] = True
+
+     # Detect oscillation (alternating up/down for last 4+ scores)
+     if len(scores) >= 4:
+         deltas = [scores[i+1] - scores[i] for i in range(len(scores)-1)]
+         sign_changes = sum(
+             1 for i in range(len(deltas)-1)
+             if (deltas[i] > 0 and deltas[i+1] < 0) or (deltas[i] < 0 and deltas[i+1] > 0)
+         )
+         if sign_changes >= len(deltas) - 1:
+             result["is_oscillating"] = True
+
+     # Score trend
+     if len(scores) >= 2:
+         if scores[-1] > scores[0]:
+             result["score_trend"] = "improving"
+         elif scores[-1] < scores[0]:
+             result["score_trend"] = "declining"
+         else:
+             result["score_trend"] = "flat"
+
+     return result
+
+
+ # --- Main ---
+
+ def main():
+     parser = argparse.ArgumentParser(
+         description="Analyze harness architecture and produce signals for the architect agent",
+         usage="analyze_architecture.py --harness PATH [--traces-dir PATH] [--summary PATH] [-o output.json]",
+     )
+     parser.add_argument("--harness", required=True, help="Path to harness Python file")
+     parser.add_argument("--traces-dir", default=None, help="Path to traces directory")
+     parser.add_argument("--summary", default=None, help="Path to summary.json")
+     parser.add_argument("-o", "--output", default=None, help="Output JSON path")
+     args = parser.parse_args()
+
+     if not os.path.isfile(args.harness):
+         print(json.dumps({"error": f"Harness file not found: {args.harness}"}))
+         sys.exit(1)
+
+     result = {
+         "code_signals": analyze_code(args.harness),
+         "trace_signals": None,
+         "score_signals": None,
+     }
+
+     if args.traces_dir:
+         result["trace_signals"] = analyze_traces(args.traces_dir)
+
+     if args.summary:
+         result["score_signals"] = analyze_scores(args.summary)
+
+     output = json.dumps(result, indent=2)
+
+     if args.output:
+         with open(args.output, "w") as f:
+             f.write(output + "\n")
+     else:
+         print(output)
+
+
+ if __name__ == "__main__":
+     main()
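The precedence in `_estimate_topology` reads as a small decision table: graph framework beats retrieval beats looping beats raw call count. A standalone sketch of the same ordering, using `.get` so partial signal dicts are tolerated (that relaxation is ours; the tool itself always supplies every key):

```python
def estimate_topology(signals):
    """Classify an agent topology from code signals.

    Same precedence as _estimate_topology in analyze_architecture.py:
    graph framework > retrieval > LLM-call loop > LLM call count.
    """
    if signals.get("has_graph_framework"):
        return "parallel" if signals.get("has_parallel_execution") else "hierarchical"
    if signals.get("has_retrieval"):
        return "rag"
    if signals.get("has_loop_around_llm"):
        return "react-loop"
    calls = signals.get("llm_call_count", 0)
    if calls >= 3:
        return "react-loop" if signals.get("has_tool_definitions") else "chain"
    if calls == 2:
        return "chain"
    return "single-call"
```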
package/tools/init.py CHANGED
@@ -317,6 +317,29 @@ def main():
      print("\nRecommendation: install Context7 MCP for up-to-date documentation:")
      print(" claude mcp add context7 -- npx -y @upstash/context7-mcp@latest")
 
+     # Architecture analysis (quick, advisory)
+     analyze_py = os.path.join(tools, "analyze_architecture.py")
+     if os.path.exists(analyze_py):
+         try:
+             r = subprocess.run(
+                 ["python3", analyze_py, "--harness", args.harness],
+                 capture_output=True, text=True, timeout=30,
+             )
+             if r.returncode == 0 and r.stdout.strip():
+                 arch_signals = json.loads(r.stdout)
+                 config["architecture"] = {
+                     "current_topology": arch_signals.get("code_signals", {}).get("estimated_topology", "unknown"),
+                     "auto_analyzed": True,
+                 }
+                 # Re-write config with architecture
+                 with open(os.path.join(base, "config.json"), "w") as f:
+                     json.dump(config, f, indent=2)
+                 topo = config["architecture"]["current_topology"]
+                 if topo != "unknown":
+                     print(f"Architecture: {topo}")
+         except Exception:
+             pass
+
      # 5. Validate baseline harness
      print("Validating baseline harness...")
      val_args = ["python3", evaluate_py, "validate",