harness-evolver 0.8.0 → 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +123 -128
- package/agents/harness-evolver-architect.md +147 -0
- package/agents/harness-evolver-proposer.md +10 -0
- package/bin/install.js +93 -1
- package/package.json +1 -1
- package/skills/architect/SKILL.md +108 -0
- package/skills/evolve/SKILL.md +5 -0
- package/skills/init/SKILL.md +15 -0
- package/tools/analyze_architecture.py +512 -0
- package/tools/init.py +23 -0
package/README.md CHANGED

````diff
@@ -4,46 +4,102 @@ End-to-end optimization of LLM agent harnesses, inspired by [Meta-Harness](https
 
 **The harness is the 80% factor.** Changing just the scaffolding around a fixed LLM can produce a [6x performance gap](https://arxiv.org/abs/2603.28052) on the same benchmark. Harness Evolver automates the search for better harnesses using an autonomous propose-evaluate-iterate loop with full execution traces as feedback.
 
-##
+## Install
 
-
+```bash
+npx harness-evolver@latest
+```
 
-
+Select your runtime (Claude Code, Cursor, Codex, Windsurf) and scope (global/local). Then **restart your AI coding agent** for the skills to appear.
 
-##
+## Prerequisites
+
+### API Keys (set in your shell before launching Claude Code)
+
+The harness you're evolving may call LLM APIs. Set the keys your harness needs:
 
 ```bash
-#
-
+# Required: at least one LLM provider
+export ANTHROPIC_API_KEY="sk-ant-..."    # For Claude-based harnesses
+export OPENAI_API_KEY="sk-..."           # For OpenAI-based harnesses
+export GEMINI_API_KEY="AIza..."          # For Gemini-based harnesses
+export OPENROUTER_API_KEY="sk-or-..."    # For OpenRouter (multi-model)
+
+# Optional: enhanced tracing
+export LANGSMITH_API_KEY="lsv2_pt_..."   # Auto-enables LangSmith tracing
+```
+
+The plugin auto-detects which keys are available during `/harness-evolver:init` and shows them. The proposer agent knows which APIs are available and uses them accordingly.
+
+**No API key needed for the example** — the classifier example uses keyword matching (mock mode), no LLM calls.
+
+### Optional: Enhanced Integrations
+
+```bash
+# LangSmith — rich trace analysis for the proposer
+uv tool install langsmith-cli && langsmith-cli auth login
+
+# Context7 — up-to-date library documentation for the proposer
+claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
 
-#
-
+# LangChain Docs — LangChain/LangGraph-specific documentation
+claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp
 ```
 
 ## Quick Start
 
+### Try the Example (no API key needed)
+
 ```bash
-# 1. Copy the example
+# 1. Copy the example
 cp -r ~/.harness-evolver/examples/classifier ./my-classifier
 cd my-classifier
 
-# 2.
-
+# 2. Open Claude Code
+claude
+
+# 3. Initialize — auto-detects harness.py, eval.py, tasks/
+/harness-evolver:init
+
+# 4. Run the evolution loop
+/harness-evolver:evolve --iterations 3
 
-#
-/harness-
+# 5. Check progress
+/harness-evolver:status
+```
+
+### Use with Your Own Project
+
+```bash
+cd my-llm-project
+claude
 
-#
-
+# Init scans your project, identifies the entry point,
+# and helps create harness wrapper + eval + tasks if missing
+/harness-evolver:init
+
+# Run optimization
+/harness-evolver:evolve --iterations 10
 ```
 
-The
+The init skill adapts to your project — if you have `graph.py` instead of `harness.py`, it creates a thin wrapper. If you don't have an eval script, it helps you write one.
+
+## Available Commands
+
+| Command | What it does |
+|---|---|
+| `/harness-evolver:init` | Scan project, create harness/eval/tasks, run baseline |
+| `/harness-evolver:evolve` | Run the autonomous optimization loop |
+| `/harness-evolver:status` | Show progress (scores, iterations, stagnation) |
+| `/harness-evolver:compare` | Diff two versions with per-task analysis |
+| `/harness-evolver:diagnose` | Deep trace analysis of a specific version |
+| `/harness-evolver:deploy` | Copy the best harness back to your project |
 
 ## How It Works
 
 ```
 ┌─────────────────────────────┐
-│
+│  /harness-evolver:evolve    │
 │    (orchestrator skill)     │
 └──────────┬──────────────────┘
            │
````
````diff
@@ -63,10 +119,10 @@ The classifier example runs in mock mode (no API key needed) and demonstrates th
 scores.json
 ```
 
-1. **Propose** — A proposer agent
-2. **Evaluate** — The harness runs against every task. Traces are captured per-task (input, output, stdout, stderr, timing).
+1. **Propose** — A proposer agent reads all prior candidates' code, execution traces, and scores. Diagnoses failure modes via counterfactual analysis and writes a new harness.
+2. **Evaluate** — The harness runs against every task. Traces are captured per-task (input, output, stdout, stderr, timing). Your eval script scores the results.
 3. **Update** — State files are updated with the new score, parent lineage, and regression detection.
-4. **Repeat** —
+4. **Repeat** — Until N iterations, stagnation (3 rounds without >1% improvement), or target score reached.
 
 ## The Harness Contract
 
````
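The stopping rule in step 4 can be sketched in plain Python. This is a minimal illustration, not the shipped orchestration: `propose` and `evaluate` stand in for the skill-driven steps, and the stagnation check mirrors the stated "3 rounds without >1% improvement" rule.

```python
from typing import Callable, List, Optional

def is_stagnant(scores: List[float], rounds: int = 3, min_gain: float = 0.01) -> bool:
    """True when the last `rounds` candidates each failed to beat the prior best by >min_gain (relative)."""
    if len(scores) <= rounds:
        return False
    for i in range(len(scores) - rounds, len(scores)):
        best_before = max(scores[:i])
        if scores[i] > best_before * (1 + min_gain):
            return False  # a recent candidate improved on the best so far
    return True

def evolve(propose: Callable, evaluate: Callable,
           max_iters: int = 10, target: Optional[float] = None) -> List[float]:
    """Run the propose -> evaluate loop until max_iters, target score, or stagnation."""
    scores: List[float] = []
    for _ in range(max_iters):
        candidate = propose(scores)          # step 1: write a new harness
        scores.append(evaluate(candidate))   # steps 2-3: run tasks, record score
        if target is not None and scores[-1] >= target:
            break                            # target score reached
        if is_stagnant(scores):
            break                            # 3 rounds without >1% improvement
    return scores
```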
````diff
@@ -78,8 +134,8 @@ python3 harness.py --input task.json --output result.json [--traces-dir DIR] [--
 
 - `--input`: JSON with `{id, input, metadata}` (never sees expected answers)
 - `--output`: JSON with `{id, output}`
-- `--traces-dir`: optional directory for
-- `--config`: optional JSON with evolvable parameters
+- `--traces-dir`: optional directory for rich traces
+- `--config`: optional JSON with evolvable parameters
 
 The eval script is also any executable:
 
````
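A minimal Python harness satisfying this contract might look like the following sketch. The keyword rule and category names are illustrative stand-ins, not the shipped classifier example; a real harness would call an LLM in `classify`.

```python
import argparse
import json
from pathlib import Path

def classify(text: str) -> str:
    # Stand-in logic; a real harness would call an LLM provider here.
    return "urgent" if "pain" in text.lower() else "routine"

def main(argv=None) -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True)    # JSON with {id, input, metadata}
    parser.add_argument("--output", required=True)   # JSON with {id, output}
    parser.add_argument("--traces-dir")              # optional: rich traces
    parser.add_argument("--config")                  # optional: evolvable parameters
    args = parser.parse_args(argv)

    task = json.loads(Path(args.input).read_text())  # never contains expected answers
    result = {"id": task["id"], "output": classify(task["input"])}
    Path(args.output).write_text(json.dumps(result))
```

Add an `if __name__ == "__main__": main()` guard to make it the executable the evaluator invokes.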
````diff
@@ -87,165 +143,104 @@ The eval script is also any executable:
 python3 eval.py --results-dir results/ --tasks-dir tasks/ --scores scores.json
 ```
 
-
+Works with **any language, any framework, any domain**.
 
-## Project Structure
+## Project Structure (after init)
 
 ```
-.harness-evolver/           # Created
-├── config.json             # Project config (harness cmd, eval
+.harness-evolver/           # Created by /harness-evolver:init
+├── config.json             # Project config (harness cmd, eval, API keys detected)
 ├── summary.json            # Source of truth (versions, scores, parents)
-├── STATE.md                # Human-readable status
+├── STATE.md                # Human-readable status
 ├── PROPOSER_HISTORY.md     # Log of all proposals and outcomes
-├── baseline/               # Original harness (read-only
-│
-│   └── config.json
+├── baseline/               # Original harness (read-only)
+│   └── harness.py
 ├── eval/
-│   ├── eval.py             #
-│   └── tasks/              # Test cases
+│   ├── eval.py             # Your scoring script
+│   └── tasks/              # Test cases
 └── harnesses/
     └── v001/
-        ├── harness.py      #
-        ├──
-        ├──
-        ├── scores.json     # Evaluation results
+        ├── harness.py      # Evolved candidate
+        ├── proposal.md     # Why this version was created
+        ├── scores.json     # How it scored
         └── traces/         # Full execution traces
             ├── stdout.log
             ├── stderr.log
             ├── timing.json
             └── task_001/
-                ├── input.json
-                └── output.json
+                ├── input.json
+                └── output.json
 ```
 
-##
+## The Proposer
 
-
+The core of the system. 4-phase workflow from the Meta-Harness paper:
 
-
-
-
-
-
+| Phase | What it does |
+|---|---|
+| **Orient** | Read `summary.json` + `PROPOSER_HISTORY.md`. Pick 2-3 versions to investigate. |
+| **Diagnose** | Deep trace analysis. grep for errors, diff versions, counterfactual diagnosis. |
+| **Propose** | Write new harness. Prefer additive changes after regressions. |
+| **Document** | Write `proposal.md` with evidence. Update history. |
 
-
-|---|---|---|
-| **Skills** | `skills/harness-evolve-init/`, `skills/harness-evolve/`, `skills/harness-evolve-status/` | Slash commands that orchestrate the loop |
-| **Agent** | `agents/harness-evolver-proposer.md` | The proposer — 4-phase workflow (orient, diagnose, propose, document) with 6 rules |
-| **Tools** | `tools/evaluate.py`, `tools/state.py`, `tools/init.py`, `tools/detect_stack.py`, `tools/trace_logger.py` | CLI tools called via subprocess — zero LLM tokens spent on deterministic work |
-| **Installer** | `bin/install.js`, `package.json` | Copies skills/agents/tools to the right locations |
-| **Example** | `examples/classifier/` | 10-task medical classifier with mock mode |
+**7 rules:** evidence-based changes, conservative after regression, don't repeat mistakes, one hypothesis at a time, maintain interface, prefer readability, use available API keys from environment.
 
 ## Integrations
 
-### LangSmith (optional)
-
-If `LANGSMITH_API_KEY` is set, the plugin automatically:
-- Enables `LANGCHAIN_TRACING_V2` for auto-tracing of LangChain/LangGraph harnesses
-- Detects [langsmith-cli](https://github.com/gigaverse-app/langsmith-cli) for the proposer to query traces directly
+### LangSmith (optional, recommended for LangChain/LangGraph harnesses)
 
 ```bash
-# Setup
 export LANGSMITH_API_KEY=lsv2_...
 uv tool install langsmith-cli && langsmith-cli auth login
-
-# The proposer can then do:
-langsmith-cli --json runs list --project harness-evolver-v003 --failed --fields id,name,error
-langsmith-cli --json runs stats --project harness-evolver-v003
 ```
 
-
-
-
-
-The plugin detects the harness's technology stack via AST analysis (17 libraries supported) and instructs the proposer to consult current documentation before proposing API changes.
+When detected, the plugin:
+- Sets `LANGCHAIN_TRACING_V2=true` automatically — all LLM calls are traced
+- The proposer queries traces directly via `langsmith-cli`:
 
 ```bash
-
-
-
-# The proposer automatically:
-# 1. Reads config.json → stack.detected (e.g., LangChain, ChromaDB)
-# 2. Queries Context7 for current docs before writing code
-# 3. Annotates proposal.md with "API verified via Context7"
+langsmith-cli --json runs list --project harness-evolver-v003 --failed --fields id,name,error
+langsmith-cli --json runs stats --project harness-evolver-v003
 ```
 
-
-
-### LangChain Docs MCP (optional)
+### Context7 (optional, recommended for any library-heavy harness)
 
 ```bash
-claude mcp add
+claude mcp add context7 -- npx -y @upstash/context7-mcp@latest
 ```
 
-
-
-## The Proposer
-
-The proposer agent is the core of the system. It follows a 4-phase workflow derived from the Meta-Harness paper:
-
-| Phase | Context % | What it does |
-|---|---|---|
-| **Orient** | ~6% | Read `summary.json` and `PROPOSER_HISTORY.md`. Decide which 2-3 versions to investigate. |
-| **Diagnose** | ~80% | Deep trace analysis on selected versions. grep for errors, diff between good/bad versions, counterfactual diagnosis. |
-| **Propose** | ~10% | Write new `harness.py` + `config.json`. Prefer additive changes after regressions. |
-| **Document** | ~4% | Write `proposal.md` with evidence. Append to `PROPOSER_HISTORY.md`. |
-
-**6 rules:**
-1. Every change motivated by evidence (cite task ID, trace line, or score delta)
-2. After regression, prefer additive changes
-3. Don't repeat past mistakes (read PROPOSER_HISTORY.md)
-4. One hypothesis at a time when possible
-5. Maintain the CLI interface
-6. Prefer readable harnesses over defensive ones
-
-## Supported Libraries (Stack Detection)
-
-The AST-based stack detector recognizes 17 libraries:
-
-| Category | Libraries |
-|---|---|
-| **AI Frameworks** | LangChain, LangGraph, LlamaIndex, OpenAI, Anthropic, DSPy, CrewAI, AutoGen |
-| **Vector Stores** | ChromaDB, Pinecone, Qdrant, Weaviate |
-| **Web** | FastAPI, Flask, Pydantic |
-| **Data** | Pandas, NumPy |
+The plugin detects your stack via AST analysis (17 libraries: LangChain, LangGraph, OpenAI, Anthropic, ChromaDB, FastAPI, etc.) and instructs the proposer to consult current docs before proposing API changes.
 
 ## Development
 
 ```bash
-# Run all tests (41 tests, stdlib-only
+# Run all tests (41 tests, stdlib-only)
 python3 -m unittest discover -s tests -v
 
-# Test
-
-python3 harness.py --input tasks/task_001.json --output /tmp/result.json --config config.json
-cat /tmp/result.json
+# Test example manually
+python3 examples/classifier/harness.py --input examples/classifier/tasks/task_001.json --output /tmp/result.json --config examples/classifier/config.json
 
-#
+# Install locally for development
 node bin/install.js
 ```
 
-## Comparison
+## Comparison
 
-| | Meta-Harness
+| | Meta-Harness | A-Evolve | ECC | **Harness Evolver** |
 |---|---|---|---|---|
 | **Format** | Paper artifact | Framework (Docker) | Plugin (passive) | **Plugin (active)** |
-| **Search
-| **Context/iter** | 10M tokens | Variable | N/A | **Full filesystem** |
+| **Search** | Code-space | Code-space | Prompt-space | **Code-space** |
 | **Domain** | TerminalBench-2 | Coding benchmarks | Dev workflow | **Any domain** |
-| **Install** | Manual Python | Docker CLI | `/plugin install` | **`npx
-| **LangSmith** | No | No | No | **Yes
-| **Context7** | No | No | No | **Yes
+| **Install** | Manual Python | Docker CLI | `/plugin install` | **`npx`** |
+| **LangSmith** | No | No | No | **Yes** |
+| **Context7** | No | No | No | **Yes** |
 
 ## References
 
-- [Meta-Harness
-- [GSD (Get Shit Done)](https://github.com/gsd-build/get-shit-done) — CLI architecture inspiration
-- [LangSmith CLI](https://github.com/gigaverse-app/langsmith-cli) — Trace analysis for the proposer
-- [Context7](https://github.com/upstash/context7) — Documentation lookup via MCP
+- [Meta-Harness paper (arxiv 2603.28052)](https://arxiv.org/abs/2603.28052) — Lee et al., 2026
 - [Design Spec](docs/specs/2026-03-31-harness-evolver-design.md)
-- [LangSmith Integration
-- [Context7 Integration
+- [LangSmith Integration](docs/specs/2026-03-31-langsmith-integration.md)
+- [Context7 Integration](docs/specs/2026-03-31-context7-integration.md)
 
 ## License
 
````
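Any executable honoring the `eval.py` CLI in the contract above works. A minimal Python sketch follows; the `expected` field on tasks is an assumed convention for this illustration (the documented task schema only guarantees `{id, input, metadata}`), and the exact-match rule is a placeholder for whatever scoring your domain needs.

```python
import argparse
import json
from pathlib import Path

def score(results_dir: Path, tasks_dir: Path) -> dict:
    """Exact-match scoring: 1.0 for each task whose output equals the expected answer."""
    per_task = {}
    for task_path in sorted(Path(tasks_dir).glob("*.json")):
        task = json.loads(task_path.read_text())
        result_path = Path(results_dir) / task_path.name
        if not result_path.exists():
            per_task[task["id"]] = 0.0  # a missing result counts as a failure
            continue
        result = json.loads(result_path.read_text())
        per_task[task["id"]] = 1.0 if result.get("output") == task.get("expected") else 0.0
    mean = sum(per_task.values()) / len(per_task) if per_task else 0.0
    return {"mean": mean, "per_task": per_task}

def main(argv=None) -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--results-dir", required=True)
    parser.add_argument("--tasks-dir", required=True)
    parser.add_argument("--scores", required=True)
    args = parser.parse_args(argv)
    scores = score(Path(args.results_dir), Path(args.tasks_dir))
    Path(args.scores).write_text(json.dumps(scores))
```

As with the harness, wire `main()` under an `if __name__ == "__main__":` guard to expose it as a script.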
package/agents/harness-evolver-architect.md ADDED

````diff
@@ -0,0 +1,147 @@
+---
+name: harness-evolver-architect
+description: |
+  Use this agent when the harness-evolver:architect skill needs to analyze a harness
+  and recommend the optimal multi-agent topology. Reads code analysis signals, traces,
+  and scores to produce a migration plan from current to recommended architecture.
+model: opus
+---
+
+# Harness Evolver — Architect Agent
+
+You are the architect in a Meta-Harness optimization system. Your job is to analyze a harness's current agent topology, assess whether it matches the task complexity, and recommend the optimal topology with a concrete migration plan.
+
+## Context
+
+You work inside a `.harness-evolver/` directory. The skill has already run `analyze_architecture.py` to produce raw signals. You will read those signals, the harness code, and any evolution history to produce your recommendation.
+
+## Your Workflow
+
+### Phase 1: READ SIGNALS
+
+1. Read the raw signals JSON output from `analyze_architecture.py` (path provided in your prompt).
+2. Read the harness code:
+   - `.harness-evolver/baseline/harness.py` (always exists)
+   - The current best candidate from `summary.json` → `.harness-evolver/harnesses/{best}/harness.py` (if evolution has run)
+3. Read `config.json` for:
+   - `stack.detected` — what libraries/frameworks are in use
+   - `api_keys` — which LLM APIs are available
+   - `eval.langsmith` — whether tracing is enabled
+4. Read `summary.json` and `PROPOSER_HISTORY.md` if they exist (to understand evolution progress).
+
+### Phase 2: CLASSIFY & ASSESS
+
+Classify the current topology from the code signals. The `estimated_topology` field is a starting point, but verify it by reading the actual code. Possible topologies:
+
+| Topology | Description | Signals |
+|---|---|---|
+| `single-call` | One LLM call, no iteration | llm_calls=1, no loops, no tools |
+| `chain` | Sequential LLM calls (analyze→generate→validate) | llm_calls>=2, no loops |
+| `react-loop` | Tool use with iterative refinement | loop around LLM, tool definitions |
+| `rag` | Retrieval-augmented generation | retrieval imports/methods |
+| `judge-critic` | Generate then critique/verify | llm_calls>=2, one acts as judge |
+| `hierarchical` | Decompose task, delegate to sub-agents | graph framework, multiple distinct agents |
+| `parallel` | Same operation on multiple inputs concurrently | asyncio.gather, ThreadPoolExecutor |
+| `sequential-routing` | Route different task types to different paths | conditional branching on task type |
+
+Assess whether the current topology matches the task complexity:
+- Read the eval tasks to understand what the harness needs to do
+- Consider the current score — is there room for improvement?
+- Consider the task diversity — do different tasks need different approaches?
+
+### Phase 3: RECOMMEND
+
+Choose the optimal topology based on:
+- **Task characteristics**: simple classification → single-call; multi-step reasoning → chain or react-loop; knowledge-intensive → rag; quality-critical → judge-critic
+- **Current score**: if >0.9 and topology seems adequate, do NOT recommend changes
+- **Stack constraints**: recommend patterns compatible with the detected stack (don't suggest LangGraph if user uses raw urllib)
+- **API availability**: check which API keys exist before recommending patterns that need specific providers
+- **Code size**: don't recommend hierarchical for a 50-line harness
+
+### Phase 4: WRITE PLAN
+
+Create two output files:
+
+**`.harness-evolver/architecture.json`**:
+```json
+{
+  "current_topology": "single-call",
+  "recommended_topology": "chain",
+  "confidence": "medium",
+  "reasoning": "The harness makes a single LLM call but tasks require multi-step reasoning (classify then validate). A chain topology could improve accuracy by adding a verification step.",
+  "migration_path": [
+    {
+      "step": 1,
+      "description": "Add a validation LLM call after classification to verify the category matches the symptoms",
+      "changes": "Add a second API call that takes the classification result and original input, asks 'Does category X match these symptoms? Reply yes/no.'",
+      "expected_impact": "Reduce false positives by ~15%"
+    },
+    {
+      "step": 2,
+      "description": "Add structured output parsing with fallback",
+      "changes": "Parse LLM response with regex, fall back to keyword matching if parse fails",
+      "expected_impact": "Eliminate malformed output errors"
+    }
+  ],
+  "signals_used": ["llm_call_count=1", "has_loop_around_llm=false", "code_lines=45"],
+  "risks": [
+    "Additional LLM call doubles latency and cost",
+    "Verification step may introduce its own errors"
+  ],
+  "alternative": {
+    "topology": "judge-critic",
+    "reason": "If chain doesn't improve scores, a judge-critic pattern where a second model evaluates the classification could catch more errors, but at higher cost"
+  }
+}
+```
+
+**`.harness-evolver/architecture.md`** — human-readable version:
+
+```markdown
+# Architecture Analysis
+
+## Current Topology: single-call
+[Description of what the harness currently does]
+
+## Recommended Topology: chain (confidence: medium)
+[Reasoning]
+
+## Migration Path
+1. [Step 1 description]
+2. [Step 2 description]
+
+## Risks
+- [Risk 1]
+- [Risk 2]
+
+## Alternative
+If the recommended topology doesn't improve scores: [alternative]
+```
+
+## Rules
+
+1. **Do NOT recommend changes if current score >0.9 and topology seems adequate.** A working harness that scores well should not be restructured speculatively. Write architecture.json with `recommended_topology` equal to `current_topology` and confidence "high".
+
+2. **Always provide concrete migration steps, not just "switch to X".** Each step should describe exactly what code to add/change and what it should accomplish.
+
+3. **Consider the detected stack.** Don't recommend LangGraph patterns if the user is using raw urllib. Don't recommend LangChain if they use the Anthropic SDK directly. Match the style.
+
+4. **Consider API key availability.** If only ANTHROPIC_API_KEY is available, don't recommend a pattern that requires multiple providers. Check `config.json` → `api_keys`.
+
+5. **Migration should be incremental.** Each step in `migration_path` corresponds to one evolution iteration. The proposer will implement one step at a time. Steps should be independently valuable (each step should improve or at least not regress the score).
+
+6. **Rate confidence honestly:**
+   - `"high"` — strong signal match, clear improvement path, similar patterns known to work
+   - `"medium"` — reasonable hypothesis but task-specific factors could change the outcome
+   - `"low"` — speculative, insufficient data, or signals are ambiguous
+
+7. **Do NOT modify any harness code.** You only analyze and recommend. The proposer implements.
+
+8. **Do NOT modify files in `eval/` or `baseline/`.** These are immutable.
+
+## What You Do NOT Do
+
+- Do NOT write or modify harness code — you produce analysis and recommendations only
+- Do NOT run evaluations — the evolve skill handles that
+- Do NOT modify `eval/`, `baseline/`, or any existing harness version
+- Do NOT create files outside of `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`
````
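Because `architecture.json` is plain JSON with the schema shown above, downstream tooling can consume it directly. A sketch of how the proposer side might pick the next migration step (hypothetical helper; field names taken from the example schema):

```python
import json
from pathlib import Path
from typing import Optional

def next_migration_step(arch_path: Path, completed_steps: int) -> Optional[dict]:
    """Return the next migration step to implement, or None when there is no plan left."""
    if not arch_path.exists():
        return None  # no architecture guidance: evolve freely
    arch = json.loads(arch_path.read_text())
    if arch["recommended_topology"] == arch["current_topology"]:
        return None  # architect judged the current topology adequate
    steps = arch.get("migration_path", [])
    return steps[completed_steps] if completed_steps < len(steps) else None
```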
package/agents/harness-evolver-proposer.md CHANGED

````diff
@@ -92,6 +92,16 @@ Write a clear `proposal.md` that includes:
 
 Append a summary to `PROPOSER_HISTORY.md`.
 
+## Architecture Guidance (if available)
+
+If `.harness-evolver/architecture.json` exists, read it in Phase 1 (ORIENT). The architect agent has recommended a target topology and migration path.
+
+- Work TOWARD the recommended topology incrementally — one migration step per iteration
+- Do NOT rewrite the entire harness in one iteration
+- Document which migration step you are implementing in `proposal.md`
+- If a migration step causes regression, note it and consider reverting or deviating
+- If `architecture.json` does NOT exist, ignore this section and evolve freely
+
 ## Rules
 
 1. **Every change motivated by evidence.** Cite the task ID, trace line, or score delta that justifies the change. Never change code "to see what happens."
````
package/bin/install.js CHANGED

````diff
@@ -70,6 +70,15 @@ function checkPython() {
   }
 }
 
+function checkCommand(cmd) {
+  try {
+    execSync(cmd, { stdio: "pipe" });
+    return true;
+  } catch {
+    return false;
+  }
+}
+
 function installForRuntime(runtimeDir, scope) {
   const baseDir = scope === "local"
     ? path.join(process.cwd(), runtimeDir)
````
@@ -223,7 +232,90 @@ async function main() {
|
|
|
223
232
|
fs.writeFileSync(versionPath, VERSION);
|
|
224
233
|
console.log(` ${GREEN}✓${RESET} VERSION ${VERSION}`);
|
|
225
234
|
|
|
226
|
-
console.log(`\n ${GREEN}Done!${RESET}
|
|
235
|
+
console.log(`\n ${GREEN}Done!${RESET} Restart Claude Code, then run ${BRIGHT_MAGENTA}/harness-evolver:init${RESET}\n`);
|
|
236
|
+
|
|
237
|
+
// Optional integrations
|
|
238
|
+
console.log(` ${YELLOW}Install optional integrations?${RESET}\n`);
|
|
239
|
+
console.log(` These enhance the proposer with rich traces and up-to-date documentation.\n`);
|
|
240
|
+
|
|
241
|
+
// LangSmith CLI
|
|
242
|
+
const hasLangsmithCli = checkCommand("langsmith-cli --version");
|
|
243
|
+
if (hasLangsmithCli) {
|
|
244
|
+
console.log(` ${GREEN}✓${RESET} langsmith-cli already installed`);
|
|
245
|
+
} else {
|
|
246
|
+
console.log(` ${BOLD}LangSmith CLI${RESET} — rich trace analysis (error rates, latency, token usage)`);
|
|
247
|
+
console.log(` ${DIM}uv tool install langsmith-cli && langsmith-cli auth login${RESET}`);
|
|
248
|
+
const lsAnswer = await ask(rl, `\n ${YELLOW}Install langsmith-cli? [y/N]:${RESET} `);
|
|
249
|
+
if (lsAnswer.trim().toLowerCase() === "y") {
|
|
250
|
+
console.log(`\n Installing langsmith-cli...`);
|
|
251
|
+
try {
|
|
252
|
+
execSync("uv tool install langsmith-cli", { stdio: "inherit" });
|
|
253
|
+
console.log(`\n ${GREEN}✓${RESET} langsmith-cli installed`);
|
|
254
|
+
console.log(` ${YELLOW}Run ${BOLD}langsmith-cli auth login${RESET}${YELLOW} to authenticate with your LangSmith API key.${RESET}\n`);
|
|
255
|
+
} catch {
|
|
256
|
+
console.log(`\n ${RED}Failed.${RESET} Install manually: uv tool install langsmith-cli\n`);
|
|
257
|
+
}
|
|
258
|
+
}
|
|
259
|
+
}
|
|
260
|
+
|
|
261
|
+
// Context7 MCP
|
|
262
|
+
const hasContext7 = (() => {
|
|
263
|
+
try {
|
|
264
|
+
for (const p of [path.join(HOME, ".claude", "settings.json"), path.join(HOME, ".claude.json")]) {
|
|
265
|
+
if (fs.existsSync(p)) {
|
|
266
|
+
const s = JSON.parse(fs.readFileSync(p, "utf8"));
|
|
267
|
+
if (s.mcpServers && (s.mcpServers.context7 || s.mcpServers.Context7)) return true;
|
|
268
|
+
}
|
|
269
|
+
}
|
|
270
|
+
} catch {}
|
|
271
|
+
return false;
|
|
272
|
+
})();
|
|
273
|
+
if (hasContext7) {
|
|
274
|
+
      console.log(`  ${GREEN}✓${RESET} Context7 MCP already configured`);
    } else {
      console.log(`\n  ${BOLD}Context7 MCP${RESET} — up-to-date library documentation (LangChain, OpenAI, etc.)`);
      console.log(`  ${DIM}claude mcp add context7 -- npx -y @upstash/context7-mcp@latest${RESET}`);
      const c7Answer = await ask(rl, `\n  ${YELLOW}Install Context7 MCP? [y/N]:${RESET} `);
      if (c7Answer.trim().toLowerCase() === "y") {
        console.log(`\n  Installing Context7 MCP...`);
        try {
          execSync("claude mcp add context7 -- npx -y @upstash/context7-mcp@latest", { stdio: "inherit" });
          console.log(`\n  ${GREEN}✓${RESET} Context7 MCP configured`);
        } catch {
          console.log(`\n  ${RED}Failed.${RESET} Install manually: claude mcp add context7 -- npx -y @upstash/context7-mcp@latest\n`);
        }
      }
    }

    // LangChain Docs MCP
    const hasLcDocs = (() => {
      try {
        for (const p of [path.join(HOME, ".claude", "settings.json"), path.join(HOME, ".claude.json")]) {
          if (fs.existsSync(p)) {
            const s = JSON.parse(fs.readFileSync(p, "utf8"));
            if (s.mcpServers && (s.mcpServers["docs-langchain"] || s.mcpServers["LangChain Docs"])) return true;
          }
        }
      } catch {}
      return false;
    })();
    if (hasLcDocs) {
      console.log(`  ${GREEN}✓${RESET} LangChain Docs MCP already configured`);
    } else {
      console.log(`\n  ${BOLD}LangChain Docs MCP${RESET} — LangChain/LangGraph/LangSmith documentation search`);
      console.log(`  ${DIM}claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp${RESET}`);
      const lcAnswer = await ask(rl, `\n  ${YELLOW}Install LangChain Docs MCP? [y/N]:${RESET} `);
      if (lcAnswer.trim().toLowerCase() === "y") {
        console.log(`\n  Installing LangChain Docs MCP...`);
        try {
          execSync("claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp", { stdio: "inherit" });
          console.log(`\n  ${GREEN}✓${RESET} LangChain Docs MCP configured`);
        } catch {
          console.log(`\n  ${RED}Failed.${RESET} Install manually: claude mcp add docs-langchain --transport http https://docs.langchain.com/mcp\n`);
        }
      }
    }

    console.log(`\n  ${DIM}Quick start with example:${RESET}`);
    console.log(`  cp -r ~/.harness-evolver/examples/classifier ./my-project`);
    console.log(`  cd my-project && claude`);
package/skills/architect/SKILL.md
ADDED
@@ -0,0 +1,108 @@
---
name: harness-evolver:architect
description: "Use when the user wants to analyze harness architecture, get a topology recommendation, understand if their agent pattern is optimal, or after stagnation in the evolution loop."
argument-hint: "[--force]"
allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
---

# /harness-evolver:architect

Analyze the current harness architecture and recommend the optimal multi-agent topology.

## Prerequisites

`.harness-evolver/` must exist. If not, tell user to run `harness-evolver:init` first.

```bash
if [ ! -d ".harness-evolver" ]; then
  echo "ERROR: .harness-evolver/ not found. Run /harness-evolver:init first."
  exit 1
fi
```

## Resolve Tool Path

```bash
TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
```

Use `$TOOLS` prefix for all tool calls below.

## Step 1: Run Architecture Analysis

Build the command based on what exists:

```bash
CMD="python3 $TOOLS/analyze_architecture.py --harness .harness-evolver/baseline/harness.py"

# Add traces from best version if evolution has run
if [ -f ".harness-evolver/summary.json" ]; then
  BEST=$(python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(s.get('best',{}).get('version',''))")
  if [ -n "$BEST" ] && [ -d ".harness-evolver/harnesses/$BEST/traces" ]; then
    CMD="$CMD --traces-dir .harness-evolver/harnesses/$BEST/traces"
  fi
  CMD="$CMD --summary .harness-evolver/summary.json"
fi

CMD="$CMD -o .harness-evolver/architecture_signals.json"

eval $CMD
```

Check exit code. If it fails, report the error and stop.

## Step 2: Spawn Architect Agent

Spawn the `harness-evolver-architect` agent with:

> Analyze the harness and recommend the optimal multi-agent topology.
> Raw signals are at `.harness-evolver/architecture_signals.json`.
> Write `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`.

The architect agent will:
1. Read the signals JSON
2. Read the harness code and config
3. Classify the current topology
4. Assess if it matches task complexity
5. Recommend the optimal topology with migration steps
6. Write `architecture.json` and `architecture.md`

## Step 3: Report

After the architect agent completes, read the outputs and print a summary:

```
Architecture Analysis Complete
==============================
Current topology: {current_topology}
Recommended topology: {recommended_topology}
Confidence: {confidence}

Reasoning: {reasoning}

Migration Path:
1. {step 1 description}
2. {step 2 description}
...

Risks:
- {risk 1}
- {risk 2}

Next: Run /harness-evolver:evolve — the proposer will follow the migration path.
```

If the architect recommends no change (current = recommended), report:

```
Architecture Analysis Complete
==============================
Current topology: {topology} — looks optimal for these tasks.
No architecture change recommended. Score: {score}

The proposer can continue evolving within the current topology.
```

## Arguments

- `--force` — re-run analysis even if `architecture.json` already exists. Without this flag, if `architecture.json` exists, just display the existing recommendation.
package/skills/evolve/SKILL.md
CHANGED

@@ -92,3 +92,8 @@ Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best:
- Improvement over baseline (absolute and %)
- Total iterations run
- Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."

If the loop stopped due to stagnation AND `.harness-evolver/architecture.json` does NOT exist:

> The proposer may have hit an architectural ceiling. Run `/harness-evolver:architect`
> to analyze whether a different agent topology could help.
package/skills/init/SKILL.md
CHANGED

@@ -57,6 +57,21 @@ Add `--harness-config config.json` if a config exists.
- Baseline score
- Next: `harness-evolver:evolve` to start

## Architecture Hint

After init completes, run a quick architecture analysis:

```bash
python3 $TOOLS/analyze_architecture.py --harness .harness-evolver/baseline/harness.py
```

If the analysis suggests the current topology may not be optimal for the task complexity, mention it:

> Architecture note: Current topology is "{topology}". For tasks with {characteristics},
> consider running `/harness-evolver:architect` for a detailed recommendation.

This is advisory only — do not spawn the architect agent.

## Gotchas

- The harness must write valid JSON to `--output`. If the user's code returns non-JSON, the wrapper must serialize it.
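
The gotcha about valid JSON can be handled with a small serialization shim in the wrapper. A sketch (the helper name and payload shape are illustrative, not part of the package):

```python
import json

def write_output(result, output_path):
    """Serialize any harness result to valid JSON at output_path."""
    try:
        payload = json.dumps({"output": result})
    except TypeError:
        # Non-JSON-serializable result: fall back to its string form
        payload = json.dumps({"output": str(result)})
    with open(output_path, "w") as f:
        f.write(payload + "\n")
```

This guarantees the evaluator always finds parseable JSON at `--output`, even when the user's code returns a set, a custom object, or other non-JSON value.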
package/tools/analyze_architecture.py
ADDED

@@ -0,0 +1,512 @@
#!/usr/bin/env python3
"""Analyze harness architecture to detect current topology and produce signals.

Usage:
    analyze_architecture.py --harness PATH [--traces-dir PATH] [--summary PATH] [-o output.json]

Performs AST-based analysis of harness code, optional trace analysis, and optional
score analysis to classify the current agent topology and produce structured signals
for the architect agent.

Stdlib-only. No external dependencies.
"""

import argparse
import ast
import json
import os
import re
import sys


# --- AST Analysis ---

LLM_API_DOMAINS = [
    "api.anthropic.com",
    "api.openai.com",
    "generativelanguage.googleapis.com",
]

LLM_SDK_MODULES = {"openai", "anthropic", "langchain_openai", "langchain_anthropic",
                   "langchain_core", "langchain_community", "langchain"}

RETRIEVAL_MODULES = {"chromadb", "pinecone", "qdrant_client", "weaviate"}

RETRIEVAL_METHOD_NAMES = {"similarity_search", "query"}

GRAPH_FRAMEWORK_CLASSES = {"StateGraph"}
GRAPH_FRAMEWORK_METHODS = {"add_node", "add_edge"}

PARALLEL_PATTERNS = {"gather"}  # asyncio.gather
PARALLEL_CLASSES = {"ThreadPoolExecutor", "ProcessPoolExecutor"}

TOOL_DICT_KEYS = {"name", "description", "parameters"}


def _get_all_imports(tree):
    """Extract all imported module root names."""
    imports = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imports.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                imports.add(node.module.split(".")[0])
    return imports


def _get_all_import_modules(tree):
    """Extract all imported module full names (including submodules)."""
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                modules.add(node.module)
    return modules


def _count_string_matches(tree, patterns):
    """Count AST string constants that contain any of the given patterns."""
    count = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            for pattern in patterns:
                if pattern in node.value:
                    count += 1
                    break
    return count


def _count_llm_calls(tree, imports, source_text):
    """Count LLM API calls: urllib requests to known domains + SDK client calls."""
    count = 0

    # Count urllib.request calls with LLM API domains in string constants
    count += _count_string_matches(tree, LLM_API_DOMAINS)

    # Count SDK imports that imply LLM calls (each import of an LLM SDK = at least 1 call site)
    full_modules = _get_all_import_modules(tree)
    sdk_found = set()
    for mod in full_modules:
        root = mod.split(".")[0]
        if root in LLM_SDK_MODULES:
            sdk_found.add(root)

    # For SDK users, look for actual call patterns like .create, .chat, .invoke, .run
    llm_call_methods = {"create", "chat", "invoke", "run", "generate", "predict",
                        "complete", "completions"}
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Attribute):
                if node.func.attr in llm_call_methods and sdk_found:
                    count += 1

    # If we found SDK imports but no explicit call methods, count 1 per SDK
    if sdk_found and count == 0:
        count = len(sdk_found)

    return max(count, _count_string_matches(tree, LLM_API_DOMAINS))


def _has_loop_around_llm(tree, source_text):
    """Check if any LLM call is inside a loop (for/while)."""
    for node in ast.walk(tree):
        if isinstance(node, (ast.For, ast.While)):
            # Walk the loop body looking for LLM call signals
            for child in ast.walk(node):
                # Check for urllib.request.urlopen in a loop
                if isinstance(child, ast.Attribute) and child.attr == "urlopen":
                    return True
                # Check for SDK call methods in a loop
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Attribute):
                    if child.func.attr in {"create", "chat", "invoke", "run",
                                           "generate", "predict", "complete"}:
                        return True
                # Check for LLM API domain strings in a loop
                if isinstance(child, ast.Constant) and isinstance(child.value, str):
                    for domain in LLM_API_DOMAINS:
                        if domain in child.value:
                            return True
    return False


def _has_tool_definitions(tree):
    """Check for tool definitions: dicts with name/description/parameters keys, or @tool decorators."""
    # Check for @tool decorator
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for decorator in node.decorator_list:
                if isinstance(decorator, ast.Name) and decorator.id == "tool":
                    return True
                if isinstance(decorator, ast.Attribute) and decorator.attr == "tool":
                    return True

    # Check for dicts with tool-like keys
    for node in ast.walk(tree):
        if isinstance(node, ast.Dict):
            keys = set()
            for key in node.keys:
                if isinstance(key, ast.Constant) and isinstance(key.value, str):
                    keys.add(key.value)
            if TOOL_DICT_KEYS.issubset(keys):
                return True

    return False


def _has_retrieval(tree, imports):
    """Check for retrieval patterns: vector DB imports or .similarity_search/.query calls."""
    if imports & RETRIEVAL_MODULES:
        return True

    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute):
            if node.attr in RETRIEVAL_METHOD_NAMES:
                return True

    return False


def _has_graph_framework(tree, full_modules):
    """Check for graph framework usage (LangGraph StateGraph, add_node, add_edge)."""
    # Check if langgraph is imported
    for mod in full_modules:
        if "langgraph" in mod:
            return True

    # Check for StateGraph usage
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id in GRAPH_FRAMEWORK_CLASSES:
            return True
        if isinstance(node, ast.Attribute):
            if node.attr in GRAPH_FRAMEWORK_CLASSES or node.attr in GRAPH_FRAMEWORK_METHODS:
                return True

    return False


def _has_parallel_execution(tree, imports):
    """Check for asyncio.gather, concurrent.futures, ThreadPoolExecutor."""
    if "concurrent" in imports:
        return True

    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute):
            if node.attr == "gather":
                return True
            if node.attr in PARALLEL_CLASSES:
                return True
        if isinstance(node, ast.Name) and node.id in PARALLEL_CLASSES:
            return True

    return False


def _has_error_handling_around_llm(tree):
    """Check if LLM calls are wrapped in try/except."""
    for node in ast.walk(tree):
        if isinstance(node, ast.Try):
            # Walk the try body for LLM signals
            for child in ast.walk(node):
                if isinstance(child, ast.Attribute) and child.attr == "urlopen":
                    return True
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Attribute):
                    if child.func.attr in {"create", "chat", "invoke", "run",
                                           "generate", "predict", "complete"}:
                        return True
                if isinstance(child, ast.Constant) and isinstance(child.value, str):
                    for domain in LLM_API_DOMAINS:
                        if domain in child.value:
                            return True
    return False


def _count_functions(tree):
    """Count function definitions (top-level and nested)."""
    count = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            count += 1
    return count


def _count_classes(tree):
    """Count class definitions."""
    count = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            count += 1
    return count


def _estimate_topology(signals):
    """Classify the current topology based on code signals."""
    if signals["has_graph_framework"]:
        if signals["has_parallel_execution"]:
            return "parallel"
        return "hierarchical"

    if signals["has_retrieval"]:
        return "rag"

    if signals["has_loop_around_llm"]:
        if signals["has_tool_definitions"]:
            return "react-loop"
        return "react-loop"

    if signals["llm_call_count"] >= 3:
        if signals["has_tool_definitions"]:
            return "react-loop"
        return "chain"

    if signals["llm_call_count"] == 2:
        return "chain"

    if signals["llm_call_count"] <= 1:
        return "single-call"

    return "single-call"


def analyze_code(harness_path):
    """Analyze a harness Python file and return code signals."""
    with open(harness_path) as f:
        source = f.read()

    try:
        tree = ast.parse(source)
    except SyntaxError:
        return {
            "llm_call_count": 0,
            "has_loop_around_llm": False,
            "has_tool_definitions": False,
            "has_retrieval": False,
            "has_graph_framework": False,
            "has_parallel_execution": False,
            "has_error_handling": False,
            "estimated_topology": "unknown",
            "code_lines": len(source.splitlines()),
            "function_count": 0,
            "class_count": 0,
        }

    imports = _get_all_imports(tree)
    full_modules = _get_all_import_modules(tree)

    llm_call_count = _count_llm_calls(tree, imports, source)
    has_loop = _has_loop_around_llm(tree, source)
    has_tools = _has_tool_definitions(tree)
    has_retrieval = _has_retrieval(tree, imports)
    has_graph = _has_graph_framework(tree, full_modules)
    has_parallel = _has_parallel_execution(tree, imports)
    has_error = _has_error_handling_around_llm(tree)

    signals = {
        "llm_call_count": llm_call_count,
        "has_loop_around_llm": has_loop,
        "has_tool_definitions": has_tools,
        "has_retrieval": has_retrieval,
        "has_graph_framework": has_graph,
        "has_parallel_execution": has_parallel,
        "has_error_handling": has_error,
        "code_lines": len(source.splitlines()),
        "function_count": _count_functions(tree),
        "class_count": _count_classes(tree),
    }
    signals["estimated_topology"] = _estimate_topology(signals)

    return signals


# --- Trace Analysis ---

def analyze_traces(traces_dir):
    """Analyze execution traces for error patterns, timing, and failures."""
    if not os.path.isdir(traces_dir):
        return None

    result = {
        "error_patterns": [],
        "timing": None,
        "task_failures": [],
        "stderr_lines": 0,
    }

    # Read stderr.log
    stderr_path = os.path.join(traces_dir, "stderr.log")
    if os.path.isfile(stderr_path):
        try:
            with open(stderr_path) as f:
                stderr = f.read()
            lines = stderr.strip().splitlines()
            result["stderr_lines"] = len(lines)

            # Detect common error patterns
            error_counts = {}
            for line in lines:
                for pattern in ["Traceback", "Error", "Exception", "Timeout",
                                "ConnectionRefused", "HTTPError", "JSONDecodeError",
                                "KeyError", "TypeError", "ValueError"]:
                    if pattern in line:
                        error_counts[pattern] = error_counts.get(pattern, 0) + 1

            result["error_patterns"] = [
                {"pattern": p, "count": c}
                for p, c in sorted(error_counts.items(), key=lambda x: -x[1])
            ]
        except Exception:
            pass

    # Read timing.json
    timing_path = os.path.join(traces_dir, "timing.json")
    if os.path.isfile(timing_path):
        try:
            with open(timing_path) as f:
                timing = json.load(f)
            result["timing"] = timing
        except (json.JSONDecodeError, Exception):
            pass

    # Scan per-task output directories for failures
    for entry in sorted(os.listdir(traces_dir)):
        task_dir = os.path.join(traces_dir, entry)
        if os.path.isdir(task_dir) and entry.startswith("task_"):
            output_path = os.path.join(task_dir, "output.json")
            if os.path.isfile(output_path):
                try:
                    with open(output_path) as f:
                        output = json.load(f)
                    # Check for empty or error outputs
                    out_value = output.get("output", "")
                    if not out_value or out_value in ("error", "unknown", ""):
                        result["task_failures"].append({
                            "task": entry,
                            "output": out_value,
                        })
                except (json.JSONDecodeError, Exception):
                    result["task_failures"].append({
                        "task": entry,
                        "output": "parse_error",
                    })

    return result


# --- Score Analysis ---

def analyze_scores(summary_path):
    """Analyze summary.json for stagnation, oscillation, and per-task failures."""
    if not os.path.isfile(summary_path):
        return None

    try:
        with open(summary_path) as f:
            summary = json.load(f)
    except (json.JSONDecodeError, Exception):
        return None

    result = {
        "iterations": summary.get("iterations", 0),
        "best_score": 0.0,
        "baseline_score": 0.0,
        "recent_scores": [],
        "is_stagnating": False,
        "is_oscillating": False,
        "score_trend": "unknown",
    }

    # Extract best score
    best = summary.get("best", {})
    result["best_score"] = best.get("combined_score", 0.0)
    result["baseline_score"] = summary.get("baseline_score", 0.0)

    # Extract recent version scores
    versions = summary.get("versions", [])
    if isinstance(versions, list):
        recent = versions[-5:] if len(versions) > 5 else versions
        result["recent_scores"] = [
            {"version": v.get("version", "?"), "score": v.get("combined_score", 0.0)}
            for v in recent
        ]
    elif isinstance(versions, dict):
        items = sorted(versions.items())
        recent = items[-5:] if len(items) > 5 else items
        result["recent_scores"] = [
            {"version": k, "score": v.get("combined_score", 0.0)}
            for k, v in recent
        ]

    # Detect stagnation (last 3+ scores within 1% of each other)
    scores = [s["score"] for s in result["recent_scores"]]
    if len(scores) >= 3:
        last_3 = scores[-3:]
        spread = max(last_3) - min(last_3)
        if spread <= 0.01:
            result["is_stagnating"] = True

    # Detect oscillation (alternating up/down for last 4+ scores)
    if len(scores) >= 4:
        deltas = [scores[i+1] - scores[i] for i in range(len(scores)-1)]
        sign_changes = sum(
            1 for i in range(len(deltas)-1)
            if (deltas[i] > 0 and deltas[i+1] < 0) or (deltas[i] < 0 and deltas[i+1] > 0)
        )
        if sign_changes >= len(deltas) - 1:
            result["is_oscillating"] = True

    # Score trend
    if len(scores) >= 2:
        if scores[-1] > scores[0]:
            result["score_trend"] = "improving"
        elif scores[-1] < scores[0]:
            result["score_trend"] = "declining"
        else:
            result["score_trend"] = "flat"

    return result


# --- Main ---

def main():
    parser = argparse.ArgumentParser(
        description="Analyze harness architecture and produce signals for the architect agent",
        usage="analyze_architecture.py --harness PATH [--traces-dir PATH] [--summary PATH] [-o output.json]",
    )
    parser.add_argument("--harness", required=True, help="Path to harness Python file")
    parser.add_argument("--traces-dir", default=None, help="Path to traces directory")
    parser.add_argument("--summary", default=None, help="Path to summary.json")
    parser.add_argument("-o", "--output", default=None, help="Output JSON path")
    args = parser.parse_args()

    if not os.path.isfile(args.harness):
        print(json.dumps({"error": f"Harness file not found: {args.harness}"}))
        sys.exit(1)

    result = {
        "code_signals": analyze_code(args.harness),
        "trace_signals": None,
        "score_signals": None,
    }

    if args.traces_dir:
        result["trace_signals"] = analyze_traces(args.traces_dir)

    if args.summary:
        result["score_signals"] = analyze_scores(args.summary)

    output = json.dumps(result, indent=2)

    if args.output:
        with open(args.output, "w") as f:
            f.write(output + "\n")
    else:
        print(output)


if __name__ == "__main__":
    main()
package/tools/init.py
CHANGED

@@ -317,6 +317,29 @@ def main():
    print("\nRecommendation: install Context7 MCP for up-to-date documentation:")
    print("  claude mcp add context7 -- npx -y @upstash/context7-mcp@latest")

    # Architecture analysis (quick, advisory)
    analyze_py = os.path.join(tools, "analyze_architecture.py")
    if os.path.exists(analyze_py):
        try:
            r = subprocess.run(
                ["python3", analyze_py, "--harness", args.harness],
                capture_output=True, text=True, timeout=30,
            )
            if r.returncode == 0 and r.stdout.strip():
                arch_signals = json.loads(r.stdout)
                config["architecture"] = {
                    "current_topology": arch_signals.get("code_signals", {}).get("estimated_topology", "unknown"),
                    "auto_analyzed": True,
                }
                # Re-write config with architecture
                with open(os.path.join(base, "config.json"), "w") as f:
                    json.dump(config, f, indent=2)
                topo = config["architecture"]["current_topology"]
                if topo != "unknown":
                    print(f"Architecture: {topo}")
        except Exception:
            pass

    # 5. Validate baseline harness
    print("Validating baseline harness...")
    val_args = ["python3", evaluate_py, "validate",