harness-evolver 0.9.0 → 1.0.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/agents/harness-evolver-architect.md +147 -0
- package/agents/harness-evolver-proposer.md +10 -0
- package/package.json +1 -1
- package/skills/architect/SKILL.md +108 -0
- package/skills/evolve/SKILL.md +5 -0
- package/skills/init/SKILL.md +15 -0
- package/tools/analyze_architecture.py +512 -0
- package/tools/init.py +23 -0
package/agents/harness-evolver-architect.md
ADDED
@@ -0,0 +1,147 @@
---
name: harness-evolver-architect
description: |
  Use this agent when the harness-evolver:architect skill needs to analyze a harness
  and recommend the optimal multi-agent topology. Reads code analysis signals, traces,
  and scores to produce a migration plan from current to recommended architecture.
model: opus
---

# Harness Evolver — Architect Agent

You are the architect in a Meta-Harness optimization system. Your job is to analyze a harness's current agent topology, assess whether it matches the task complexity, and recommend the optimal topology with a concrete migration plan.

## Context

You work inside a `.harness-evolver/` directory. The skill has already run `analyze_architecture.py` to produce raw signals. You will read those signals, the harness code, and any evolution history to produce your recommendation.

## Your Workflow

### Phase 1: READ SIGNALS

1. Read the raw signals JSON output from `analyze_architecture.py` (path provided in your prompt).
2. Read the harness code:
   - `.harness-evolver/baseline/harness.py` (always exists)
   - The current best candidate from `summary.json` → `.harness-evolver/harnesses/{best}/harness.py` (if evolution has run)
3. Read `config.json` for:
   - `stack.detected` — what libraries/frameworks are in use
   - `api_keys` — which LLM APIs are available
   - `eval.langsmith` — whether tracing is enabled
4. Read `summary.json` and `PROPOSER_HISTORY.md` if they exist (to understand evolution progress).

### Phase 2: CLASSIFY & ASSESS

Classify the current topology from the code signals. The `estimated_topology` field is a starting point, but verify it by reading the actual code. Possible topologies:

| Topology | Description | Signals |
|---|---|---|
| `single-call` | One LLM call, no iteration | llm_calls=1, no loops, no tools |
| `chain` | Sequential LLM calls (analyze→generate→validate) | llm_calls>=2, no loops |
| `react-loop` | Tool use with iterative refinement | loop around LLM, tool definitions |
| `rag` | Retrieval-augmented generation | retrieval imports/methods |
| `judge-critic` | Generate then critique/verify | llm_calls>=2, one acts as judge |
| `hierarchical` | Decompose task, delegate to sub-agents | graph framework, multiple distinct agents |
| `parallel` | Same operation on multiple inputs concurrently | asyncio.gather, ThreadPoolExecutor |
| `sequential-routing` | Route different task types to different paths | conditional branching on task type |
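The mapping from signals to labels can be sketched in a few lines of Python. This is an illustration of the table above only — the real classifier lives in `tools/analyze_architecture.py` and handles more cases (parallel, judge-critic detection, etc.):

```python
def classify_topology(signals: dict) -> str:
    """Map code signals to a topology label, roughly following the table above."""
    if signals.get("has_graph_framework"):
        return "hierarchical"      # StateGraph / add_node / add_edge present
    if signals.get("has_retrieval"):
        return "rag"               # vector-DB imports or similarity_search calls
    if signals.get("has_loop_around_llm"):
        return "react-loop"        # LLM call inside a for/while loop
    if signals.get("llm_call_count", 0) >= 2:
        return "chain"             # sequential calls, no loop
    return "single-call"
```

For example, a harness with one plain LLM call and no loops classifies as `single-call`; two sequential calls classify as `chain`.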
Assess whether the current topology matches the task complexity:
- Read the eval tasks to understand what the harness needs to do
- Consider the current score — is there room for improvement?
- Consider the task diversity — do different tasks need different approaches?

### Phase 3: RECOMMEND

Choose the optimal topology based on:
- **Task characteristics**: simple classification → single-call; multi-step reasoning → chain or react-loop; knowledge-intensive → rag; quality-critical → judge-critic
- **Current score**: if >0.9 and topology seems adequate, do NOT recommend changes
- **Stack constraints**: recommend patterns compatible with the detected stack (don't suggest LangGraph if the user uses raw urllib)
- **API availability**: check which API keys exist before recommending patterns that need specific providers
- **Code size**: don't recommend hierarchical for a 50-line harness

### Phase 4: WRITE PLAN

Create two output files:

**`.harness-evolver/architecture.json`**:
```json
{
  "current_topology": "single-call",
  "recommended_topology": "chain",
  "confidence": "medium",
  "reasoning": "The harness makes a single LLM call but tasks require multi-step reasoning (classify then validate). A chain topology could improve accuracy by adding a verification step.",
  "migration_path": [
    {
      "step": 1,
      "description": "Add a validation LLM call after classification to verify the category matches the symptoms",
      "changes": "Add a second API call that takes the classification result and original input, asks 'Does category X match these symptoms? Reply yes/no.'",
      "expected_impact": "Reduce false positives by ~15%"
    },
    {
      "step": 2,
      "description": "Add structured output parsing with fallback",
      "changes": "Parse LLM response with regex, fall back to keyword matching if parse fails",
      "expected_impact": "Eliminate malformed output errors"
    }
  ],
  "signals_used": ["llm_call_count=1", "has_loop_around_llm=false", "code_lines=45"],
  "risks": [
    "Additional LLM call doubles latency and cost",
    "Verification step may introduce its own errors"
  ],
  "alternative": {
    "topology": "judge-critic",
    "reason": "If chain doesn't improve scores, a judge-critic pattern where a second model evaluates the classification could catch more errors, but at higher cost"
  }
}
```

**`.harness-evolver/architecture.md`** — human-readable version:

```markdown
# Architecture Analysis

## Current Topology: single-call
[Description of what the harness currently does]

## Recommended Topology: chain (confidence: medium)
[Reasoning]

## Migration Path
1. [Step 1 description]
2. [Step 2 description]

## Risks
- [Risk 1]
- [Risk 2]

## Alternative
If the recommended topology doesn't improve scores: [alternative]
```

## Rules

1. **Do NOT recommend changes if current score >0.9 and topology seems adequate.** A working harness that scores well should not be restructured speculatively. Write `architecture.json` with `recommended_topology` equal to `current_topology` and confidence "high".

2. **Always provide concrete migration steps, not just "switch to X".** Each step should describe exactly what code to add/change and what it should accomplish.

3. **Consider the detected stack.** Don't recommend LangGraph patterns if the user is using raw urllib. Don't recommend LangChain if they use the Anthropic SDK directly. Match the style.

4. **Consider API key availability.** If only ANTHROPIC_API_KEY is available, don't recommend a pattern that requires multiple providers. Check `config.json` → `api_keys`.

5. **Migration should be incremental.** Each step in `migration_path` corresponds to one evolution iteration. The proposer will implement one step at a time. Steps should be independently valuable (each step should improve, or at least not regress, the score).

6. **Rate confidence honestly:**
   - `"high"` — strong signal match, clear improvement path, similar patterns known to work
   - `"medium"` — reasonable hypothesis but task-specific factors could change the outcome
   - `"low"` — speculative, insufficient data, or signals are ambiguous

7. **Do NOT modify any harness code.** You only analyze and recommend. The proposer implements.

8. **Do NOT modify files in `eval/` or `baseline/`.** These are immutable.

## What You Do NOT Do

- Do NOT write or modify harness code — you produce analysis and recommendations only
- Do NOT run evaluations — the evolve skill handles that
- Do NOT modify `eval/`, `baseline/`, or any existing harness version
- Do NOT create files outside of `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`
package/agents/harness-evolver-proposer.md
CHANGED
@@ -92,6 +92,16 @@ Write a clear `proposal.md` that includes:
 
 Append a summary to `PROPOSER_HISTORY.md`.
 
+## Architecture Guidance (if available)
+
+If `.harness-evolver/architecture.json` exists, read it in Phase 1 (ORIENT). The architect agent has recommended a target topology and migration path.
+
+- Work TOWARD the recommended topology incrementally — one migration step per iteration
+- Do NOT rewrite the entire harness in one iteration
+- Document which migration step you are implementing in `proposal.md`
+- If a migration step causes regression, note it and consider reverting or deviating
+- If `architecture.json` does NOT exist, ignore this section and evolve freely
+
 ## Rules
 
 1. **Every change motivated by evidence.** Cite the task ID, trace line, or score delta that justifies the change. Never change code "to see what happens."
package/package.json
CHANGED

package/skills/architect/SKILL.md
ADDED
@@ -0,0 +1,108 @@
---
name: harness-evolver:architect
description: "Use when the user wants to analyze harness architecture, get a topology recommendation, understand if their agent pattern is optimal, or after stagnation in the evolution loop."
argument-hint: "[--force]"
allowed-tools: [Read, Write, Edit, Bash, Glob, Grep, Agent]
---

# /harness-evolver:architect

Analyze the current harness architecture and recommend the optimal multi-agent topology.

## Prerequisites

`.harness-evolver/` must exist. If not, tell the user to run `harness-evolver:init` first.

```bash
if [ ! -d ".harness-evolver" ]; then
  echo "ERROR: .harness-evolver/ not found. Run /harness-evolver:init first."
  exit 1
fi
```

## Resolve Tool Path

```bash
TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
```

Use the `$TOOLS` prefix for all tool calls below.

## Step 1: Run Architecture Analysis

Build the command based on what exists:

```bash
CMD="python3 $TOOLS/analyze_architecture.py --harness .harness-evolver/baseline/harness.py"

# Add traces from the best version if evolution has run
if [ -f ".harness-evolver/summary.json" ]; then
  BEST=$(python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(s.get('best',{}).get('version',''))")
  if [ -n "$BEST" ] && [ -d ".harness-evolver/harnesses/$BEST/traces" ]; then
    CMD="$CMD --traces-dir .harness-evolver/harnesses/$BEST/traces"
  fi
  CMD="$CMD --summary .harness-evolver/summary.json"
fi

CMD="$CMD -o .harness-evolver/architecture_signals.json"

eval "$CMD"
```

Check the exit code. If it fails, report the error and stop.

## Step 2: Spawn Architect Agent

Spawn the `harness-evolver-architect` agent with:

> Analyze the harness and recommend the optimal multi-agent topology.
> Raw signals are at `.harness-evolver/architecture_signals.json`.
> Write `.harness-evolver/architecture.json` and `.harness-evolver/architecture.md`.

The architect agent will:
1. Read the signals JSON
2. Read the harness code and config
3. Classify the current topology
4. Assess if it matches task complexity
5. Recommend the optimal topology with migration steps
6. Write `architecture.json` and `architecture.md`

## Step 3: Report

After the architect agent completes, read the outputs and print a summary:

```
Architecture Analysis Complete
==============================
Current topology: {current_topology}
Recommended topology: {recommended_topology}
Confidence: {confidence}

Reasoning: {reasoning}

Migration Path:
1. {step 1 description}
2. {step 2 description}
...

Risks:
- {risk 1}
- {risk 2}

Next: Run /harness-evolver:evolve — the proposer will follow the migration path.
```

If the architect recommends no change (current = recommended), report:

```
Architecture Analysis Complete
==============================
Current topology: {topology} — looks optimal for these tasks.
No architecture change recommended. Score: {score}

The proposer can continue evolving within the current topology.
```

## Arguments

- `--force` — re-run analysis even if `architecture.json` already exists. Without this flag, if `architecture.json` exists, just display the existing recommendation.
package/skills/evolve/SKILL.md
CHANGED
@@ -92,3 +92,8 @@ Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: 
 - Improvement over baseline (absolute and %)
 - Total iterations run
 - Suggest: "The best harness is at `.harness-evolver/harnesses/{best}/harness.py`. Copy it to your project."
+
+If the loop stopped due to stagnation AND `.harness-evolver/architecture.json` does NOT exist:
+
+> The proposer may have hit an architectural ceiling. Run `/harness-evolver:architect`
+> to analyze whether a different agent topology could help.
package/skills/init/SKILL.md
CHANGED
@@ -57,6 +57,21 @@ Add `--harness-config config.json` if a config exists.
 - Baseline score
 - Next: `harness-evolver:evolve` to start
 
+## Architecture Hint
+
+After init completes, run a quick architecture analysis:
+
+```bash
+python3 $TOOLS/analyze_architecture.py --harness .harness-evolver/baseline/harness.py
+```
+
+If the analysis suggests the current topology may not be optimal for the task complexity, mention it:
+
+> Architecture note: Current topology is "{topology}". For tasks with {characteristics},
+> consider running `/harness-evolver:architect` for a detailed recommendation.
+
+This is advisory only — do not spawn the architect agent.
+
 ## Gotchas
 
 - The harness must write valid JSON to `--output`. If the user's code returns non-JSON, the wrapper must serialize it.
package/tools/analyze_architecture.py
ADDED
@@ -0,0 +1,512 @@
#!/usr/bin/env python3
"""Analyze harness architecture to detect current topology and produce signals.

Usage:
    analyze_architecture.py --harness PATH [--traces-dir PATH] [--summary PATH] [-o output.json]

Performs AST-based analysis of harness code, optional trace analysis, and optional
score analysis to classify the current agent topology and produce structured signals
for the architect agent.

Stdlib-only. No external dependencies.
"""

import argparse
import ast
import json
import os
import re
import sys


# --- AST Analysis ---

LLM_API_DOMAINS = [
    "api.anthropic.com",
    "api.openai.com",
    "generativelanguage.googleapis.com",
]

LLM_SDK_MODULES = {"openai", "anthropic", "langchain_openai", "langchain_anthropic",
                   "langchain_core", "langchain_community", "langchain"}

RETRIEVAL_MODULES = {"chromadb", "pinecone", "qdrant_client", "weaviate"}

RETRIEVAL_METHOD_NAMES = {"similarity_search", "query"}

GRAPH_FRAMEWORK_CLASSES = {"StateGraph"}
GRAPH_FRAMEWORK_METHODS = {"add_node", "add_edge"}

PARALLEL_PATTERNS = {"gather"}  # asyncio.gather
PARALLEL_CLASSES = {"ThreadPoolExecutor", "ProcessPoolExecutor"}

TOOL_DICT_KEYS = {"name", "description", "parameters"}


def _get_all_imports(tree):
    """Extract all imported module root names."""
    imports = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imports.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                imports.add(node.module.split(".")[0])
    return imports


def _get_all_import_modules(tree):
    """Extract all imported module full names (including submodules)."""
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                modules.add(node.module)
    return modules


def _count_string_matches(tree, patterns):
    """Count AST string constants that contain any of the given patterns."""
    count = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            for pattern in patterns:
                if pattern in node.value:
                    count += 1
                    break
    return count


def _count_llm_calls(tree, imports, source_text):
    """Count LLM API calls: urllib requests to known domains + SDK client calls."""
    count = 0

    # Count urllib.request calls with LLM API domains in string constants
    count += _count_string_matches(tree, LLM_API_DOMAINS)

    # Count SDK imports that imply LLM calls (each import of an LLM SDK = at least 1 call site)
    full_modules = _get_all_import_modules(tree)
    sdk_found = set()
    for mod in full_modules:
        root = mod.split(".")[0]
        if root in LLM_SDK_MODULES:
            sdk_found.add(root)

    # For SDK users, look for actual call patterns like .create, .chat, .invoke, .run
    llm_call_methods = {"create", "chat", "invoke", "run", "generate", "predict",
                        "complete", "completions"}
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Attribute):
                if node.func.attr in llm_call_methods and sdk_found:
                    count += 1

    # If we found SDK imports but no explicit call methods, count 1 per SDK
    if sdk_found and count == 0:
        count = len(sdk_found)

    # count already includes the domain-string matches counted above
    return count


def _has_loop_around_llm(tree, source_text):
    """Check if any LLM call is inside a loop (for/while)."""
    for node in ast.walk(tree):
        if isinstance(node, (ast.For, ast.While)):
            # Walk the loop body looking for LLM call signals
            for child in ast.walk(node):
                # Check for urllib.request.urlopen in a loop
                if isinstance(child, ast.Attribute) and child.attr == "urlopen":
                    return True
                # Check for SDK call methods in a loop
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Attribute):
                    if child.func.attr in {"create", "chat", "invoke", "run",
                                           "generate", "predict", "complete"}:
                        return True
                # Check for LLM API domain strings in a loop
                if isinstance(child, ast.Constant) and isinstance(child.value, str):
                    for domain in LLM_API_DOMAINS:
                        if domain in child.value:
                            return True
    return False


def _has_tool_definitions(tree):
    """Check for tool definitions: dicts with name/description/parameters keys, or @tool decorators."""
    # Check for @tool decorator
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for decorator in node.decorator_list:
                if isinstance(decorator, ast.Name) and decorator.id == "tool":
                    return True
                if isinstance(decorator, ast.Attribute) and decorator.attr == "tool":
                    return True

    # Check for dicts with tool-like keys
    for node in ast.walk(tree):
        if isinstance(node, ast.Dict):
            keys = set()
            for key in node.keys:
                if isinstance(key, ast.Constant) and isinstance(key.value, str):
                    keys.add(key.value)
            if TOOL_DICT_KEYS.issubset(keys):
                return True

    return False


def _has_retrieval(tree, imports):
    """Check for retrieval patterns: vector DB imports or .similarity_search/.query calls."""
    if imports & RETRIEVAL_MODULES:
        return True

    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute):
            if node.attr in RETRIEVAL_METHOD_NAMES:
                return True

    return False


def _has_graph_framework(tree, full_modules):
    """Check for graph framework usage (LangGraph StateGraph, add_node, add_edge)."""
    # Check if langgraph is imported
    for mod in full_modules:
        if "langgraph" in mod:
            return True

    # Check for StateGraph usage
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id in GRAPH_FRAMEWORK_CLASSES:
            return True
        if isinstance(node, ast.Attribute):
            if node.attr in GRAPH_FRAMEWORK_CLASSES or node.attr in GRAPH_FRAMEWORK_METHODS:
                return True

    return False


def _has_parallel_execution(tree, imports):
    """Check for asyncio.gather, concurrent.futures, ThreadPoolExecutor."""
    if "concurrent" in imports:
        return True

    for node in ast.walk(tree):
        if isinstance(node, ast.Attribute):
            if node.attr == "gather":
                return True
            if node.attr in PARALLEL_CLASSES:
                return True
        if isinstance(node, ast.Name) and node.id in PARALLEL_CLASSES:
            return True

    return False


def _has_error_handling_around_llm(tree):
    """Check if LLM calls are wrapped in try/except."""
    for node in ast.walk(tree):
        if isinstance(node, ast.Try):
            # Walk the try body for LLM signals
            for child in ast.walk(node):
                if isinstance(child, ast.Attribute) and child.attr == "urlopen":
                    return True
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Attribute):
                    if child.func.attr in {"create", "chat", "invoke", "run",
                                           "generate", "predict", "complete"}:
                        return True
                if isinstance(child, ast.Constant) and isinstance(child.value, str):
                    for domain in LLM_API_DOMAINS:
                        if domain in child.value:
                            return True
    return False


def _count_functions(tree):
    """Count function definitions (top-level and nested)."""
    count = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            count += 1
    return count


def _count_classes(tree):
    """Count class definitions."""
    count = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            count += 1
    return count


def _estimate_topology(signals):
    """Classify the current topology based on code signals."""
    if signals["has_graph_framework"]:
        if signals["has_parallel_execution"]:
            return "parallel"
        return "hierarchical"

    if signals["has_retrieval"]:
        return "rag"

    if signals["has_loop_around_llm"]:
        return "react-loop"

    if signals["llm_call_count"] >= 3:
        if signals["has_tool_definitions"]:
            return "react-loop"
        return "chain"

    if signals["llm_call_count"] == 2:
        return "chain"

    return "single-call"


def analyze_code(harness_path):
    """Analyze a harness Python file and return code signals."""
    with open(harness_path) as f:
        source = f.read()

    try:
        tree = ast.parse(source)
    except SyntaxError:
        return {
            "llm_call_count": 0,
            "has_loop_around_llm": False,
            "has_tool_definitions": False,
            "has_retrieval": False,
            "has_graph_framework": False,
            "has_parallel_execution": False,
            "has_error_handling": False,
            "estimated_topology": "unknown",
            "code_lines": len(source.splitlines()),
            "function_count": 0,
            "class_count": 0,
        }

    imports = _get_all_imports(tree)
    full_modules = _get_all_import_modules(tree)

    llm_call_count = _count_llm_calls(tree, imports, source)
    has_loop = _has_loop_around_llm(tree, source)
    has_tools = _has_tool_definitions(tree)
    has_retrieval = _has_retrieval(tree, imports)
    has_graph = _has_graph_framework(tree, full_modules)
    has_parallel = _has_parallel_execution(tree, imports)
    has_error = _has_error_handling_around_llm(tree)

    signals = {
        "llm_call_count": llm_call_count,
        "has_loop_around_llm": has_loop,
        "has_tool_definitions": has_tools,
        "has_retrieval": has_retrieval,
        "has_graph_framework": has_graph,
        "has_parallel_execution": has_parallel,
        "has_error_handling": has_error,
        "code_lines": len(source.splitlines()),
        "function_count": _count_functions(tree),
        "class_count": _count_classes(tree),
    }
    signals["estimated_topology"] = _estimate_topology(signals)

    return signals


# --- Trace Analysis ---

def analyze_traces(traces_dir):
    """Analyze execution traces for error patterns, timing, and failures."""
    if not os.path.isdir(traces_dir):
        return None

    result = {
        "error_patterns": [],
        "timing": None,
        "task_failures": [],
        "stderr_lines": 0,
    }

    # Read stderr.log
    stderr_path = os.path.join(traces_dir, "stderr.log")
    if os.path.isfile(stderr_path):
        try:
            with open(stderr_path) as f:
                stderr = f.read()
            lines = stderr.strip().splitlines()
            result["stderr_lines"] = len(lines)

            # Detect common error patterns
            error_counts = {}
            for line in lines:
                for pattern in ["Traceback", "Error", "Exception", "Timeout",
                                "ConnectionRefused", "HTTPError", "JSONDecodeError",
                                "KeyError", "TypeError", "ValueError"]:
                    if pattern in line:
                        error_counts[pattern] = error_counts.get(pattern, 0) + 1

            result["error_patterns"] = [
                {"pattern": p, "count": c}
                for p, c in sorted(error_counts.items(), key=lambda x: -x[1])
            ]
        except Exception:
            pass

    # Read timing.json
    timing_path = os.path.join(traces_dir, "timing.json")
    if os.path.isfile(timing_path):
        try:
            with open(timing_path) as f:
                timing = json.load(f)
            result["timing"] = timing
        except Exception:  # includes json.JSONDecodeError
            pass

    # Scan per-task output directories for failures
    for entry in sorted(os.listdir(traces_dir)):
        task_dir = os.path.join(traces_dir, entry)
        if os.path.isdir(task_dir) and entry.startswith("task_"):
            output_path = os.path.join(task_dir, "output.json")
            if os.path.isfile(output_path):
                try:
                    with open(output_path) as f:
                        output = json.load(f)
                    # Check for empty or error outputs
                    out_value = output.get("output", "")
                    if not out_value or out_value in ("error", "unknown", ""):
                        result["task_failures"].append({
                            "task": entry,
                            "output": out_value,
                        })
                except Exception:  # includes json.JSONDecodeError
                    result["task_failures"].append({
                        "task": entry,
                        "output": "parse_error",
                    })

    return result


# --- Score Analysis ---

def analyze_scores(summary_path):
    """Analyze summary.json for stagnation, oscillation, and per-task failures."""
    if not os.path.isfile(summary_path):
        return None

    try:
        with open(summary_path) as f:
            summary = json.load(f)
    except Exception:  # includes json.JSONDecodeError
        return None

    result = {
        "iterations": summary.get("iterations", 0),
        "best_score": 0.0,
        "baseline_score": 0.0,
        "recent_scores": [],
        "is_stagnating": False,
        "is_oscillating": False,
        "score_trend": "unknown",
    }

    # Extract best score
    best = summary.get("best", {})
    result["best_score"] = best.get("combined_score", 0.0)
    result["baseline_score"] = summary.get("baseline_score", 0.0)

    # Extract recent version scores
    versions = summary.get("versions", [])
    if isinstance(versions, list):
        recent = versions[-5:] if len(versions) > 5 else versions
        result["recent_scores"] = [
            {"version": v.get("version", "?"), "score": v.get("combined_score", 0.0)}
            for v in recent
        ]
    elif isinstance(versions, dict):
        items = sorted(versions.items())
        recent = items[-5:] if len(items) > 5 else items
        result["recent_scores"] = [
            {"version": k, "score": v.get("combined_score", 0.0)}
            for k, v in recent
        ]

    # Detect stagnation (last 3+ scores within 1% of each other)
    scores = [s["score"] for s in result["recent_scores"]]
    if len(scores) >= 3:
        last_3 = scores[-3:]
        spread = max(last_3) - min(last_3)
        if spread <= 0.01:
            result["is_stagnating"] = True

    # Detect oscillation (alternating up/down for last 4+ scores)
    if len(scores) >= 4:
        deltas = [scores[i+1] - scores[i] for i in range(len(scores)-1)]
        sign_changes = sum(
|
|
455
|
+
1 for i in range(len(deltas)-1)
|
|
456
|
+
if (deltas[i] > 0 and deltas[i+1] < 0) or (deltas[i] < 0 and deltas[i+1] > 0)
|
|
457
|
+
)
|
|
458
|
+
if sign_changes >= len(deltas) - 1:
|
|
459
|
+
result["is_oscillating"] = True
|
|
460
|
+
|
|
461
|
+
# Score trend
|
|
462
|
+
if len(scores) >= 2:
|
|
463
|
+
if scores[-1] > scores[0]:
|
|
464
|
+
result["score_trend"] = "improving"
|
|
465
|
+
elif scores[-1] < scores[0]:
|
|
466
|
+
result["score_trend"] = "declining"
|
|
467
|
+
else:
|
|
468
|
+
result["score_trend"] = "flat"
|
|
469
|
+
|
|
470
|
+
return result
|
|
471
|
+
|
|
472
|
+
|
|
473
|
+
# --- Main ---
|
|
474
|
+
|
|
475
|
+
def main():
|
|
476
|
+
parser = argparse.ArgumentParser(
|
|
477
|
+
description="Analyze harness architecture and produce signals for the architect agent",
|
|
478
|
+
usage="analyze_architecture.py --harness PATH [--traces-dir PATH] [--summary PATH] [-o output.json]",
|
|
479
|
+
)
|
|
480
|
+
parser.add_argument("--harness", required=True, help="Path to harness Python file")
|
|
481
|
+
parser.add_argument("--traces-dir", default=None, help="Path to traces directory")
|
|
482
|
+
parser.add_argument("--summary", default=None, help="Path to summary.json")
|
|
483
|
+
parser.add_argument("-o", "--output", default=None, help="Output JSON path")
|
|
484
|
+
args = parser.parse_args()
|
|
485
|
+
|
|
486
|
+
if not os.path.isfile(args.harness):
|
|
487
|
+
print(json.dumps({"error": f"Harness file not found: {args.harness}"}))
|
|
488
|
+
sys.exit(1)
|
|
489
|
+
|
|
490
|
+
result = {
|
|
491
|
+
"code_signals": analyze_code(args.harness),
|
|
492
|
+
"trace_signals": None,
|
|
493
|
+
"score_signals": None,
|
|
494
|
+
}
|
|
495
|
+
|
|
496
|
+
if args.traces_dir:
|
|
497
|
+
result["trace_signals"] = analyze_traces(args.traces_dir)
|
|
498
|
+
|
|
499
|
+
if args.summary:
|
|
500
|
+
result["score_signals"] = analyze_scores(args.summary)
|
|
501
|
+
|
|
502
|
+
output = json.dumps(result, indent=2)
|
|
503
|
+
|
|
504
|
+
if args.output:
|
|
505
|
+
with open(args.output, "w") as f:
|
|
506
|
+
f.write(output + "\n")
|
|
507
|
+
else:
|
|
508
|
+
print(output)
|
|
509
|
+
|
|
510
|
+
|
|
511
|
+
if __name__ == "__main__":
|
|
512
|
+
main()
|
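The stagnation and oscillation heuristics in `analyze_scores` above can be checked in isolation; the sketch below re-implements just those two checks on synthetic score lists (the function names and sample data are illustrative, not part of the package).

```python
def detect_stagnation(scores, window=3, tolerance=0.01):
    """Mirror of analyze_scores: last `window` scores within `tolerance` of each other."""
    if len(scores) < window:
        return False
    tail = scores[-window:]
    return max(tail) - min(tail) <= tolerance

def detect_oscillation(scores):
    """Mirror of analyze_scores: every consecutive score delta flips sign."""
    if len(scores) < 4:
        return False
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    sign_changes = sum(
        1 for x, y in zip(deltas, deltas[1:])
        if (x > 0 and y < 0) or (x < 0 and y > 0)
    )
    return sign_changes >= len(deltas) - 1

# Flat plateau: spread of last 3 scores is 0.005, within the 1% tolerance
print(detect_stagnation([0.70, 0.705, 0.704]))   # True
# Zig-zag: deltas alternate +0.2 / -0.2 / +0.2
print(detect_oscillation([0.5, 0.7, 0.5, 0.7]))  # True
```

Note that a steadily improving run triggers neither check: its deltas share one sign and its tail spread exceeds the tolerance, which is exactly why the architect treats these flags as signals to change topology rather than keep iterating.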
package/tools/init.py
CHANGED
@@ -317,6 +317,29 @@ def main():
     print("\nRecommendation: install Context7 MCP for up-to-date documentation:")
     print("  claude mcp add context7 -- npx -y @upstash/context7-mcp@latest")
 
+    # Architecture analysis (quick, advisory)
+    analyze_py = os.path.join(tools, "analyze_architecture.py")
+    if os.path.exists(analyze_py):
+        try:
+            r = subprocess.run(
+                ["python3", analyze_py, "--harness", args.harness],
+                capture_output=True, text=True, timeout=30,
+            )
+            if r.returncode == 0 and r.stdout.strip():
+                arch_signals = json.loads(r.stdout)
+                config["architecture"] = {
+                    "current_topology": arch_signals.get("code_signals", {}).get("estimated_topology", "unknown"),
+                    "auto_analyzed": True,
+                }
+                # Re-write config with architecture
+                with open(os.path.join(base, "config.json"), "w") as f:
+                    json.dump(config, f, indent=2)
+                topo = config["architecture"]["current_topology"]
+                if topo != "unknown":
+                    print(f"Architecture: {topo}")
+        except Exception:
+            pass
+
     # 5. Validate baseline harness
     print("Validating baseline harness...")
     val_args = ["python3", evaluate_py, "validate",