harness-evolver 3.0.5 → 3.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +8 -4
- package/agents/evolver-evaluator.md +152 -0
- package/agents/evolver-proposer.md +28 -10
- package/bin/install.js +19 -21
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +54 -2
- package/skills/setup/SKILL.md +9 -7
- package/tools/__pycache__/detect_stack.cpython-314.pyc +0 -0
- package/tools/__pycache__/trace_logger.cpython-314.pyc +0 -0
- package/tools/run_eval.py +31 -24
- package/tools/setup.py +30 -33
- package/tools/detect_stack.py +0 -173
package/README.md
CHANGED

@@ -47,7 +47,7 @@ claude
   <table>
   <tr>
   <td><b>LangSmith-Native</b></td>
-  <td>No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and
+  <td>No custom eval scripts or task files. Uses LangSmith Datasets for test inputs, Experiments for results, and an agent-based LLM-as-judge for scoring via langsmith-cli. No external API keys needed. Everything is visible in the LangSmith UI.</td>
   </tr>
   <tr>
   <td><b>Real Code Evolution</b></td>

@@ -92,6 +92,7 @@ claude
 | **Architect** | Recommends multi-agent topology changes | Blue |
 | **Critic** | Validates evaluator quality, detects gaming | Red |
 | **TestGen** | Generates test inputs for LangSmith datasets | Cyan |
+| **Evaluator** | LLM-as-judge — reads outputs via langsmith-cli, scores correctness | Yellow |

 ---

@@ -104,7 +105,8 @@ claude
 - 1.5 Gather trace insights (cluster errors, tokens, latency)
 - 1.8 Analyze per-task failures (adaptive briefings)
 - 2. Spawn 5 proposers in parallel (each in a git worktree)
-- 3.
+- 3. Run target for each candidate (client.evaluate() -> code-based evaluators)
+- 3.5 Spawn evaluator agent (reads outputs via langsmith-cli, judges, writes scores)
 - 4. Compare experiments -> select winner + per-task champion
 - 5. Merge winning worktree into main branch
 - 5.5 Test suite growth (add regression examples to dataset)

@@ -119,13 +121,15 @@ claude
 ## Requirements

 - **LangSmith account** + `LANGSMITH_API_KEY`
-- **Python 3.10+** with `langsmith`
+- **Python 3.10+** with `langsmith` package
+- **langsmith-cli** (`uv tool install langsmith-cli`) — required for evaluator agent
 - **Git** (for worktree-based isolation)
 - **Claude Code** (or Cursor/Codex/Windsurf)

 ```bash
 export LANGSMITH_API_KEY="lsv2_pt_..."
-pip install langsmith
+pip install langsmith
+uv tool install langsmith-cli
 ```

 ---
package/agents/evolver-evaluator.md
ADDED

@@ -0,0 +1,152 @@
+---
+name: evolver-evaluator
+description: |
+  Use this agent to evaluate experiment outputs using LLM-as-judge.
+  Reads run inputs/outputs from LangSmith via langsmith-cli, judges correctness,
+  and writes scores back as feedback. No external API keys needed.
+tools: Read, Bash, Glob, Grep
+color: yellow
+---
+
+# Evolver — Evaluator Agent (v3)
+
+You are an LLM evaluation judge. Your job is to read the outputs of an experiment from LangSmith, evaluate each one for correctness, and write scores back as feedback.
+
+You ARE the LLM-as-judge. You replace the need for an external LLM API call.
+
+## Bootstrap
+
+1. Verify langsmith-cli is available:
+   ```bash
+   langsmith-cli --version
+   ```
+   If this fails, report the error and stop — langsmith-cli is required.
+
+2. Your prompt contains `<experiment>`, `<evaluators>`, and `<context>` blocks. Parse them to understand:
+   - Which experiment to evaluate
+   - What evaluation criteria to apply
+   - What the agent is supposed to do (domain context)
+
+## Tool: langsmith-cli
+
+You interact with LangSmith exclusively through `langsmith-cli`. Always use `--json` for machine-readable output.
+
+### Reading experiment outputs
+
+```bash
+langsmith-cli --json runs list \
+  --project "{experiment_name}" \
+  --fields id,inputs,outputs,error,reference_example_id \
+  --is-root \
+  --limit 200
+```
+
+This returns one JSON object per line (JSONL). Each line has:
+- `id` — the run ID (needed to write feedback)
+- `inputs` — what was sent to the agent
+- `outputs` — what the agent responded
+- `error` — error message if the run failed
+- `reference_example_id` — links back to the dataset example
+
+### Writing scores
+
+For EACH run, after judging it:
+
+```bash
+langsmith-cli --json feedback create {run_id} \
+  --key "{evaluator_key}" \
+  --score {score} \
+  --comment "{brief_reasoning}" \
+  --source model
+```
+
+Use `--source model` since this is an LLM-generated evaluation.
+
+## Your Workflow
+
+### Phase 1: Read All Outputs
+
+Fetch all runs from the experiment. Save the output to a file for reference:
+
+```bash
+langsmith-cli --json runs list \
+  --project "{experiment_name}" \
+  --fields id,inputs,outputs,error,reference_example_id \
+  --is-root --limit 200 \
+  --output experiment_runs.jsonl
+```
+
+Then read `experiment_runs.jsonl` to see all results.
+
+### Phase 2: Evaluate Each Run
+
+For each run, apply the requested evaluators. The evaluators you may be asked to judge:
+
+#### correctness
+Judge: **Is the output a correct, accurate, and complete response to the input?**
+
+Scoring:
+- `1.0` — Correct and complete. The response accurately addresses the input.
+- `0.0` — Incorrect, incomplete, or off-topic.
+
+Consider:
+- Does the response answer what was asked?
+- Is the information factually accurate?
+- Are there hallucinations or made-up facts?
+- Is the response relevant to the domain?
+
+#### conciseness
+Judge: **Is the response appropriately concise without sacrificing quality?**
+
+Scoring:
+- `1.0` — Concise and complete. No unnecessary verbosity.
+- `0.0` — Excessively verbose, repetitive, or padded.
+
+### Phase 3: Write All Scores
+
+For each run you evaluated, write feedback via `langsmith-cli feedback create`.
+
+Write scores in batches — evaluate all runs first, then write all scores. This is more efficient than alternating between reading and writing.
+
+Example for one run:
+```bash
+langsmith-cli --json feedback create "run-uuid-here" \
+  --key correctness \
+  --score 1.0 \
+  --comment "Response correctly identifies the applicable regulation and provides accurate guidance." \
+  --source model
+```
+
+### Phase 4: Summary
+
+After writing all scores, compute the aggregate:
+
+```bash
+langsmith-cli --json feedback list --run-id "{any_run_id}" --key correctness
+```
+
+## Error Handling
+
+- If a run has `error` set and empty `outputs`: score it `0.0` with comment "Run failed: {error}"
+- If a run has `outputs` but they contain an error message: score `0.0` with comment explaining the failure
+- If `outputs` is empty but no error: score `0.0` with comment "Empty output"
+
+## Rules
+
+1. **Be a fair judge** — evaluate based on the criteria, not your preferences
+2. **Brief comments** — keep feedback comments under 200 characters
+3. **Binary scoring for correctness** — use 1.0 or 0.0, not partial scores (unless instructed otherwise)
+4. **Score EVERY run** — don't skip any, even failed ones
+5. **Domain awareness** — use the `<context>` block to understand what constitutes a "correct" answer in this domain
+
+## Return Protocol
+
+When done, end your response with:
+
+## EVALUATION COMPLETE
+- **Experiment**: {experiment_name}
+- **Runs evaluated**: {N}
+- **Evaluators applied**: {list}
+- **Mean score**: {score}
+- **Pass rate**: {N}/{total} ({percent}%)
+- **Common failure patterns**: {brief list}
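The read-then-aggregate flow this agent spec describes can be sketched in Python. This is a hedged sketch, not part of the package: field names (`id`, `outputs`) follow the JSONL schema documented above, and the placeholder scoring stands in for the agent's actual LLM judgment.

```python
import json

def score_run(run):
    # Per the Error Handling rules above: failed or empty runs score 0.0.
    if not run.get("outputs"):
        return 0.0
    return 1.0  # placeholder for the agent's actual correctness judgment

def summarize(jsonl_text):
    # Aggregate the fields the "## EVALUATION COMPLETE" block reports.
    runs = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    scores = {r["id"]: score_run(r) for r in runs}
    passed = sum(1 for s in scores.values() if s >= 1.0)
    n = max(len(runs), 1)
    return {
        "runs": len(runs),
        "mean": sum(scores.values()) / n,
        "pass_rate": f"{passed}/{len(runs)} ({100 * passed // n}%)",
    }
```

Feeding it two runs, one of which errored, yields a mean of 0.5 and a pass rate of 1/2 (50%).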
package/agents/evolver-proposer.md
CHANGED

@@ -64,7 +64,34 @@ Based on your strategy and diagnosis, modify the code:
 - **Error handling**: retry logic, fallback strategies, timeout handling
 - **Model selection**: which model for which task

-
+### Phase 3.5: Consult Documentation (MANDATORY)
+
+**Before writing ANY code**, you MUST consult Context7 for every library you'll be modifying or using. This is NOT optional.
+
+**Step 1 — Identify libraries from the code you read:**
+Read the imports in the files you're about to modify. For each framework/library (LangGraph, OpenAI, Anthropic, CrewAI, etc.):
+
+**Step 2 — Resolve library ID:**
+```
+resolve-library-id(libraryName: "langgraph", query: "what you're trying to do")
+```
+This returns up to 10 matches. Pick the one with the highest relevance.
+
+**Step 3 — Query docs for your specific task:**
+```
+get-library-docs(libraryId: "/langchain-ai/langgraph", query: "conditional edges StateGraph", topic: "routing")
+```
+Ask about the SPECIFIC API you're going to use or change.
+
+**Examples of what to query:**
+- About to modify a StateGraph? → `query: "StateGraph add_conditional_edges"`
+- Changing prompt template? → `query: "ChatPromptTemplate from_messages"` for langchain
+- Adding a tool? → `query: "StructuredTool create tool definition"` for langchain
+- Changing model? → `query: "ChatOpenAI model parameters temperature"` for openai
+
+**Why this matters:** Your training data may be outdated. Libraries change APIs between versions. A quick Context7 lookup takes seconds and prevents proposing code that uses deprecated or incorrect patterns. The documentation is the source of truth, not your model knowledge.
+
+**If Context7 MCP is not available:** Note in proposal.md "API patterns not verified against current docs — verify before deploying."

 ### Phase 4: Commit and Document

@@ -99,15 +126,6 @@ If `production_seed.json` exists:

 Prioritize changes that fix real production failures over synthetic test failures.

-## Context7 — Documentation Lookup
-
-Use Context7 MCP tools proactively when:
-- Writing code that uses a library API
-- Unsure about method signatures or patterns
-- Checking if a better approach exists in the latest version
-
-If Context7 is not available, proceed with model knowledge but note in proposal.md.
-
 ## Rules

 1. **Read before writing** — understand the code before changing it
package/bin/install.js
CHANGED
@@ -2,7 +2,7 @@
 /**
  * Harness Evolver v3 installer.
  * Copies skills/agents/tools to runtime directories (GSD pattern).
- * Installs Python dependencies (langsmith
+ * Installs Python dependencies (langsmith) and langsmith-cli.
  *
  * Usage: npx harness-evolver@latest
  */

@@ -225,15 +225,15 @@ function installPythonDeps() {

   // Install/upgrade deps in the venv
   const installCommands = [
-    `uv pip install --python "${venvPython}" langsmith
-    `"${venvPip}" install --upgrade langsmith
-    `"${venvPython}" -m pip install --upgrade langsmith
+    `uv pip install --python "${venvPython}" langsmith`,
+    `"${venvPip}" install --upgrade langsmith`,
+    `"${venvPython}" -m pip install --upgrade langsmith`,
   ];

   for (const cmd of installCommands) {
     try {
       execSync(cmd, { stdio: "pipe", timeout: 120000 });
-      console.log(`  ${GREEN}✓${RESET} langsmith
+      console.log(`  ${GREEN}✓${RESET} langsmith installed in venv`);
       return true;
     } catch {
       continue;

@@ -241,7 +241,7 @@ function installPythonDeps() {
   }

   console.log(`  ${YELLOW}!${RESET} Could not install packages in venv.`);
-  console.log(`  Run manually: ${BOLD}~/.evolver/venv/bin/pip install langsmith
+  console.log(`  Run manually: ${BOLD}~/.evolver/venv/bin/pip install langsmith${RESET}`);
   return false;
 }

@@ -303,26 +303,24 @@ async function configureLangSmith(rl) {
   }
 }

-  // --- Step 2: langsmith-cli ---
+  // --- Step 2: langsmith-cli (required for evaluator agent) ---
   if (hasLangsmithCli) {
     console.log(`  ${GREEN}✓${RESET} langsmith-cli installed`);
   } else {
-    console.log(`\n  ${BOLD}langsmith-cli${RESET} —
-    console.log(`  ${DIM}
-
-
-
-
-    execSync("uv tool install langsmith-cli 2>/dev/null || pip install langsmith-cli 2>/dev/null || pip3 install langsmith-cli", { stdio: "pipe", timeout: 60000 });
-    console.log(`  ${GREEN}✓${RESET} langsmith-cli installed`);
+    console.log(`\n  ${BOLD}langsmith-cli${RESET} — ${YELLOW}required${RESET} for LLM-as-judge evaluation`);
+    console.log(`  ${DIM}The evaluator agent uses it to read experiment outputs and write scores.${RESET}`);
+    console.log(`\n  Installing langsmith-cli...`);
+    try {
+      execSync("uv tool install langsmith-cli 2>/dev/null || pip install langsmith-cli 2>/dev/null || pip3 install langsmith-cli", { stdio: "pipe", timeout: 60000 });
+      console.log(`  ${GREEN}✓${RESET} langsmith-cli installed`);

-
-
-
-    }
-  } catch {
-    console.log(`  ${YELLOW}!${RESET} Could not install. Try manually: ${DIM}uv tool install langsmith-cli${RESET}`);
+      // If we have a key, auto-authenticate
+      if (hasKey && fs.existsSync(langsmithCredsFile)) {
+        console.log(`  ${GREEN}✓${RESET} langsmith-cli auto-authenticated (credentials file exists)`);
       }
+    } catch {
+      console.log(`  ${RED}!${RESET} Could not install langsmith-cli.`);
+      console.log(`  ${BOLD}This is required.${RESET} Install manually: ${DIM}uv tool install langsmith-cli${RESET}`);
     }
   }
 }
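The installer's fallback chain above (try `uv pip`, then the venv's `pip`, then `python -m pip`, stopping at the first success) is a general pattern; a hedged Python sketch of the same idea, assuming a POSIX shell:

```python
import subprocess

def try_commands(commands, timeout=120):
    # Run each shell command in order; return True at the first success,
    # False if every command in the chain fails (mirrors install.js).
    for cmd in commands:
        try:
            subprocess.run(cmd, shell=True, check=True,
                           capture_output=True, timeout=timeout)
            return True
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired,
                OSError):
            continue
    return False
```

With `["false", "true"]` the first command fails and the second succeeds, so the chain reports success.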
package/package.json
CHANGED
package/skills/evolve/SKILL.md
CHANGED
@@ -172,7 +172,7 @@ If ALL_PASSING: D gets `creative`, E gets `efficiency`.

 Wait for all 5 to complete.

-### 3.
+### 3. Run Target for Each Candidate

 For each worktree that has changes (proposer committed something):

@@ -184,7 +184,59 @@ $EVOLVER_PY $TOOLS/run_eval.py \
   --timeout 120
 ```

-Each candidate becomes a separate LangSmith experiment.
+Each candidate becomes a separate LangSmith experiment. This step runs the agent and applies code-based evaluators (has_output, token_efficiency) only.
+
+Collect all experiment names from the output (the `"experiment"` field in each JSON output).
+
+### 3.5. LLM-as-Judge Evaluation (Evaluator Agent)
+
+Check if the config has LLM-based evaluators (correctness, conciseness):
+
+```bash
+python3 -c "import json; c=json.load(open('.evolver.json')); llm=[k for k in c['evaluators'] if k in ('correctness','conciseness')]; print(','.join(llm) if llm else '')"
+```
+
+If LLM evaluators are configured, first verify langsmith-cli is available:
+
+```bash
+command -v langsmith-cli >/dev/null 2>&1 || { echo "ERROR: langsmith-cli not found. Install with: uv tool install langsmith-cli"; exit 1; }
+```
+
+Then spawn ONE evaluator agent that scores ALL candidates in a single pass. This is more efficient than spawning one agent per candidate:
+
+```
+Agent(
+  subagent_type: "evolver-evaluator",
+  description: "Evaluate all candidates for iteration v{NNN}",
+  prompt: |
+    <experiment>
+    Evaluate the following experiments (one per candidate):
+    - {experiment_name_a}
+    - {experiment_name_b}
+    - {experiment_name_c}
+    - {experiment_name_d}
+    - {experiment_name_e}
+    </experiment>
+
+    <evaluators>
+    Apply these evaluators to each run in each experiment:
+    - {llm_evaluator_list, e.g. "correctness", "conciseness"}
+    </evaluators>
+
+    <context>
+    Agent type: {framework} agent
+    Domain: {description from .evolver.json or entry point context}
+    Entry point: {entry_point}
+
+    For each experiment:
+    1. Read all runs via: langsmith-cli --json runs list --project "{experiment_name}" --fields id,inputs,outputs,error --is-root --limit 200
+    2. Judge each run's output against the input
+    3. Write scores via: langsmith-cli --json feedback create {run_id} --key {evaluator} --score {0.0|1.0} --comment "{reason}" --source model
+    </context>
+)
+```
+
+Wait for the evaluator agent to complete before proceeding.

 ### 4. Compare All Candidates
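The `python3 -c` one-liner in the skill can be written out as a small helper. A sketch, assuming `.evolver.json` carries an `evaluators` list as produced by setup.py:

```python
import json

# Keys that require the LLM-as-judge agent rather than a code-based check.
LLM_KEYS = ("correctness", "conciseness")

def llm_evaluators(evaluator_keys):
    # Preserve config order; drop everything the code-based path handles.
    return [k for k in evaluator_keys if k in LLM_KEYS]

def pending_from_config(path=".evolver.json"):
    # Expanded form of the one-liner: read the config, filter its evaluators.
    with open(path) as f:
        return llm_evaluators(json.load(f)["evaluators"])
```

An empty result means the evaluator agent can be skipped for this iteration.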
package/skills/setup/SKILL.md
CHANGED
@@ -42,25 +42,27 @@ TOOLS=$([ -d ".evolver/tools" ] && echo ".evolver/tools" || echo "$HOME/.evolver
 EVOLVER_PY=$([ -f "$HOME/.evolver/venv/bin/python" ] && echo "$HOME/.evolver/venv/bin/python" || echo "python3")
 ```

-Use `$EVOLVER_PY` instead of `python3` for ALL tool invocations. This ensures the venv with langsmith
+Use `$EVOLVER_PY` instead of `python3` for ALL tool invocations. This ensures the venv with langsmith is used.

 ## Phase 1: Explore Project (automatic)

 ```bash
-find . -maxdepth 3 -type f -name "*.py" | head -30
-$EVOLVER_PY $TOOLS/detect_stack.py .
+find . -maxdepth 3 -type f -name "*.py" -not -path "*/.venv/*" -not -path "*/node_modules/*" -not -path "*/__pycache__/*" | head -30
 ```

+**Monorepo detection**: if the project root has multiple subdirectories with their own `main.py` or `pyproject.toml`, it's a monorepo. Use AskUserQuestion to ask WHICH app to optimize before proceeding — do NOT scan everything.
+
 Look for:
 - Entry points: files with `if __name__`, or named `main.py`, `app.py`, `agent.py`, `graph.py`, `pipeline.py`
-- Framework: LangGraph, CrewAI, OpenAI SDK, Anthropic SDK, etc.
 - Existing LangSmith config: `LANGCHAIN_PROJECT` / `LANGSMITH_PROJECT` in env or `.env`
 - Existing test data: JSON files with inputs, CSV files, etc.
 - Dependencies: `requirements.txt`, `pyproject.toml`

-
+To identify the **framework**, read the entry point file and its immediate imports. The proposer agents will use Context7 MCP for detailed documentation lookup — you don't need to detect every library, just identify the main framework (LangGraph, CrewAI, OpenAI Agents SDK, etc.) from the imports you see.
+
+Identify the **run command** — how to execute the agent:
 - `python main.py` (if it accepts `--input` flag)
--
+- The command in the project's README, Makefile, or scripts/

 ## Phase 2: Confirm Detection (interactive)

@@ -199,4 +201,4 @@ Next: run /evolver:evolve to start optimizing.
 - If `.evolver.json` already exists, ask before overwriting.
 - If the agent needs a venv, the run command should activate it: `cd {dir} && .venv/bin/python main.py`
 - If LangSmith connection fails, check API key and network.
-- The setup
+- The setup requires `langsmith` (Python SDK) and `langsmith-cli` (for evaluator agent).
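The monorepo heuristic described in Phase 1 can be sketched directly. This is a hypothetical helper, not part of the package; the markers (`main.py`, `pyproject.toml`) are the ones the skill names:

```python
from pathlib import Path

def looks_like_monorepo(root="."):
    # A root with more than one subdirectory holding its own entry point
    # or project file is treated as a monorepo; return the candidate apps
    # so the skill can ask the user which one to optimize.
    apps = [
        d for d in Path(root).iterdir()
        if d.is_dir()
        and ((d / "main.py").exists() or (d / "pyproject.toml").exists())
    ]
    return len(apps) > 1, sorted(d.name for d in apps)
```

When the flag is true, the skill stops scanning and asks the user to pick one of the returned app names.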
package/tools/__pycache__/detect_stack.cpython-314.pyc
Binary file

package/tools/__pycache__/trace_logger.cpython-314.pyc
Binary file
package/tools/run_eval.py
CHANGED
@@ -2,7 +2,9 @@
 """Run LangSmith evaluation for a candidate in a worktree.

 Wraps client.evaluate() — runs the user's agent against the dataset
-with
+with code-based evaluators only (has_output, token_efficiency).
+LLM-as-judge scoring (correctness, conciseness) is handled post-hoc
+by the evolver-evaluator agent via langsmith-cli.

 Usage:
   python3 run_eval.py \

@@ -11,7 +13,7 @@ Usage:
     --experiment-prefix v001a \
     [--timeout 120]

-Requires: pip install langsmith
+Requires: pip install langsmith
 """

 import argparse

@@ -124,34 +126,30 @@ def make_target(entry_point, cwd):


 def load_evaluators(evaluator_keys):
-    """Load evaluators
-    from openevals.llm import create_llm_as_judge
-    from openevals.prompts import CORRECTNESS_PROMPT, CONCISENESS_PROMPT
+    """Load code-based evaluators only.

+    LLM-as-judge evaluators (correctness, conciseness) are handled
+    post-hoc by the evolver-evaluator agent via langsmith-cli.
+    """
     evaluators = []
+
+    # Always include has_output — verifies the agent produced something
+    def has_output_eval(inputs, outputs, **kwargs):
+        has = bool(outputs and outputs.get("output", outputs.get("answer", "")))
+        return {"key": "has_output", "score": 1.0 if has else 0.0}
+    evaluators.append(has_output_eval)
+
     for key in evaluator_keys:
-        if key == "
-
-
-            feedback_key="correctness",
-            model="openai:gpt-4.1-mini",
-            ))
-        elif key == "conciseness":
-            evaluators.append(create_llm_as_judge(
-                prompt=CONCISENESS_PROMPT,
-                feedback_key="conciseness",
-                model="openai:gpt-4.1-mini",
-            ))
-        elif key == "latency":
-            def latency_eval(inputs, outputs, **kwargs):
-                return {"key": "has_output", "score": 1.0 if outputs else 0.0}
-            evaluators.append(latency_eval)
+        if key == "latency":
+            # Latency is captured in traces, just check output exists
+            pass  # has_output already covers this
         elif key == "token_efficiency":
             def token_eval(inputs, outputs, **kwargs):
                 output_text = str(outputs.get("output", outputs.get("answer", "")))
                 score = min(1.0, 2000 / max(len(output_text), 1))
                 return {"key": "token_efficiency", "score": score}
             evaluators.append(token_eval)
+        # correctness, conciseness — skipped, handled by evaluator agent

     return evaluators

@@ -176,10 +174,16 @@ def main():
     target = make_target(config["entry_point"], args.worktree_path)
     evaluators = load_evaluators(config["evaluators"])

+    # Identify which evaluators need the agent (LLM-as-judge)
+    llm_evaluators = [k for k in config["evaluators"] if k in ("correctness", "conciseness")]
+    code_evaluators = [k for k in config["evaluators"] if k not in ("correctness", "conciseness")]
+
     print(f"Running evaluation: {args.experiment_prefix}")
     print(f"  Dataset: {config['dataset']}")
     print(f"  Worktree: {args.worktree_path}")
-    print(f"
+    print(f"  Code evaluators: {['has_output'] + code_evaluators}")
+    if llm_evaluators:
+        print(f"  Pending LLM evaluators (agent): {llm_evaluators}")

     try:
         results = client.evaluate(

@@ -192,7 +196,7 @@ def main():

     experiment_name = results.experiment_name

-    # Calculate mean score
+    # Calculate mean score from code-based evaluators only
     scores = []
     per_example = {}
     for result in results:

@@ -218,10 +222,13 @@ def main():
         "num_examples": len(per_example),
         "num_scores": len(scores),
         "per_example": per_example,
+        "pending_llm_evaluators": llm_evaluators,
     }

     print(json.dumps(output))
-    print(f"\
+    print(f"\nTarget runs complete: {len(per_example)} examples")
+    if llm_evaluators:
+        print(f"Awaiting evaluator agent for: {llm_evaluators}")

     except Exception as e:
         print(f"Evaluation failed: {e}", file=sys.stderr)
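The two code-based evaluators this file keeps can be exercised standalone. The function bodies below repeat what the diff shows; the sample output dict is illustrative only:

```python
def has_output_eval(inputs, outputs, **kwargs):
    # Checks the "output" key, falling back to "answer", as in run_eval.py.
    has = bool(outputs and outputs.get("output", outputs.get("answer", "")))
    return {"key": "has_output", "score": 1.0 if has else 0.0}

def token_eval(inputs, outputs, **kwargs):
    # Outputs up to 2000 chars score 1.0; longer ones decay as 2000/len.
    output_text = str(outputs.get("output", outputs.get("answer", "")))
    score = min(1.0, 2000 / max(len(output_text), 1))
    return {"key": "token_efficiency", "score": score}

# A 4000-char output has output, but is only half as token-efficient.
sample = {"output": "x" * 4000}
assert has_output_eval({}, sample)["score"] == 1.0
assert token_eval({}, sample)["score"] == 0.5
```

Note that an empty output still gets token_efficiency 1.0 from this formula; catching empties is has_output's job.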
package/tools/setup.py
CHANGED
|
@@ -19,7 +19,7 @@ Usage:
|
|
|
19
19
|
[--production-project my-prod-project] \
|
|
20
20
|
[--evaluators correctness,conciseness]
|
|
21
21
|
|
|
22
|
-
Requires: pip install langsmith
|
|
22
|
+
Requires: pip install langsmith
|
|
23
23
|
"""
|
|
24
24
|
|
|
25
25
|
import argparse
|
|
@@ -78,16 +78,12 @@ def ensure_langsmith_api_key():
|
|
|
78
78
|
|
|
79
79
|
|
|
80
80
|
def check_dependencies():
|
|
81
|
-
"""Verify langsmith
|
|
81
|
+
"""Verify langsmith is installed."""
|
|
82
82
|
missing = []
|
|
83
83
|
try:
|
|
84
84
|
import langsmith # noqa: F401
|
|
85
85
|
except ImportError:
|
|
86
86
|
missing.append("langsmith")
|
|
87
|
-
try:
|
|
88
|
-
import openevals # noqa: F401
|
|
89
|
-
except ImportError:
|
|
90
|
-
missing.append("openevals")
|
|
91
87
|
return missing
|
|
92
88
|
|
|
93
89
|
|
|
@@ -177,17 +173,19 @@ def create_empty_dataset(client, dataset_name):
|
|
|
177
173
|
|
|
178
174
|
|
|
179
175
|
def get_evaluators(goals, evaluator_names=None):
|
|
180
|
-
"""Build evaluator list based on optimization goals.
|
|
181
|
-
from openevals.llm import create_llm_as_judge
|
|
182
|
-
from openevals.prompts import CORRECTNESS_PROMPT, CONCISENESS_PROMPT
|
|
176
|
+
"""Build evaluator list based on optimization goals.
|
|
183
177
|
|
|
178
|
+
Returns only code-based evaluators. LLM-as-judge evaluators
|
|
179
|
+
(correctness, conciseness) are handled post-hoc by the
|
|
180
|
+
evolver-evaluator agent via langsmith-cli.
|
|
181
|
+
"""
|
|
184
182
|
evaluators = []
|
|
185
183
|
evaluator_keys = []
|
|
186
184
|
|
|
187
|
-
# Map goals to
|
|
188
|
-
|
|
189
|
-
"accuracy":
|
|
190
|
-
"conciseness":
|
|
185
|
+
# Map goals to evaluator keys (LLM-based are recorded but not instantiated)
|
|
186
|
+
goal_to_key = {
|
|
187
|
+
"accuracy": "correctness",
|
|
188
|
+
"conciseness": "conciseness",
|
|
191
189
|
}
|
|
192
190
|
|
|
193
191
|
if evaluator_names:
|
|
@@ -195,39 +193,33 @@ def get_evaluators(goals, evaluator_names=None):
|
|
|
195
193
|
else:
|
|
196
194
|
names = []
|
|
197
195
|
for goal in goals:
|
|
198
|
-
if goal in
|
|
199
|
-
names.append(
|
|
196
|
+
if goal in goal_to_key:
|
|
197
|
+
names.append(goal_to_key[goal])
|
|
200
198
|
if not names:
|
|
201
199
|
names = ["correctness"] # default
|
|
202
200
|
|
|
201
|
+
# Record all evaluator keys (for config) but only instantiate code-based ones
|
|
203
202
|
for name in names:
|
|
204
203
|
if name in ("correctness", "accuracy"):
|
|
205
|
-
evaluators.append(create_llm_as_judge(
|
|
206
|
-
prompt=CORRECTNESS_PROMPT,
|
|
207
|
-
feedback_key="correctness",
|
|
208
|
-
model="openai:gpt-4.1-mini",
|
|
209
|
-
))
|
|
210
204
|
evaluator_keys.append("correctness")
|
|
205
|
+
# LLM-as-judge — handled by evaluator agent, not here
|
|
211
206
|
elif name in ("conciseness", "brevity"):
|
|
212
|
-
evaluators.append(create_llm_as_judge(
|
|
213
|
-
prompt=CONCISENESS_PROMPT,
|
|
214
|
-
feedback_key="conciseness",
|
|
215
|
-
model="openai:gpt-4.1-mini",
|
|
216
|
-
))
|
|
217
207
|
evaluator_keys.append("conciseness")
|
|
208
|
+
# LLM-as-judge — handled by evaluator agent, not here
|
|
209
|
+
|
|
210
|
+
# Always include has_output
|
|
211
|
+
def has_output_eval(inputs, outputs, **kwargs):
|
|
212
|
+
has = bool(outputs and outputs.get("output", outputs.get("answer", "")))
|
|
213
|
+
return {"key": "has_output", "score": 1.0 if has else 0.0}
|
|
214
|
+
evaluators.append(has_output_eval)
|
|
218
215
|
|
|
219
216
|
# Code-based evaluators for latency/tokens
|
|
220
217
|
if "latency" in goals:
|
|
221
|
-
def latency_eval(inputs, outputs, **kwargs):
|
|
222
|
-
# Latency is captured in traces, not scored here
|
|
223
|
-
return {"key": "has_output", "score": 1.0 if outputs else 0.0}
|
|
224
|
-
evaluators.append(latency_eval)
|
|
225
218
|
evaluator_keys.append("latency")
|
|
226
219
|
|
|
227
220
|
if "token_efficiency" in goals:
|
|
228
221
|
def token_eval(inputs, outputs, **kwargs):
|
|
229
222
|
output_text = str(outputs.get("output", outputs.get("answer", "")))
|
|
230
|
-
# Penalize very long outputs (>2000 chars)
|
|
231
223
|
score = min(1.0, 2000 / max(len(output_text), 1))
|
|
232
224
|
return {"key": "token_efficiency", "score": score}
|
|
233
225
|
evaluators.append(token_eval)
|
|
@@ -386,18 +378,23 @@ def main():
    print(f"Configuring evaluators for goals: {goals}")
    evaluators, evaluator_keys = get_evaluators(goals, args.evaluators)
    print(f" Active evaluators: {evaluator_keys}")
+    llm_evaluators = [k for k in evaluator_keys if k in ("correctness", "conciseness")]
+    if llm_evaluators:
+        print(f" LLM evaluators (agent-based): {llm_evaluators}")

-    # Run baseline
+    # Run baseline (code-based evaluators only; LLM scoring done by evaluator agent)
    baseline_experiment = None
    baseline_score = 0.0
    if not args.skip_baseline and count > 0:
-        print(f"Running baseline
+        print(f"Running baseline target ({count} examples)...")
        try:
            baseline_experiment, baseline_score = run_baseline(
                client, dataset_name, args.entry_point, evaluators,
            )
-            print(f" Baseline score: {baseline_score:.3f}")
+            print(f" Baseline has_output score: {baseline_score:.3f}")
            print(f" Experiment: {baseline_experiment}")
+            if llm_evaluators:
+                print(f" Note: LLM scoring pending — evaluator agent will run during /evolver:evolve")
        except Exception as e:
            print(f" Baseline evaluation failed: {e}", file=sys.stderr)
            print(" Continuing with score 0.0")
package/tools/detect_stack.py
DELETED
@@ -1,173 +0,0 @@
-#!/usr/bin/env python3
-"""Detect the technology stack of a harness by analyzing Python imports via AST.
-
-Usage:
-    detect_stack.py <file_or_directory> [-o output.json]
-
-Maps imports to known libraries and their Context7 IDs for documentation lookup.
-Stdlib-only. No external dependencies.
-"""
-
-import ast
-import json
-import os
-import sys
-
-KNOWN_LIBRARIES = {
-    "langchain": {
-        "context7_id": "/langchain-ai/langchain",
-        "display": "LangChain",
-        "modules": ["langchain", "langchain_core", "langchain_openai",
-                    "langchain_anthropic", "langchain_community"],
-    },
-    "langgraph": {
-        "context7_id": "/langchain-ai/langgraph",
-        "display": "LangGraph",
-        "modules": ["langgraph"],
-    },
-    "llamaindex": {
-        "context7_id": "/run-llama/llama_index",
-        "display": "LlamaIndex",
-        "modules": ["llama_index"],
-    },
-    "openai": {
-        "context7_id": "/openai/openai-python",
-        "display": "OpenAI Python SDK",
-        "modules": ["openai"],
-    },
-    "anthropic": {
-        "context7_id": "/anthropics/anthropic-sdk-python",
-        "display": "Anthropic Python SDK",
-        "modules": ["anthropic"],
-    },
-    "dspy": {
-        "context7_id": "/stanfordnlp/dspy",
-        "display": "DSPy",
-        "modules": ["dspy"],
-    },
-    "crewai": {
-        "context7_id": "/crewAIInc/crewAI",
-        "display": "CrewAI",
-        "modules": ["crewai"],
-    },
-    "autogen": {
-        "context7_id": "/microsoft/autogen",
-        "display": "AutoGen",
-        "modules": ["autogen"],
-    },
-    "chromadb": {
-        "context7_id": "/chroma-core/chroma",
-        "display": "ChromaDB",
-        "modules": ["chromadb"],
-    },
-    "pinecone": {
-        "context7_id": "/pinecone-io/pinecone-python-client",
-        "display": "Pinecone",
-        "modules": ["pinecone"],
-    },
-    "qdrant": {
-        "context7_id": "/qdrant/qdrant",
-        "display": "Qdrant",
-        "modules": ["qdrant_client"],
-    },
-    "weaviate": {
-        "context7_id": "/weaviate/weaviate",
-        "display": "Weaviate",
-        "modules": ["weaviate"],
-    },
-    "fastapi": {
-        "context7_id": "/fastapi/fastapi",
-        "display": "FastAPI",
-        "modules": ["fastapi"],
-    },
-    "flask": {
-        "context7_id": "/pallets/flask",
-        "display": "Flask",
-        "modules": ["flask"],
-    },
-    "pydantic": {
-        "context7_id": "/pydantic/pydantic",
-        "display": "Pydantic",
-        "modules": ["pydantic"],
-    },
-    "pandas": {
-        "context7_id": "/pandas-dev/pandas",
-        "display": "Pandas",
-        "modules": ["pandas"],
-    },
-    "numpy": {
-        "context7_id": "/numpy/numpy",
-        "display": "NumPy",
-        "modules": ["numpy"],
-    },
-}
-
-
-def detect_from_file(filepath):
-    """Analyze imports of a Python file and return detected stack."""
-    with open(filepath) as f:
-        try:
-            tree = ast.parse(f.read())
-        except SyntaxError:
-            return {}
-
-    imports = set()
-    for node in ast.walk(tree):
-        if isinstance(node, ast.Import):
-            for alias in node.names:
-                imports.add(alias.name.split(".")[0])
-        elif isinstance(node, ast.ImportFrom):
-            if node.module:
-                imports.add(node.module.split(".")[0])
-
-    detected = {}
-    for lib_key, lib_info in KNOWN_LIBRARIES.items():
-        found = imports & set(lib_info["modules"])
-        if found:
-            detected[lib_key] = {
-                "context7_id": lib_info["context7_id"],
-                "display": lib_info["display"],
-                "modules_found": sorted(found),
-            }
-
-    return detected
-
-
-def detect_from_directory(directory):
-    """Analyze all .py files in a directory and consolidate the stack."""
-    all_detected = {}
-    for root, dirs, files in os.walk(directory):
-        for f in files:
-            if f.endswith(".py"):
-                filepath = os.path.join(root, f)
-                file_detected = detect_from_file(filepath)
-                for lib_key, lib_info in file_detected.items():
-                    if lib_key not in all_detected:
-                        all_detected[lib_key] = lib_info
-                    else:
-                        existing = set(all_detected[lib_key]["modules_found"])
-                        existing.update(lib_info["modules_found"])
-                        all_detected[lib_key]["modules_found"] = sorted(existing)
-    return all_detected
-
-
-if __name__ == "__main__":
-    import argparse
-
-    parser = argparse.ArgumentParser(description="Detect stack from Python files")
-    parser.add_argument("path", help="File or directory to analyze")
-    parser.add_argument("--output", "-o", help="Output JSON path")
-    args = parser.parse_args()
-
-    if os.path.isfile(args.path):
-        result = detect_from_file(args.path)
-    else:
-        result = detect_from_directory(args.path)
-
-    output = json.dumps(result, indent=2)
-
-    if args.output:
-        with open(args.output, "w") as f:
-            f.write(output)
-    else:
-        print(output)