harness-evolver 2.6.1 → 2.8.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/agents/harness-evolver-proposer.md +18 -0
- package/agents/harness-evolver-testgen.md +15 -0
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +26 -0
- package/skills/init/SKILL.md +58 -1
- package/tools/__pycache__/init.cpython-313.pyc +0 -0
- package/tools/__pycache__/seed_from_traces.cpython-313.pyc +0 -0
- package/tools/analyze_architecture.py +56 -2
- package/tools/evaluate.py +29 -5
- package/tools/init.py +107 -16
- package/tools/seed_from_traces.py +454 -0
package/agents/harness-evolver-proposer.md CHANGED

@@ -40,6 +40,24 @@ These insights are generated from LangSmith traces cross-referenced with per-tas
 
 If trace insights are not available, proceed with manual trace analysis as described in Phase 2.
 
+## Production Insights
+
+If `.harness-evolver/production_seed.json` exists in your `<files_to_read>`, it contains **real production data** from the app's LangSmith project:
+
+- `categories` — real traffic distribution (which domains/routes get the most queries)
+- `error_patterns` — actual production errors and their frequency
+- `negative_feedback_inputs` — queries where users gave thumbs-down
+- `slow_queries` — high-latency queries that may indicate bottlenecks
+- `sample_inputs` — real user inputs grouped by category
+
+Use this data to:
+1. **Prioritize changes that fix real production failures** over synthetic test failures
+2. **Match the real traffic distribution** — if 60% of production queries are domain A, optimize for domain A
+3. **Focus on negative feedback patterns** — these are confirmed bad user experiences
+4. **Address latency outliers** — slow queries may need different routing, caching, or model selection
+
+Production data complements trace_insights.json. Trace insights show what happened in *harness evaluation runs*. Production insights show what happens in *real-world usage*.
+
 ## Context7 — Enrich Your Knowledge
 
 You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
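For orientation, here is a minimal sketch of how a proposer could consume the seed fields listed above. The file path and key names come from the hunk; the prioritization logic itself is illustrative, not part of the package.

```python
import json

# Load the production seed written by seed_from_traces.py (path per the hunk above).
with open(".harness-evolver/production_seed.json") as f:
    seed = json.load(f)

# Confirmed-bad experiences first, then the heaviest real traffic.
must_cover = seed.get("negative_feedback_inputs", [])
top_categories = sorted(seed.get("categories", {}).items(), key=lambda kv: -kv[1])[:3]

print(f"{len(must_cover)} negative-feedback queries to prioritize")
for cat, count in top_categories:
    print(f"optimize for '{cat}' ({count} production queries)")
```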
package/agents/harness-evolver-testgen.md CHANGED

@@ -36,6 +36,19 @@ Read the harness source code to understand:
 - What are its likely failure modes?
 - Are there any data files (knowledge bases, docs, etc.) that define the domain?
 
+### Phase 1.5: Use Production Traces (if available)
+
+If your prompt contains a `<production_traces>` block, this is **real data from production LangSmith traces**. This is the most valuable signal you have — real user inputs beat synthetic ones.
+
+When production traces are available:
+1. Read the traffic distribution — generate tasks proportional to real usage (if 60% of queries are domain A, 60% of tasks should cover domain A)
+2. Use actual user phrasing as inspiration — real inputs show abbreviations, typos, informal language
+3. Base edge cases on real error patterns — the errors listed are genuine failures, not imagined scenarios
+4. Prioritize negative feedback traces — these are confirmed bad experiences that MUST be covered
+5. Include slow queries as edge cases — high-latency traces may reveal timeout or complexity issues
+
+**Do NOT just copy production inputs verbatim.** Use them as inspiration to generate VARIATIONS that test the same capabilities.
+
 ### Phase 2: Design Test Distribution
 
 Plan 30 test cases with this distribution:

@@ -44,6 +57,8 @@ Plan 30 test cases with this distribution:
 - **20% Cross-Domain** (6 tasks): inputs spanning multiple categories or requiring nuanced judgment
 - **20% Adversarial** (6 tasks): misleading, ambiguous, or designed to expose weaknesses
 
+If production traces are available, adjust the distribution to match real traffic patterns instead of uniform.
+
 Ensure all categories/topics from the harness are covered.
 
 ### Phase 3: Generate Tasks
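To make point 1 of Phase 1.5 concrete, one way to turn a real traffic distribution into the 30-task plan (a sketch; the category names and percentages are invented):

```python
# Invented traffic percentages; in practice these come from the production seed.
traffic = {"billing": 60, "shipping": 25, "returns": 15}
total_tasks = 30

# Allocate tasks proportionally, then absorb rounding drift into the top category.
plan = {cat: round(total_tasks * pct / 100) for cat, pct in traffic.items()}
plan[max(traffic, key=traffic.get)] += total_tasks - sum(plan.values())

print(plan)  # {'billing': 18, 'shipping': 8, 'returns': 4}
```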
package/package.json CHANGED

package/skills/evolve/SKILL.md CHANGED

@@ -34,6 +34,27 @@ For each iteration:
 python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
 ```
 
+### 1.4. Gather Production Insights (first iteration only)
+
+On the **first iteration**, if the project has a production LangSmith project configured but no production seed yet, fetch it:
+
+```bash
+PROD_PROJECT=$(python3 -c "
+import json, os
+c = json.load(open('.harness-evolver/config.json'))
+print(c.get('eval', {}).get('production_project', ''))
+" 2>/dev/null)
+if [ -n "$PROD_PROJECT" ] && [ ! -f ".harness-evolver/production_seed.json" ] && [ -n "$LANGSMITH_API_KEY" ]; then
+  python3 $TOOLS/seed_from_traces.py \
+    --project "$PROD_PROJECT" \
+    --output-md .harness-evolver/production_seed.md \
+    --output-json .harness-evolver/production_seed.json \
+    --limit 100 2>/dev/null
+fi
+```
+
+The `production_seed.json` is included in all proposers' `<files_to_read>` so they have real-world context about how the agent is actually used in production.
+
 ### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
 
 **Run these commands unconditionally after EVERY evaluation** (including baseline). Do NOT guess project names — discover them.

@@ -255,6 +276,7 @@ Agent(
 - .harness-evolver/langsmith_stats.json (if exists)
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 

@@ -295,6 +317,7 @@ Agent(
 - .harness-evolver/langsmith_diagnosis.json (if exists)
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 

@@ -334,6 +357,7 @@ Agent(
 - .harness-evolver/langsmith_diagnosis.json (if exists)
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 

@@ -377,6 +401,7 @@ Agent(
 - .harness-evolver/harnesses/{best_version}/scores.json
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 

@@ -438,6 +463,7 @@ Agent(
 - .harness-evolver/harnesses/{best_version}/scores.json
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
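The guard in step 1.4 above, restated in Python for readers who prefer it (a sketch; it assumes the skill's `$TOOLS` directory is exported as an environment variable):

```python
import json, os, subprocess, sys

cfg = json.load(open(".harness-evolver/config.json"))
prod = cfg.get("eval", {}).get("production_project", "")
tools = os.environ.get("TOOLS", "tools")  # assumption: mirrors the skill's $TOOLS

# Fetch only once: project configured, no seed yet, API key present.
if prod and not os.path.exists(".harness-evolver/production_seed.json") \
        and os.environ.get("LANGSMITH_API_KEY"):
    subprocess.run(
        [sys.executable, os.path.join(tools, "seed_from_traces.py"),
         "--project", prod,
         "--output-md", ".harness-evolver/production_seed.md",
         "--output-json", ".harness-evolver/production_seed.json",
         "--limit", "100"],
        check=False,
    )
```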
package/skills/init/SKILL.md CHANGED

@@ -80,11 +80,21 @@ Agent(
 - /home/rp/Desktop/test-crewai/README.md
 </files_to_read>
 
+<production_traces>
+{IF .harness-evolver/production_seed.md EXISTS, paste its full contents here.
+This file contains real production inputs, traffic distribution, error patterns,
+and user feedback from LangSmith. Use it to generate REALISTIC test cases that
+match actual usage patterns instead of synthetic ones.
+
+If the file does not exist, omit this entire block.}
+</production_traces>
+
 <output>
 Create directory tasks/ (at project root) with 30 files: task_001.json through task_030.json.
 Format: {"id": "task_001", "input": "...", "metadata": {"difficulty": "easy|medium|hard", "type": "standard|edge|cross_domain|adversarial"}}
 No "expected" field needed — the judge subagent will score outputs.
 Distribution: 40% standard, 20% edge, 20% cross-domain, 20% adversarial.
+If production traces are available, match the real traffic distribution instead of uniform.
 </output>
 )
 ```

@@ -93,16 +103,60 @@ Wait for `## TESTGEN COMPLETE`. If the subagent fails or returns with no tasks,
 
 Print: "Generated {N} test cases from code analysis."
 
+If `.harness-evolver/production_seed.md` exists, also print:
+"Tasks enriched with production trace data from LangSmith."
+
 ## Phase 3: Run Init
 
+First, check if the project has a LangSmith production project configured:
+
+```bash
+# Auto-detect from env vars or .env
+PROD_PROJECT=$(python3 -c "
+import os
+for v in ('LANGCHAIN_PROJECT', 'LANGSMITH_PROJECT'):
+    p = os.environ.get(v, '')
+    if p: print(p); exit()
+for f in ('.env', '.env.local'):
+    if os.path.exists(f):
+        for line in open(f):
+            line = line.strip()
+            if '=' in line and not line.startswith('#'):
+                k, _, val = line.partition('=')
+                if k.strip() in ('LANGCHAIN_PROJECT', 'LANGSMITH_PROJECT'):
+                    print(val.strip().strip('\"').strip(\"'\"))
+                    exit()
+" 2>/dev/null)
+```
+
 ```bash
 python3 $TOOLS/init.py [directory] \
   --harness harness.py --eval eval.py --tasks tasks/ \
-  --tools-dir $TOOLS
+  --tools-dir $TOOLS \
+  ${PROD_PROJECT:+--langsmith-project "$PROD_PROJECT"}
 ```
 
 Add `--harness-config config.json` if a config exists.
 
+For **LLM-powered agents** that make real API calls (LangGraph, CrewAI, etc.) and take
+more than 30 seconds per invocation, increase the validation timeout:
+
+```bash
+python3 $TOOLS/init.py [directory] \
+  --harness harness.py --eval eval.py --tasks tasks/ \
+  --tools-dir $TOOLS \
+  --validation-timeout 120
+```
+
+If validation keeps timing out but you've verified the harness works manually, skip it:
+
+```bash
+python3 $TOOLS/init.py [directory] \
+  --harness harness.py --eval eval.py --tasks tasks/ \
+  --tools-dir $TOOLS \
+  --skip-validation
+```
+
 ## After Init — Report
 
 - What was detected vs created

@@ -132,3 +186,6 @@ This is advisory only — do not spawn the architect agent.
 - The `expected` field is never shown to the harness — only the eval script sees it.
 - If `.harness-evolver/` already exists, warn before overwriting.
 - If no Python files exist in CWD, the user is probably in the wrong directory.
+- **Monorepo / venv mismatch**: In monorepos with dedicated venvs per app, the system `python3` may differ from the project's Python version. The harness wrapper should re-exec with the correct venv Python. The tools now use `sys.executable` instead of hardcoded `python3`.
+- **Stale site-packages**: If the project uses editable installs (`pip install -e .`), packages in `site-packages/` may have stale copies of data files (e.g. registry YAMLs). Run `uv pip install -e . --force-reinstall --no-deps` to sync.
+- **Validation timeout**: LLM agents making real API calls typically take 15-60s per invocation. Use `--validation-timeout 120` or `--skip-validation` to handle this.

package/tools/__pycache__/init.cpython-313.pyc CHANGED (Binary file)

package/tools/__pycache__/seed_from_traces.cpython-313.pyc CHANGED (Binary file)
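The `<output>` contract in the testgen prompt above is easy to get subtly wrong, so here is a minimal sketch of a conforming task file (the sample input is invented):

```python
import json, os

os.makedirs("tasks", exist_ok=True)
task = {
    "id": "task_001",
    "input": "Where is my refund for order #4821?",  # invented sample
    "metadata": {"difficulty": "medium", "type": "standard"},
}
with open(os.path.join("tasks", "task_001.json"), "w") as f:
    json.dump(task, f, indent=2)
```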
package/tools/analyze_architecture.py CHANGED

@@ -472,12 +472,60 @@ def analyze_scores(summary_path):
 
 # --- Main ---
 
+def analyze_multiple(file_paths):
+    """Analyze multiple Python files and merge their signals.
+
+    Useful in monorepo setups where the harness is a thin wrapper that
+    delegates to the actual agent code. Pass the harness AND the main
+    agent source files for a comprehensive topology classification.
+    """
+    merged = {
+        "llm_call_count": 0,
+        "has_loop_around_llm": False,
+        "has_tool_definitions": False,
+        "has_retrieval": False,
+        "has_graph_framework": False,
+        "has_parallel_execution": False,
+        "has_error_handling": False,
+        "code_lines": 0,
+        "function_count": 0,
+        "class_count": 0,
+        "files_analyzed": [],
+    }
+
+    for path in file_paths:
+        if not os.path.isfile(path):
+            continue
+        try:
+            signals = analyze_code(path)
+        except Exception:
+            continue
+
+        merged["llm_call_count"] += signals.get("llm_call_count", 0)
+        merged["code_lines"] += signals.get("code_lines", 0)
+        merged["function_count"] += signals.get("function_count", 0)
+        merged["class_count"] += signals.get("class_count", 0)
+        merged["files_analyzed"].append(os.path.basename(path))
+
+        for bool_key in ["has_loop_around_llm", "has_tool_definitions", "has_retrieval",
+                         "has_graph_framework", "has_parallel_execution", "has_error_handling"]:
+            if signals.get(bool_key):
+                merged[bool_key] = True
+
+    merged["estimated_topology"] = _estimate_topology(merged)
+    return merged
+
+
 def main():
     parser = argparse.ArgumentParser(
         description="Analyze harness architecture and produce signals for the architect agent",
-        usage="analyze_architecture.py --harness PATH [--
+        usage="analyze_architecture.py --harness PATH [--source-files PATH ...] "
+              "[--traces-dir PATH] [--summary PATH] [-o output.json]",
     )
     parser.add_argument("--harness", required=True, help="Path to harness Python file")
+    parser.add_argument("--source-files", nargs="*", default=None,
+                        help="Additional source files to analyze (e.g. the actual agent code). "
+                             "Useful when the harness is a thin wrapper around a larger system.")
     parser.add_argument("--traces-dir", default=None, help="Path to traces directory")
     parser.add_argument("--summary", default=None, help="Path to summary.json")
     parser.add_argument("-o", "--output", default=None, help="Output JSON path")

@@ -487,8 +535,14 @@ def main():
         print(json.dumps({"error": f"Harness file not found: {args.harness}"}))
         sys.exit(1)
 
+    if args.source_files:
+        all_files = [args.harness] + [f for f in args.source_files if os.path.isfile(f)]
+        code_signals = analyze_multiple(all_files)
+    else:
+        code_signals = analyze_code(args.harness)
+
     result = {
-        "code_signals":
+        "code_signals": code_signals,
         "trace_signals": None,
         "score_signals": None,
     }
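A sketch of how the new `--source-files` flag might be driven. The paths are hypothetical, and it assumes `-o` writes the result JSON with the `code_signals`, `files_analyzed`, and `estimated_topology` keys seen in the hunks above:

```python
import json, subprocess, sys

subprocess.run(
    [sys.executable, "package/tools/analyze_architecture.py",
     "--harness", "harness.py",
     "--source-files", "app/agent.py", "app/graph.py",  # hypothetical paths
     "-o", "architecture.json"],
    check=True,
)
signals = json.load(open("architecture.json"))["code_signals"]
print(signals["files_analyzed"], signals["estimated_topology"])
```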
package/tools/evaluate.py CHANGED

@@ -2,7 +2,7 @@
 """Evaluation orchestrator for Harness Evolver.
 
 Commands:
-  validate --harness PATH [--config PATH]
+  validate --harness PATH [--config PATH] [--timeout SECONDS]
   run --harness PATH --tasks-dir PATH --eval PATH --traces-dir PATH --scores PATH
       [--config PATH] [--timeout SECONDS]
 

@@ -20,9 +20,23 @@ import tempfile
 import time
 
 
+def _resolve_python():
+    """Resolve the Python interpreter to use for subprocesses.
+
+    Prefers the current interpreter (sys.executable) over a hardcoded 'python3'.
+    This is critical in monorepo setups where the harness may need a specific
+    venv Python (e.g. Python 3.12) while the system 'python3' is a different
+    version (e.g. 3.14) with incompatible site-packages.
+    """
+    exe = sys.executable
+    if exe and os.path.isfile(exe):
+        return exe
+    return "python3"
+
+
 def _run_harness_on_task(harness, config, task_input_path, output_path, task_traces_dir, timeout, env=None):
     """Run the harness on a single task. Returns (success, elapsed_ms, stdout, stderr)."""
-    cmd = [
+    cmd = [_resolve_python(), harness, "--input", task_input_path, "--output", output_path]
     if task_traces_dir:
         extra_dir = os.path.join(task_traces_dir, "extra")
         os.makedirs(extra_dir, exist_ok=True)
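To see the mismatch `_resolve_python` guards against, compare the two interpreters from inside an activated venv (the paths in the comments are illustrative):

```python
import shutil, sys

# Inside an activated venv, these can differ:
print(sys.executable)           # e.g. /repo/apps/foo/.venv/bin/python3.12
print(shutil.which("python3"))  # e.g. /usr/bin/python3 (the system interpreter)
```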
@@ -48,6 +62,7 @@ def _run_harness_on_task(harness, config, task_input_path, output_path, task_tra
 def cmd_validate(args):
     harness = args.harness
     config = getattr(args, "config", None)
+    timeout = getattr(args, "timeout", 30) or 30
 
     if not os.path.exists(harness):
         print(f"FAIL: harness not found: {harness}", file=sys.stderr)

@@ -61,11 +76,17 @@ def cmd_validate(args):
         json.dump(dummy_task, f)
 
     success, elapsed, stdout, stderr = _run_harness_on_task(
-        harness, config, input_path, output_path, None, timeout=
+        harness, config, input_path, output_path, None, timeout=timeout,
     )
 
     if not success:
-
+        hint = ""
+        if "TIMEOUT" in stderr:
+            hint = (f"\nHint: validation timed out after {timeout}s. "
+                    "For LLM-powered agents that make real API calls, "
+                    "use --timeout to increase the limit: "
+                    f"evaluate.py validate --harness {harness} --timeout 120")
+        print(f"FAIL: harness exited with error.\nstderr: {stderr}{hint}", file=sys.stderr)
         sys.exit(1)
 
     if not os.path.exists(output_path):

@@ -171,7 +192,7 @@ def cmd_run(args):
         f.write("\n".join(all_stderr))
 
     eval_cmd = [
-
+        _resolve_python(), eval_script,
         "--results-dir", results_dir,
         "--tasks-dir", tasks_dir,
         "--scores", scores_path,

@@ -195,6 +216,9 @@ def main():
     p_val = sub.add_parser("validate")
     p_val.add_argument("--harness", required=True)
     p_val.add_argument("--config", default=None)
+    p_val.add_argument("--timeout", type=int, default=30,
+                       help="Validation timeout in seconds (default: 30). "
+                            "Increase for LLM-powered agents that make real API calls.")
 
     p_run = sub.add_parser("run")
     p_run.add_argument("--harness", required=True)
package/tools/init.py CHANGED

@@ -124,6 +124,40 @@ def _detect_langsmith():
     return {"enabled": False}
 
 
+def _detect_langsmith_project(search_dir="."):
+    """Auto-detect the app's existing LangSmith project name.
+
+    Checks (in order):
+    1. LANGCHAIN_PROJECT env var (standard LangChain convention)
+    2. LANGSMITH_PROJECT env var (alternative)
+    3. .env file in the project directory
+    """
+    for var in ("LANGCHAIN_PROJECT", "LANGSMITH_PROJECT"):
+        project = os.environ.get(var)
+        if project:
+            return project
+
+    # Parse .env file
+    for env_name in (".env", ".env.local"):
+        env_path = os.path.join(search_dir, env_name)
+        if os.path.exists(env_path):
+            try:
+                with open(env_path) as f:
+                    for line in f:
+                        line = line.strip()
+                        if line.startswith("#") or "=" not in line:
+                            continue
+                        key, _, val = line.partition("=")
+                        key = key.strip()
+                        val = val.strip().strip("'\"")
+                        if key in ("LANGCHAIN_PROJECT", "LANGSMITH_PROJECT") and val:
+                            return val
+            except OSError:
+                pass
+
+    return None
+
+
 def _check_langsmith_cli():
     """Check if langsmith-cli is installed."""
     try:

@@ -134,6 +168,19 @@ def _check_langsmith_cli():
     return False
 
 
+def _resolve_python():
+    """Resolve the Python interpreter for subprocesses.
+
+    Uses the current interpreter (sys.executable) instead of hardcoded 'python3'.
+    This prevents version mismatches in monorepo setups where the harness may
+    need a specific venv Python different from the system python3.
+    """
+    exe = sys.executable
+    if exe and os.path.isfile(exe):
+        return exe
+    return "python3"
+
+
 def _detect_stack(harness_path):
     """Detect technology stack from harness imports."""
     detect_stack_py = os.path.join(os.path.dirname(__file__), "detect_stack.py")
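The `.env` handling in `_detect_langsmith_project` above skips comments and strips quotes, with env vars taking precedence over the file. A self-contained demonstration of that parsing (same logic, temp file):

```python
import os, tempfile

env_text = "# production tracing\nLANGCHAIN_PROJECT='my-prod-app'\n"

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, ".env")
    with open(path, "w") as f:
        f.write(env_text)
    for line in open(path):
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # comments and non-assignments are skipped
        key, _, val = line.partition("=")
        if key.strip() in ("LANGCHAIN_PROJECT", "LANGSMITH_PROJECT"):
            print(val.strip().strip("'\""))  # -> my-prod-app
```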
@@ -141,7 +188,7 @@ def _detect_stack(harness_path):
         return {}
     try:
         r = subprocess.run(
-            [
+            [_resolve_python(), detect_stack_py, harness_path],
             capture_output=True, text=True, timeout=30,
         )
         if r.returncode == 0 and r.stdout.strip():

@@ -183,6 +230,15 @@ def main():
     parser.add_argument("--base-dir", default=None, help="Path for .harness-evolver/")
     parser.add_argument("--harness-config", default=None, help="Path to harness config.json")
     parser.add_argument("--tools-dir", default=None, help="Path to tools directory")
+    parser.add_argument("--validation-timeout", type=int, default=30,
+                        help="Timeout for harness validation in seconds (default: 30). "
+                             "Increase for LLM-powered agents that make real API calls.")
+    parser.add_argument("--skip-validation", action="store_true",
+                        help="Skip harness validation step. Use when you know the harness "
+                             "works but validation times out (e.g. real LLM agent calls).")
+    parser.add_argument("--langsmith-project", default=None,
+                        help="Existing LangSmith project name with production traces. "
+                             "Auto-detected from LANGCHAIN_PROJECT / LANGSMITH_PROJECT env vars or .env file.")
     args = parser.parse_args()
 
     # Auto-detect missing args

@@ -261,6 +317,7 @@ def main():
             "args": ["--results-dir", "{results_dir}", "--tasks-dir", "{tasks_dir}",
                      "--scores", "{scores}"],
             "langsmith": _detect_langsmith(),
+            "production_project": args.langsmith_project or _detect_langsmith_project(search_dir),
         },
         "evolution": {
             "max_iterations": 10,

@@ -309,7 +366,7 @@ def main():
     if os.path.exists(detect_stack_py):
         try:
             r = subprocess.run(
-                [
+                [_resolve_python(), detect_stack_py, harness_dir],
                 capture_output=True, text=True, timeout=30,
             )
             if r.returncode == 0 and r.stdout.strip():

@@ -338,7 +395,7 @@ def main():
     if os.path.exists(analyze_py):
         try:
             r = subprocess.run(
-                [
+                [_resolve_python(), analyze_py, "--harness", args.harness],
                 capture_output=True, text=True, timeout=30,
             )
             if r.returncode == 0 and r.stdout.strip():

@@ -356,31 +413,65 @@ def main():
         except Exception:
             pass
 
+    # 4.5 Fetch production traces seed (if LangSmith production project detected)
+    prod_project = config["eval"].get("production_project")
+    if prod_project and os.environ.get("LANGSMITH_API_KEY"):
+        seed_py = os.path.join(tools, "seed_from_traces.py")
+        if os.path.exists(seed_py):
+            print(f"Fetching production traces from LangSmith project '{prod_project}'...")
+            try:
+                r = subprocess.run(
+                    [_resolve_python(), seed_py,
+                     "--project", prod_project,
+                     "--output-md", os.path.join(base, "production_seed.md"),
+                     "--output-json", os.path.join(base, "production_seed.json"),
+                     "--limit", "100"],
+                    capture_output=True, text=True, timeout=60,
+                )
+                if r.returncode == 0:
+                    print(r.stdout.strip())
+                else:
+                    print(f"  Could not fetch production traces: {r.stderr.strip()[:200]}")
+            except Exception as e:
+                print(f"  Production trace fetch failed: {e}")
+    elif prod_project:
+        print(f"Production LangSmith project detected: {prod_project}")
+        print("  Set LANGSMITH_API_KEY to auto-fetch production traces during init.")
+
     # 5. Validate baseline harness
-    print("Validating baseline harness...")
-    val_args = ["python3", evaluate_py, "validate",
-                "--harness", os.path.join(base, "baseline", "harness.py")]
     config_path = os.path.join(base, "baseline", "config.json")
-    if
-
-
-
-
-
-
+    if args.skip_validation:
+        print("Skipping baseline validation (--skip-validation).")
+    else:
+        print(f"Validating baseline harness (timeout: {args.validation_timeout}s)...")
+        val_args = [_resolve_python(), evaluate_py, "validate",
+                    "--harness", os.path.join(base, "baseline", "harness.py"),
+                    "--timeout", str(args.validation_timeout)]
+        if os.path.exists(config_path):
+            val_args.extend(["--config", config_path])
+        r = subprocess.run(val_args, capture_output=True, text=True)
+        if r.returncode != 0:
+            hint = ""
+            if "TIMEOUT" in r.stderr:
+                hint = (f"\n\nHint: The harness timed out after {args.validation_timeout}s. "
+                        "This is common for LLM-powered agents that make real API calls.\n"
+                        "Try: --validation-timeout 120 (or --skip-validation to bypass)")
+            print(f"FAIL: baseline harness validation failed.\n{r.stderr}{hint}", file=sys.stderr)
+            sys.exit(1)
+        print(r.stdout.strip())
 
     # 6. Evaluate baseline
     print("Evaluating baseline harness...")
     baseline_traces = tempfile.mkdtemp()
     baseline_scores = os.path.join(base, "baseline_scores.json")
     eval_args = [
-
+        _resolve_python(), evaluate_py, "run",
         "--harness", os.path.join(base, "baseline", "harness.py"),
         "--tasks-dir", os.path.join(base, "eval", "tasks"),
         "--eval", os.path.join(base, "eval", "eval.py"),
         "--traces-dir", baseline_traces,
         "--scores", baseline_scores,
-        "--timeout",
+        "--timeout", str(max(args.validation_timeout, 60)),
     ]
     if os.path.exists(config_path):
         eval_args.extend(["--config", config_path])

@@ -399,7 +490,7 @@ def main():
     # 7. Initialize state with baseline score
     print(f"Baseline score: {baseline_score:.2f}")
     r = subprocess.run(
-        [
+        [_resolve_python(), state_py, "init",
          "--base-dir", base,
          "--baseline-score", str(baseline_score)],
         capture_output=True, text=True,
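After init, the new `production_project` key lands in the config's `eval` section, which is what the evolve skill's step 1.4 reads back. A sketch (assumes init has already run):

```python
import json

cfg = json.load(open(".harness-evolver/config.json"))
# Written by init.py per the hunk above; None/empty when nothing was detected.
print(cfg.get("eval", {}).get("production_project"))
```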
package/tools/seed_from_traces.py ADDED

@@ -0,0 +1,454 @@
+#!/usr/bin/env python3
+"""Fetch and summarize production LangSmith traces for Harness Evolver.
+
+Queries the LangSmith REST API directly (urllib, stdlib-only) to fetch
+production traces and produce:
+1. A markdown seed file for the testgen agent (production_seed.md)
+2. A JSON summary for programmatic use (production_seed.json)
+
+Usage:
+    python3 seed_from_traces.py \
+        --project ceppem-langgraph \
+        --output-md .harness-evolver/production_seed.md \
+        --output-json .harness-evolver/production_seed.json \
+        [--api-key-env LANGSMITH_API_KEY] \
+        [--limit 100]
+
+Stdlib-only. No external dependencies (no langsmith-cli needed).
+"""
+
+import argparse
+import json
+import os
+import sys
+import urllib.parse
+import urllib.request
+from collections import Counter
+from datetime import datetime, timezone
+
+LANGSMITH_API_BASE = "https://api.smith.langchain.com/api/v1"
+
+
+def langsmith_request(endpoint, api_key, method="GET", body=None, params=None):
+    """Make a request to the LangSmith REST API."""
+    url = f"{LANGSMITH_API_BASE}/{endpoint}"
+    if params:
+        url += "?" + urllib.parse.urlencode(params)
+
+    headers = {
+        "x-api-key": api_key,
+        "Accept": "application/json",
+    }
+
+    data = None
+    if body is not None:
+        headers["Content-Type"] = "application/json"
+        data = json.dumps(body).encode("utf-8")
+
+    req = urllib.request.Request(url, data=data, headers=headers, method=method)
+    try:
+        with urllib.request.urlopen(req, timeout=30) as resp:
+            return json.loads(resp.read())
+    except urllib.error.HTTPError as e:
+        body_text = ""
+        try:
+            body_text = e.read().decode("utf-8", errors="replace")[:500]
+        except Exception:
+            pass
+        print(f"LangSmith API error {e.code}: {body_text}", file=sys.stderr)
+        return None
+    except Exception as e:
+        print(f"LangSmith API request failed: {e}", file=sys.stderr)
+        return None
+
+
+def fetch_runs(project_name, api_key, limit=100):
+    """Fetch recent root runs from a LangSmith project."""
+    # Try POST /runs/query first (newer API)
+    body = {
+        "project_name": project_name,
+        "is_root": True,
+        "limit": limit,
+    }
+    result = langsmith_request("runs/query", api_key, method="POST", body=body)
+    if result and isinstance(result, dict):
+        return result.get("runs", result.get("results", []))
+    if result and isinstance(result, list):
+        return result
+
+    # Fallback: GET /runs with query params
+    params = {
+        "project_name": project_name,
+        "is_root": "true",
+        "limit": str(limit),
+    }
+    result = langsmith_request("runs", api_key, params=params)
+    if result and isinstance(result, list):
+        return result
+    if result and isinstance(result, dict):
+        return result.get("runs", result.get("results", []))
+
+    return []
+
+
+def extract_input(run):
+    """Extract user input from a run's inputs field."""
+    inputs = run.get("inputs", {})
+    if not inputs:
+        return None
+    if isinstance(inputs, str):
+        return inputs
+
+    # Direct field
+    for key in ("input", "question", "query", "prompt", "text", "user_input"):
+        if key in inputs and isinstance(inputs[key], str):
+            return inputs[key]
+
+    # LangChain messages format
+    messages = inputs.get("messages") or inputs.get("input")
+    if isinstance(messages, list):
+        if messages and isinstance(messages[0], list):
+            messages = messages[0]
+        for msg in messages:
+            if isinstance(msg, dict):
+                if msg.get("type") in ("human", "HumanMessage") or msg.get("role") == "user":
+                    content = msg.get("content", "")
+                    if isinstance(content, str) and content:
+                        return content
+                    if isinstance(content, list):
+                        for part in content:
+                            if isinstance(part, dict) and part.get("type") == "text":
+                                return part.get("text", "")
+            elif isinstance(msg, str) and msg:
+                return msg
+
+    return None
+
+
+def extract_output(run):
+    """Extract the output/response from a run."""
+    outputs = run.get("outputs", {})
+    if not outputs:
+        return None
+    if isinstance(outputs, str):
+        return outputs
+
+    for key in ("output", "answer", "result", "response", "text"):
+        if key in outputs and isinstance(outputs[key], str):
+            return outputs[key]
+
+    # LangChain messages format
+    messages = outputs.get("messages") or outputs.get("output")
+    if isinstance(messages, list):
+        if messages and isinstance(messages[0], list):
+            messages = messages[0]
+        for msg in reversed(messages):
+            if isinstance(msg, dict):
+                if msg.get("type") in ("ai", "AIMessage", "assistant") or msg.get("role") == "assistant":
+                    content = msg.get("content", "")
+                    if isinstance(content, str) and content:
+                        return content
+            elif isinstance(msg, str) and msg:
+                return msg
+
+    return None
+
+
+def get_feedback(run):
+    """Extract feedback from a run."""
+    fb = run.get("feedback_stats") or {}
+    if isinstance(fb, dict):
+        pos = fb.get("thumbs_up", 0) or fb.get("positive", 0) or 0
+        neg = fb.get("thumbs_down", 0) or fb.get("negative", 0) or 0
+        if neg > 0:
+            return "negative"
+        if pos > 0:
+            return "positive"
+    return None
+
+
+def categorize_run(run):
+    """Categorize a run by its name/type."""
+    name = run.get("name", "unknown")
+    # Use top-level run name as category
+    return name
+
+
+def analyze_runs(runs):
+    """Analyze a batch of runs and produce structured insights."""
+    if not runs:
+        return None
+
+    processed = []
+    categories = Counter()
+    errors = []
+    latencies = []
+    token_counts = []
+    feedbacks = {"positive": 0, "negative": 0, "none": 0}
+
+    for run in runs:
+        user_input = extract_input(run)
+        output = extract_output(run)
+        error = run.get("error")
+        tokens = run.get("total_tokens") or 0
+        latency_ms = None
+        feedback = get_feedback(run)
+
+        # Calculate latency from start/end times
+        start = run.get("start_time") or run.get("start_dt")
+        end = run.get("end_time") or run.get("end_dt")
+        if isinstance(start, str) and isinstance(end, str):
+            try:
+                from datetime import datetime as dt
+                s = dt.fromisoformat(start.replace("Z", "+00:00"))
+                e = dt.fromisoformat(end.replace("Z", "+00:00"))
+                latency_ms = int((e - s).total_seconds() * 1000)
+            except Exception:
+                pass
+        elif run.get("latency"):
+            latency_ms = int(run["latency"] * 1000) if isinstance(run["latency"], float) else run["latency"]
+
+        category = categorize_run(run)
+        categories[category] += 1
+
+        entry = {
+            "input": (user_input or "")[:500],
+            "output": (output or "")[:300],
+            "category": category,
+            "tokens": tokens,
+            "latency_ms": latency_ms,
+            "error": (error or "")[:200] if error else None,
+            "feedback": feedback,
+        }
+        processed.append(entry)
+
+        if error:
+            errors.append({"error": error[:200], "input": (user_input or "")[:200], "category": category})
+        if latency_ms:
+            latencies.append(latency_ms)
+        if tokens:
+            token_counts.append(tokens)
+
+        if feedback == "positive":
+            feedbacks["positive"] += 1
+        elif feedback == "negative":
+            feedbacks["negative"] += 1
+        else:
+            feedbacks["none"] += 1
+
+    # Compute statistics
+    stats = {
+        "total_traces": len(runs),
+        "with_input": sum(1 for p in processed if p["input"]),
+        "with_error": len(errors),
+        "error_rate": len(errors) / max(len(runs), 1),
+        "feedback": feedbacks,
+    }
+
+    if latencies:
+        latencies.sort()
+        stats["latency"] = {
+            "avg_ms": int(sum(latencies) / len(latencies)),
+            "p50_ms": latencies[len(latencies) // 2],
+            "p95_ms": latencies[int(len(latencies) * 0.95)] if len(latencies) >= 20 else latencies[-1],
+            "max_ms": latencies[-1],
+        }
+
+    if token_counts:
+        stats["tokens"] = {
+            "avg": int(sum(token_counts) / len(token_counts)),
+            "max": max(token_counts),
+            "total": sum(token_counts),
+        }
+
+    # Group by category
+    by_category = {}
+    for entry in processed:
+        cat = entry["category"]
+        by_category.setdefault(cat, []).append(entry)
+
+    # Error patterns
+    error_patterns = Counter()
+    for e in errors:
+        # Normalize error to first 60 chars
+        pattern = e["error"][:60]
+        error_patterns[pattern] += 1
+
+    return {
+        "stats": stats,
+        "categories": dict(categories.most_common()),
+        "by_category": by_category,
+        "error_patterns": dict(error_patterns.most_common(10)),
+        "errors": errors[:20],
+        "processed": processed,
+    }
+
+
+def generate_markdown_seed(analysis, project_name):
+    """Generate a markdown seed file for the testgen agent."""
+    stats = analysis["stats"]
+    lines = [
+        f"# Production Trace Analysis: {project_name}",
+        "",
+        f"*{stats['total_traces']} traces analyzed*",
+        "",
+        "## Key Metrics",
+        "",
+        f"- **Error rate**: {stats['error_rate']:.1%}",
+    ]
+
+    if "latency" in stats:
+        lat = stats["latency"]
+        lines.append(f"- **Latency**: {lat['avg_ms']}ms avg, {lat['p50_ms']}ms p50, {lat['p95_ms']}ms p95")
+
+    if "tokens" in stats:
+        tok = stats["tokens"]
+        lines.append(f"- **Tokens**: {tok['avg']} avg, {tok['max']} max")
+
+    fb = stats["feedback"]
+    total_fb = fb["positive"] + fb["negative"]
+    if total_fb > 0:
+        lines.append(f"- **User feedback**: {fb['positive']}/{total_fb} positive ({fb['positive']/total_fb:.0%})")
+
+    # Traffic distribution
+    lines.extend(["", "## Traffic Distribution", ""])
+    total = stats["total_traces"]
+    for cat, count in sorted(analysis["categories"].items(), key=lambda x: -x[1]):
+        pct = count / max(total, 1) * 100
+        lines.append(f"- **{cat}**: {count} traces ({pct:.0f}%)")
+
+    # Sample inputs by category
+    lines.extend(["", "## Sample Inputs by Category", ""])
+    for cat, entries in sorted(analysis["by_category"].items(), key=lambda x: -len(x[1])):
+        lines.append(f"### {cat} ({len(entries)} traces)")
+        lines.append("")
+        # Show up to 8 sample inputs per category
+        shown = 0
+        for entry in entries:
+            if not entry["input"] or shown >= 8:
+                break
+            status = "ERROR" if entry["error"] else "ok"
+            tok_str = f", {entry['tokens']}tok" if entry["tokens"] else ""
+            lat_str = f", {entry['latency_ms']}ms" if entry["latency_ms"] else ""
+            fb_str = ""
+            if entry["feedback"] == "negative":
+                fb_str = " [NEGATIVE FEEDBACK]"
+            elif entry["feedback"] == "positive":
+                fb_str = " [+]"
+            lines.append(f'- "{entry["input"][:150]}" ({status}{tok_str}{lat_str}){fb_str}')
+            shown += 1
+        lines.append("")
+
+    # Error patterns
+    if analysis["error_patterns"]:
+        lines.extend(["## Error Patterns", ""])
+        for pattern, count in analysis["error_patterns"].items():
+            lines.append(f"- **{pattern}**: {count} occurrences")
+        lines.append("")
+
+    # Negative feedback traces
+    neg_traces = [e for e in analysis["processed"] if e["feedback"] == "negative" and e["input"]]
+    if neg_traces:
+        lines.extend(["## Traces with Negative Feedback (high priority)", ""])
+        for entry in neg_traces[:10]:
+            lines.append(f'- "{entry["input"][:200]}" → category: {entry["category"]}')
+        lines.append("")
+
+    # Guidance for testgen
+    lines.extend([
+        "## Guidance for Test Generation",
+        "",
+        "Use the above data to generate test cases that:",
+        "1. **Match the real traffic distribution** — generate more tasks for high-traffic categories",
+        "2. **Include actual user phrasing** — real inputs show how users actually communicate (informal, abbreviations, typos)",
+        "3. **Cover real error patterns** — the errors above are genuine failure modes, not imagined scenarios",
+        "4. **Prioritize negative feedback traces** — these are confirmed bad experiences",
+        "5. **Include slow queries as edge cases** — high-latency traces may reveal timeout or complexity issues",
+    ])
+
+    return "\n".join(lines)
+
+
+def generate_json_summary(analysis, project_name):
+    """Generate a JSON summary for programmatic use."""
+    return {
+        "project": project_name,
+        "generated_at": datetime.now(timezone.utc).isoformat(),
+        "stats": analysis["stats"],
+        "categories": analysis["categories"],
+        "error_patterns": analysis["error_patterns"],
+        "sample_inputs": {
+            cat: [e["input"] for e in entries if e["input"]][:10]
+            for cat, entries in analysis["by_category"].items()
+        },
+        "negative_feedback_inputs": [
+            e["input"] for e in analysis["processed"]
+            if e["feedback"] == "negative" and e["input"]
+        ][:20],
+        "slow_queries": [
+            {"input": e["input"][:200], "latency_ms": e["latency_ms"], "category": e["category"]}
+            for e in sorted(analysis["processed"], key=lambda x: -(x["latency_ms"] or 0))
+            if e["latency_ms"] and e["input"]
+        ][:10],
+    }
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Fetch and summarize production LangSmith traces")
+    parser.add_argument("--project", required=True, help="LangSmith project name")
+    parser.add_argument("--api-key-env", default="LANGSMITH_API_KEY",
+                        help="Env var containing API key (default: LANGSMITH_API_KEY)")
+    parser.add_argument("--limit", type=int, default=100, help="Max traces to fetch (default: 100)")
+    parser.add_argument("--output-md", required=True, help="Output path for markdown seed")
+    parser.add_argument("--output-json", required=True, help="Output path for JSON summary")
+    args = parser.parse_args()
+
+    api_key = os.environ.get(args.api_key_env, "")
+    if not api_key:
+        print(f"No API key found in ${args.api_key_env} — cannot fetch production traces", file=sys.stderr)
+        sys.exit(1)
+
+    print(f"Fetching up to {args.limit} traces from LangSmith project '{args.project}'...")
+    runs = fetch_runs(args.project, api_key, args.limit)
+
+    if not runs:
+        print("No traces found. The project may be empty or the name may be wrong.")
+        # Write empty files so downstream doesn't break
+        for path in [args.output_md, args.output_json]:
+            os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
+        with open(args.output_md, "w") as f:
+            f.write(f"# Production Trace Analysis: {args.project}\n\nNo traces found.\n")
+        with open(args.output_json, "w") as f:
+            json.dump({"project": args.project, "stats": {"total_traces": 0}}, f, indent=2)
+        return
+
+    print(f"Fetched {len(runs)} traces. Analyzing...")
+    analysis = analyze_runs(runs)
+
+    if not analysis:
+        print("Analysis failed — no processable traces")
+        return
+
+    # Write markdown seed
+    os.makedirs(os.path.dirname(args.output_md) or ".", exist_ok=True)
+    md = generate_markdown_seed(analysis, args.project)
+    with open(args.output_md, "w") as f:
+        f.write(md)
+
+    # Write JSON summary
+    os.makedirs(os.path.dirname(args.output_json) or ".", exist_ok=True)
+    summary = generate_json_summary(analysis, args.project)
+    with open(args.output_json, "w") as f:
+        json.dump(summary, f, indent=2, ensure_ascii=False)
+
+    stats = analysis["stats"]
+    cats = len(analysis["categories"])
+    errs = stats["with_error"]
+    print(f"Production seed generated:")
+    print(f" {stats['total_traces']} traces, {cats} categories, {errs} errors ({stats['error_rate']:.1%})")
+    print(f" {args.output_md}")
+    print(f" {args.output_json}")


+if __name__ == "__main__":
+    main()
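A quick smoke test of the analysis contract, with fabricated runs. It assumes `seed_from_traces.py` is importable, e.g. when run from `package/tools/`:

```python
from seed_from_traces import analyze_runs, extract_input

fake_runs = [
    {"name": "route_query",
     "inputs": {"messages": [{"role": "user", "content": "what is my bill?"}]},
     "outputs": {"answer": "Your bill is $42."},
     "total_tokens": 120,
     "start_time": "2025-01-01T00:00:00Z",
     "end_time": "2025-01-01T00:00:02Z"},
    {"name": "route_query",
     "inputs": {"question": "refund pls"},
     "error": "ToolException: refund tool unavailable"},
]

assert extract_input(fake_runs[0]) == "what is my bill?"
analysis = analyze_runs(fake_runs)
print(analysis["stats"]["error_rate"])  # 0.5
print(analysis["categories"])           # {'route_query': 2}
```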