harness-evolver 2.7.0 → 2.8.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/agents/harness-evolver-proposer.md +18 -0
- package/agents/harness-evolver-testgen.md +15 -0
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +26 -0
- package/skills/init/SKILL.md +36 -1
- package/tools/__pycache__/init.cpython-313.pyc +0 -0
- package/tools/__pycache__/seed_from_traces.cpython-313.pyc +0 -0
- package/tools/init.py +63 -0
- package/tools/seed_from_traces.py +454 -0
package/agents/harness-evolver-proposer.md
CHANGED
@@ -40,6 +40,24 @@ These insights are generated from LangSmith traces cross-referenced with per-tas
 
 If trace insights are not available, proceed with manual trace analysis as described in Phase 2.
 
+## Production Insights
+
+If `.harness-evolver/production_seed.json` exists in your `<files_to_read>`, it contains **real production data** from the app's LangSmith project:
+
+- `categories` — real traffic distribution (which domains/routes get the most queries)
+- `error_patterns` — actual production errors and their frequency
+- `negative_feedback_inputs` — queries where users gave thumbs-down
+- `slow_queries` — high-latency queries that may indicate bottlenecks
+- `sample_inputs` — real user inputs grouped by category
+
+Use this data to:
+1. **Prioritize changes that fix real production failures** over synthetic test failures
+2. **Match the real traffic distribution** — if 60% of production queries are domain A, optimize for domain A
+3. **Focus on negative feedback patterns** — these are confirmed bad user experiences
+4. **Address latency outliers** — slow queries may need different routing, caching, or model selection
+
+Production data complements trace_insights.json. Trace insights show what happened in *harness evaluation runs*. Production insights show what happens in *real-world usage*.
+
 ## Context7 — Enrich Your Knowledge
 
 You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
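The proposer guidance above lends itself to a concrete sketch. The snippet below is a hypothetical illustration (the seed values and category names are invented) of how a proposer could turn the documented `production_seed.json` keys into a work ordering:

```python
# Hypothetical excerpt of .harness-evolver/production_seed.json.
seed = {
    "categories": {"billing": 60, "shipping": 25, "other": 15},
    "negative_feedback_inputs": ["why was I charged twice??"],
    "error_patterns": {"Timeout contacting payment API": 4},
}

# Rank domains by real traffic so proposals target the heaviest routes first.
priority = sorted(seed["categories"], key=seed["categories"].get, reverse=True)

# Confirmed-bad experiences and real errors outrank synthetic failures.
must_fix = seed["negative_feedback_inputs"] + list(seed["error_patterns"])
```

Here `priority` comes out `["billing", "shipping", "other"]`, matching guideline 2 above: if 60% of traffic is billing, billing fixes come first.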
package/agents/harness-evolver-testgen.md
CHANGED
@@ -36,6 +36,19 @@ Read the harness source code to understand:
 - What are its likely failure modes?
 - Are there any data files (knowledge bases, docs, etc.) that define the domain?
 
+### Phase 1.5: Use Production Traces (if available)
+
+If your prompt contains a `<production_traces>` block, this is **real data from production LangSmith traces**. This is the most valuable signal you have — real user inputs beat synthetic ones.
+
+When production traces are available:
+1. Read the traffic distribution — generate tasks proportional to real usage (if 60% of queries are domain A, 60% of tasks should cover domain A)
+2. Use actual user phrasing as inspiration — real inputs show abbreviations, typos, informal language
+3. Base edge cases on real error patterns — the errors listed are genuine failures, not imagined scenarios
+4. Prioritize negative feedback traces — these are confirmed bad experiences that MUST be covered
+5. Include slow queries as edge cases — high-latency traces may reveal timeout or complexity issues
+
+**Do NOT just copy production inputs verbatim.** Use them as inspiration to generate VARIATIONS that test the same capabilities.
+
 ### Phase 2: Design Test Distribution
 
 Plan 30 test cases with this distribution:
@@ -44,6 +57,8 @@ Plan 30 test cases with this distribution:
 - **20% Cross-Domain** (6 tasks): inputs spanning multiple categories or requiring nuanced judgment
 - **20% Adversarial** (6 tasks): misleading, ambiguous, or designed to expose weaknesses
 
+If production traces are available, adjust the distribution to match real traffic patterns instead of uniform.
+
 Ensure all categories/topics from the harness are covered.
 
 ### Phase 3: Generate Tasks
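The "adjust the distribution to match real traffic" instruction can be made concrete with a small allocation routine. The traffic percentages below are invented for illustration; the 30-task total comes from Phase 2:

```python
# Allocate 30 tasks proportionally to a (hypothetical) traffic distribution,
# flooring each share and handing leftover tasks to the busiest categories.
traffic = {"billing": 60, "shipping": 25, "other": 15}  # percent of real queries
total_tasks = 30

alloc = {cat: total_tasks * pct // 100 for cat, pct in traffic.items()}
leftover = total_tasks - sum(alloc.values())
for cat in sorted(traffic, key=traffic.get, reverse=True):
    if leftover <= 0:
        break
    alloc[cat] += 1
    leftover -= 1
```

With these numbers the result is 19/7/4: exactly 30 tasks, with the rounding slack going to the heaviest domain.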
package/package.json
CHANGED
package/skills/evolve/SKILL.md
CHANGED
@@ -34,6 +34,27 @@ For each iteration:
 python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
 ```
 
+### 1.4. Gather Production Insights (first iteration only)
+
+On the **first iteration**, if the project has a production LangSmith project configured but no production seed yet, fetch it:
+
+```bash
+PROD_PROJECT=$(python3 -c "
+import json, os
+c = json.load(open('.harness-evolver/config.json'))
+print(c.get('eval', {}).get('production_project', ''))
+" 2>/dev/null)
+if [ -n "$PROD_PROJECT" ] && [ ! -f ".harness-evolver/production_seed.json" ] && [ -n "$LANGSMITH_API_KEY" ]; then
+  python3 $TOOLS/seed_from_traces.py \
+    --project "$PROD_PROJECT" \
+    --output-md .harness-evolver/production_seed.md \
+    --output-json .harness-evolver/production_seed.json \
+    --limit 100 2>/dev/null
+fi
+```
+
+The `production_seed.json` is included in all proposers' `<files_to_read>` so they have real-world context about how the agent is actually used in production.
+
 ### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
 
 **Run these commands unconditionally after EVERY evaluation** (including baseline). Do NOT guess project names — discover them.
@@ -255,6 +276,7 @@ Agent(
 - .harness-evolver/langsmith_stats.json (if exists)
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
@@ -295,6 +317,7 @@ Agent(
 - .harness-evolver/langsmith_diagnosis.json (if exists)
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
@@ -334,6 +357,7 @@ Agent(
 - .harness-evolver/langsmith_diagnosis.json (if exists)
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
@@ -377,6 +401,7 @@ Agent(
 - .harness-evolver/harnesses/{best_version}/scores.json
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
@@ -438,6 +463,7 @@ Agent(
 - .harness-evolver/harnesses/{best_version}/scores.json
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
package/skills/init/SKILL.md
CHANGED
@@ -80,11 +80,21 @@ Agent(
 - /home/rp/Desktop/test-crewai/README.md
 </files_to_read>
 
+<production_traces>
+{IF .harness-evolver/production_seed.md EXISTS, paste its full contents here.
+This file contains real production inputs, traffic distribution, error patterns,
+and user feedback from LangSmith. Use it to generate REALISTIC test cases that
+match actual usage patterns instead of synthetic ones.
+
+If the file does not exist, omit this entire block.}
+</production_traces>
+
 <output>
 Create directory tasks/ (at project root) with 30 files: task_001.json through task_030.json.
 Format: {"id": "task_001", "input": "...", "metadata": {"difficulty": "easy|medium|hard", "type": "standard|edge|cross_domain|adversarial"}}
 No "expected" field needed — the judge subagent will score outputs.
 Distribution: 40% standard, 20% edge, 20% cross-domain, 20% adversarial.
+If production traces are available, match the real traffic distribution instead of uniform.
 </output>
 )
 ```
@@ -93,12 +103,37 @@ Wait for `## TESTGEN COMPLETE`. If the subagent fails or returns with no tasks,
 
 Print: "Generated {N} test cases from code analysis."
 
+If `.harness-evolver/production_seed.md` exists, also print:
+"Tasks enriched with production trace data from LangSmith."
+
 ## Phase 3: Run Init
 
+First, check if the project has a LangSmith production project configured:
+
+```bash
+# Auto-detect from env vars or .env
+PROD_PROJECT=$(python3 -c "
+import os
+for v in ('LANGCHAIN_PROJECT', 'LANGSMITH_PROJECT'):
+    p = os.environ.get(v, '')
+    if p: print(p); exit()
+for f in ('.env', '.env.local'):
+    if os.path.exists(f):
+        for line in open(f):
+            line = line.strip()
+            if '=' in line and not line.startswith('#'):
+                k, _, val = line.partition('=')
+                if k.strip() in ('LANGCHAIN_PROJECT', 'LANGSMITH_PROJECT'):
+                    print(val.strip().strip('\"').strip(\"'\"))
+                    exit()
+" 2>/dev/null)
+```
+
 ```bash
 python3 $TOOLS/init.py [directory] \
   --harness harness.py --eval eval.py --tasks tasks/ \
-  --tools-dir $TOOLS
+  --tools-dir $TOOLS \
+  ${PROD_PROJECT:+--langsmith-project "$PROD_PROJECT"}
 ```
 
 Add `--harness-config config.json` if a config exists.
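For readers unfamiliar with the `${PROD_PROJECT:+...}` expansion in the invocation above (the flag pair appears only when the variable is non-empty), the equivalent conditional-argument construction in Python looks like this; the command and project value are illustrative:

```python
# Build the init.py argument list, appending the optional flag pair only
# when a production project was detected (mirrors ${VAR:+word} in the shell).
def init_cmd(prod_project):
    cmd = ["python3", "tools/init.py", "--harness", "harness.py"]
    if prod_project:
        cmd += ["--langsmith-project", prod_project]
    return cmd
```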
package/tools/__pycache__/init.cpython-313.pyc
Binary file
package/tools/__pycache__/seed_from_traces.cpython-313.pyc
Binary file
package/tools/init.py
CHANGED
@@ -124,6 +124,40 @@ def _detect_langsmith():
     return {"enabled": False}
 
 
+def _detect_langsmith_project(search_dir="."):
+    """Auto-detect the app's existing LangSmith project name.
+
+    Checks (in order):
+    1. LANGCHAIN_PROJECT env var (standard LangChain convention)
+    2. LANGSMITH_PROJECT env var (alternative)
+    3. .env file in the project directory
+    """
+    for var in ("LANGCHAIN_PROJECT", "LANGSMITH_PROJECT"):
+        project = os.environ.get(var)
+        if project:
+            return project
+
+    # Parse .env file
+    for env_name in (".env", ".env.local"):
+        env_path = os.path.join(search_dir, env_name)
+        if os.path.exists(env_path):
+            try:
+                with open(env_path) as f:
+                    for line in f:
+                        line = line.strip()
+                        if line.startswith("#") or "=" not in line:
+                            continue
+                        key, _, val = line.partition("=")
+                        key = key.strip()
+                        val = val.strip().strip("'\"")
+                        if key in ("LANGCHAIN_PROJECT", "LANGSMITH_PROJECT") and val:
+                            return val
+            except OSError:
+                pass
+
+    return None
+
+
 def _check_langsmith_cli():
     """Check if langsmith-cli is installed."""
     try:
@@ -202,6 +236,9 @@ def main():
     parser.add_argument("--skip-validation", action="store_true",
                         help="Skip harness validation step. Use when you know the harness "
                              "works but validation times out (e.g. real LLM agent calls).")
+    parser.add_argument("--langsmith-project", default=None,
+                        help="Existing LangSmith project name with production traces. "
+                             "Auto-detected from LANGCHAIN_PROJECT / LANGSMITH_PROJECT env vars or .env file.")
     args = parser.parse_args()
 
     # Auto-detect missing args
@@ -280,6 +317,7 @@ def main():
             "args": ["--results-dir", "{results_dir}", "--tasks-dir", "{tasks_dir}",
                      "--scores", "{scores}"],
             "langsmith": _detect_langsmith(),
+            "production_project": args.langsmith_project or _detect_langsmith_project(search_dir),
         },
         "evolution": {
             "max_iterations": 10,
@@ -375,6 +413,31 @@ def main():
         except Exception:
             pass
 
+    # 4.5 Fetch production traces seed (if LangSmith production project detected)
+    prod_project = config["eval"].get("production_project")
+    if prod_project and os.environ.get("LANGSMITH_API_KEY"):
+        seed_py = os.path.join(tools, "seed_from_traces.py")
+        if os.path.exists(seed_py):
+            print(f"Fetching production traces from LangSmith project '{prod_project}'...")
+            try:
+                r = subprocess.run(
+                    [_resolve_python(), seed_py,
+                     "--project", prod_project,
+                     "--output-md", os.path.join(base, "production_seed.md"),
+                     "--output-json", os.path.join(base, "production_seed.json"),
+                     "--limit", "100"],
+                    capture_output=True, text=True, timeout=60,
+                )
+                if r.returncode == 0:
+                    print(r.stdout.strip())
+                else:
+                    print(f" Could not fetch production traces: {r.stderr.strip()[:200]}")
+            except Exception as e:
+                print(f" Production trace fetch failed: {e}")
+    elif prod_project:
+        print(f"Production LangSmith project detected: {prod_project}")
+        print(" Set LANGSMITH_API_KEY to auto-fetch production traces during init.")
+
     # 5. Validate baseline harness
     config_path = os.path.join(base, "baseline", "config.json")
     if args.skip_validation:
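The detection order `_detect_langsmith_project` implements can be exercised in miniature. This is a simplified re-implementation for illustration (it takes the environment and the `.env` lines as arguments rather than touching the filesystem): real environment variables win over `.env` entries, and comments are skipped.

```python
def detect_project(environ, env_file_lines):
    # 1) Environment variables take precedence, in the documented order.
    for var in ("LANGCHAIN_PROJECT", "LANGSMITH_PROJECT"):
        if environ.get(var):
            return environ[var]
    # 2) Fall back to .env-style lines, ignoring comments and non-assignments.
    for line in env_file_lines:
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, _, val = line.partition("=")
        val = val.strip().strip("'\"")
        if key.strip() in ("LANGCHAIN_PROJECT", "LANGSMITH_PROJECT") and val:
            return val
    return None
```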
package/tools/seed_from_traces.py
ADDED
@@ -0,0 +1,454 @@
+#!/usr/bin/env python3
+"""Fetch and summarize production LangSmith traces for Harness Evolver.
+
+Queries the LangSmith REST API directly (urllib, stdlib-only) to fetch
+production traces and produce:
+1. A markdown seed file for the testgen agent (production_seed.md)
+2. A JSON summary for programmatic use (production_seed.json)
+
+Usage:
+    python3 seed_from_traces.py \
+        --project ceppem-langgraph \
+        --output-md .harness-evolver/production_seed.md \
+        --output-json .harness-evolver/production_seed.json \
+        [--api-key-env LANGSMITH_API_KEY] \
+        [--limit 100]
+
+Stdlib-only. No external dependencies (no langsmith-cli needed).
+"""
+
+import argparse
+import json
+import os
+import sys
+import urllib.parse
+import urllib.request
+from collections import Counter
+from datetime import datetime, timezone
+
+LANGSMITH_API_BASE = "https://api.smith.langchain.com/api/v1"
+
+
+def langsmith_request(endpoint, api_key, method="GET", body=None, params=None):
+    """Make a request to the LangSmith REST API."""
+    url = f"{LANGSMITH_API_BASE}/{endpoint}"
+    if params:
+        url += "?" + urllib.parse.urlencode(params)
+
+    headers = {
+        "x-api-key": api_key,
+        "Accept": "application/json",
+    }
+
+    data = None
+    if body is not None:
+        headers["Content-Type"] = "application/json"
+        data = json.dumps(body).encode("utf-8")
+
+    req = urllib.request.Request(url, data=data, headers=headers, method=method)
+    try:
+        with urllib.request.urlopen(req, timeout=30) as resp:
+            return json.loads(resp.read())
+    except urllib.error.HTTPError as e:
+        body_text = ""
+        try:
+            body_text = e.read().decode("utf-8", errors="replace")[:500]
+        except Exception:
+            pass
+        print(f"LangSmith API error {e.code}: {body_text}", file=sys.stderr)
+        return None
+    except Exception as e:
+        print(f"LangSmith API request failed: {e}", file=sys.stderr)
+        return None
+
+
+def fetch_runs(project_name, api_key, limit=100):
+    """Fetch recent root runs from a LangSmith project."""
+    # Try POST /runs/query first (newer API)
+    body = {
+        "project_name": project_name,
+        "is_root": True,
+        "limit": limit,
+    }
+    result = langsmith_request("runs/query", api_key, method="POST", body=body)
+    if result and isinstance(result, dict):
+        return result.get("runs", result.get("results", []))
+    if result and isinstance(result, list):
+        return result
+
+    # Fallback: GET /runs with query params
+    params = {
+        "project_name": project_name,
+        "is_root": "true",
+        "limit": str(limit),
+    }
+    result = langsmith_request("runs", api_key, params=params)
+    if result and isinstance(result, list):
+        return result
+    if result and isinstance(result, dict):
+        return result.get("runs", result.get("results", []))
+
+    return []
+
+
+def extract_input(run):
+    """Extract user input from a run's inputs field."""
+    inputs = run.get("inputs", {})
+    if not inputs:
+        return None
+    if isinstance(inputs, str):
+        return inputs
+
+    # Direct field
+    for key in ("input", "question", "query", "prompt", "text", "user_input"):
+        if key in inputs and isinstance(inputs[key], str):
+            return inputs[key]
+
+    # LangChain messages format
+    messages = inputs.get("messages") or inputs.get("input")
+    if isinstance(messages, list):
+        if messages and isinstance(messages[0], list):
+            messages = messages[0]
+        for msg in messages:
+            if isinstance(msg, dict):
+                if msg.get("type") in ("human", "HumanMessage") or msg.get("role") == "user":
+                    content = msg.get("content", "")
+                    if isinstance(content, str) and content:
+                        return content
+                    if isinstance(content, list):
+                        for part in content:
+                            if isinstance(part, dict) and part.get("type") == "text":
+                                return part.get("text", "")
+            elif isinstance(msg, str) and msg:
+                return msg
+
+    return None
+
+
+def extract_output(run):
+    """Extract the output/response from a run."""
+    outputs = run.get("outputs", {})
+    if not outputs:
+        return None
+    if isinstance(outputs, str):
+        return outputs
+
+    for key in ("output", "answer", "result", "response", "text"):
+        if key in outputs and isinstance(outputs[key], str):
+            return outputs[key]
+
+    # LangChain messages format
+    messages = outputs.get("messages") or outputs.get("output")
+    if isinstance(messages, list):
+        if messages and isinstance(messages[0], list):
+            messages = messages[0]
+        for msg in reversed(messages):
+            if isinstance(msg, dict):
+                if msg.get("type") in ("ai", "AIMessage", "assistant") or msg.get("role") == "assistant":
+                    content = msg.get("content", "")
+                    if isinstance(content, str) and content:
+                        return content
+            elif isinstance(msg, str) and msg:
+                return msg
+
+    return None
+
+
+def get_feedback(run):
+    """Extract feedback from a run."""
+    fb = run.get("feedback_stats") or {}
+    if isinstance(fb, dict):
+        pos = fb.get("thumbs_up", 0) or fb.get("positive", 0) or 0
+        neg = fb.get("thumbs_down", 0) or fb.get("negative", 0) or 0
+        if neg > 0:
+            return "negative"
+        if pos > 0:
+            return "positive"
+    return None
+
+
+def categorize_run(run):
+    """Categorize a run by its name/type."""
+    name = run.get("name", "unknown")
+    # Use top-level run name as category
+    return name
+
+
+def analyze_runs(runs):
+    """Analyze a batch of runs and produce structured insights."""
+    if not runs:
+        return None
+
+    processed = []
+    categories = Counter()
+    errors = []
+    latencies = []
+    token_counts = []
+    feedbacks = {"positive": 0, "negative": 0, "none": 0}
+
+    for run in runs:
+        user_input = extract_input(run)
+        output = extract_output(run)
+        error = run.get("error")
+        tokens = run.get("total_tokens") or 0
+        latency_ms = None
+        feedback = get_feedback(run)
+
+        # Calculate latency from start/end times
+        start = run.get("start_time") or run.get("start_dt")
+        end = run.get("end_time") or run.get("end_dt")
+        if isinstance(start, str) and isinstance(end, str):
+            try:
+                from datetime import datetime as dt
+                s = dt.fromisoformat(start.replace("Z", "+00:00"))
+                e = dt.fromisoformat(end.replace("Z", "+00:00"))
+                latency_ms = int((e - s).total_seconds() * 1000)
+            except Exception:
+                pass
+        elif run.get("latency"):
+            latency_ms = int(run["latency"] * 1000) if isinstance(run["latency"], float) else run["latency"]
+
+        category = categorize_run(run)
+        categories[category] += 1
+
+        entry = {
+            "input": (user_input or "")[:500],
+            "output": (output or "")[:300],
+            "category": category,
+            "tokens": tokens,
+            "latency_ms": latency_ms,
+            "error": (error or "")[:200] if error else None,
+            "feedback": feedback,
+        }
+        processed.append(entry)
+
+        if error:
+            errors.append({"error": error[:200], "input": (user_input or "")[:200], "category": category})
+        if latency_ms:
+            latencies.append(latency_ms)
+        if tokens:
+            token_counts.append(tokens)
+
+        if feedback == "positive":
+            feedbacks["positive"] += 1
+        elif feedback == "negative":
+            feedbacks["negative"] += 1
+        else:
+            feedbacks["none"] += 1
+
+    # Compute statistics
+    stats = {
+        "total_traces": len(runs),
+        "with_input": sum(1 for p in processed if p["input"]),
+        "with_error": len(errors),
+        "error_rate": len(errors) / max(len(runs), 1),
+        "feedback": feedbacks,
+    }
+
+    if latencies:
+        latencies.sort()
+        stats["latency"] = {
+            "avg_ms": int(sum(latencies) / len(latencies)),
+            "p50_ms": latencies[len(latencies) // 2],
+            "p95_ms": latencies[int(len(latencies) * 0.95)] if len(latencies) >= 20 else latencies[-1],
+            "max_ms": latencies[-1],
+        }
+
+    if token_counts:
+        stats["tokens"] = {
+            "avg": int(sum(token_counts) / len(token_counts)),
+            "max": max(token_counts),
+            "total": sum(token_counts),
+        }
+
+    # Group by category
+    by_category = {}
+    for entry in processed:
+        cat = entry["category"]
+        by_category.setdefault(cat, []).append(entry)
+
+    # Error patterns
+    error_patterns = Counter()
+    for e in errors:
+        # Normalize error to first 60 chars
+        pattern = e["error"][:60]
+        error_patterns[pattern] += 1
+
+    return {
+        "stats": stats,
+        "categories": dict(categories.most_common()),
+        "by_category": by_category,
+        "error_patterns": dict(error_patterns.most_common(10)),
+        "errors": errors[:20],
+        "processed": processed,
+    }
+
+
+def generate_markdown_seed(analysis, project_name):
+    """Generate a markdown seed file for the testgen agent."""
+    stats = analysis["stats"]
+    lines = [
+        f"# Production Trace Analysis: {project_name}",
+        "",
+        f"*{stats['total_traces']} traces analyzed*",
+        "",
+        "## Key Metrics",
+        "",
+        f"- **Error rate**: {stats['error_rate']:.1%}",
+    ]
+
+    if "latency" in stats:
+        lat = stats["latency"]
+        lines.append(f"- **Latency**: {lat['avg_ms']}ms avg, {lat['p50_ms']}ms p50, {lat['p95_ms']}ms p95")
+
+    if "tokens" in stats:
+        tok = stats["tokens"]
+        lines.append(f"- **Tokens**: {tok['avg']} avg, {tok['max']} max")
+
+    fb = stats["feedback"]
+    total_fb = fb["positive"] + fb["negative"]
+    if total_fb > 0:
+        lines.append(f"- **User feedback**: {fb['positive']}/{total_fb} positive ({fb['positive']/total_fb:.0%})")
+
+    # Traffic distribution
+    lines.extend(["", "## Traffic Distribution", ""])
+    total = stats["total_traces"]
+    for cat, count in sorted(analysis["categories"].items(), key=lambda x: -x[1]):
+        pct = count / max(total, 1) * 100
+        lines.append(f"- **{cat}**: {count} traces ({pct:.0f}%)")
+
+    # Sample inputs by category
+    lines.extend(["", "## Sample Inputs by Category", ""])
+    for cat, entries in sorted(analysis["by_category"].items(), key=lambda x: -len(x[1])):
+        lines.append(f"### {cat} ({len(entries)} traces)")
+        lines.append("")
+        # Show up to 8 sample inputs per category
+        shown = 0
+        for entry in entries:
+            if not entry["input"] or shown >= 8:
+                break
+            status = "ERROR" if entry["error"] else "ok"
+            tok_str = f", {entry['tokens']}tok" if entry["tokens"] else ""
+            lat_str = f", {entry['latency_ms']}ms" if entry["latency_ms"] else ""
+            fb_str = ""
+            if entry["feedback"] == "negative":
+                fb_str = " [NEGATIVE FEEDBACK]"
+            elif entry["feedback"] == "positive":
+                fb_str = " [+]"
+            lines.append(f'- "{entry["input"][:150]}" ({status}{tok_str}{lat_str}){fb_str}')
+            shown += 1
+        lines.append("")
+
+    # Error patterns
+    if analysis["error_patterns"]:
+        lines.extend(["## Error Patterns", ""])
+        for pattern, count in analysis["error_patterns"].items():
+            lines.append(f"- **{pattern}**: {count} occurrences")
+        lines.append("")
+
+    # Negative feedback traces
+    neg_traces = [e for e in analysis["processed"] if e["feedback"] == "negative" and e["input"]]
+    if neg_traces:
+        lines.extend(["## Traces with Negative Feedback (high priority)", ""])
+        for entry in neg_traces[:10]:
+            lines.append(f'- "{entry["input"][:200]}" → category: {entry["category"]}')
+        lines.append("")
+
+    # Guidance for testgen
+    lines.extend([
+        "## Guidance for Test Generation",
+        "",
+        "Use the above data to generate test cases that:",
+        "1. **Match the real traffic distribution** — generate more tasks for high-traffic categories",
+        "2. **Include actual user phrasing** — real inputs show how users actually communicate (informal, abbreviations, typos)",
+        "3. **Cover real error patterns** — the errors above are genuine failure modes, not imagined scenarios",
+        "4. **Prioritize negative feedback traces** — these are confirmed bad experiences",
+        "5. **Include slow queries as edge cases** — high-latency traces may reveal timeout or complexity issues",
+    ])
+
+    return "\n".join(lines)
+
+
+def generate_json_summary(analysis, project_name):
+    """Generate a JSON summary for programmatic use."""
+    return {
+        "project": project_name,
+        "generated_at": datetime.now(timezone.utc).isoformat(),
+        "stats": analysis["stats"],
+        "categories": analysis["categories"],
+        "error_patterns": analysis["error_patterns"],
+        "sample_inputs": {
+            cat: [e["input"] for e in entries if e["input"]][:10]
+            for cat, entries in analysis["by_category"].items()
+        },
+        "negative_feedback_inputs": [
+            e["input"] for e in analysis["processed"]
+            if e["feedback"] == "negative" and e["input"]
+        ][:20],
+        "slow_queries": [
+            {"input": e["input"][:200], "latency_ms": e["latency_ms"], "category": e["category"]}
+            for e in sorted(analysis["processed"], key=lambda x: -(x["latency_ms"] or 0))
+            if e["latency_ms"] and e["input"]
+        ][:10],
+    }
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Fetch and summarize production LangSmith traces")
+    parser.add_argument("--project", required=True, help="LangSmith project name")
+    parser.add_argument("--api-key-env", default="LANGSMITH_API_KEY",
+                        help="Env var containing API key (default: LANGSMITH_API_KEY)")
+    parser.add_argument("--limit", type=int, default=100, help="Max traces to fetch (default: 100)")
+    parser.add_argument("--output-md", required=True, help="Output path for markdown seed")
+    parser.add_argument("--output-json", required=True, help="Output path for JSON summary")
+    args = parser.parse_args()
+
+    api_key = os.environ.get(args.api_key_env, "")
+    if not api_key:
+        print(f"No API key found in ${args.api_key_env} — cannot fetch production traces", file=sys.stderr)
+        sys.exit(1)
+
+    print(f"Fetching up to {args.limit} traces from LangSmith project '{args.project}'...")
+    runs = fetch_runs(args.project, api_key, args.limit)
+
+    if not runs:
+        print("No traces found. The project may be empty or the name may be wrong.")
+        # Write empty files so downstream doesn't break
+        for path in [args.output_md, args.output_json]:
+            os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
+        with open(args.output_md, "w") as f:
+            f.write(f"# Production Trace Analysis: {args.project}\n\nNo traces found.\n")
+        with open(args.output_json, "w") as f:
+            json.dump({"project": args.project, "stats": {"total_traces": 0}}, f, indent=2)
+        return
+
+    print(f"Fetched {len(runs)} traces. Analyzing...")
+    analysis = analyze_runs(runs)
+
+    if not analysis:
+        print("Analysis failed — no processable traces")
+        return
+
+    # Write markdown seed
+    os.makedirs(os.path.dirname(args.output_md) or ".", exist_ok=True)
+    md = generate_markdown_seed(analysis, args.project)
+    with open(args.output_md, "w") as f:
+        f.write(md)
+
+    # Write JSON summary
+    os.makedirs(os.path.dirname(args.output_json) or ".", exist_ok=True)
+    summary = generate_json_summary(analysis, args.project)
+    with open(args.output_json, "w") as f:
+        json.dump(summary, f, indent=2, ensure_ascii=False)
+
+    stats = analysis["stats"]
+    cats = len(analysis["categories"])
+    errs = stats["with_error"]
+    print(f"Production seed generated:")
+    print(f" {stats['total_traces']} traces, {cats} categories, {errs} errors ({stats['error_rate']:.1%})")
+    print(f" {args.output_md}")
+    print(f" {args.output_json}")
+
+
+if __name__ == "__main__":
+    main()
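The trickiest part of the script above is input extraction, since LangChain runs store user text in several shapes. A miniature re-implementation of that logic (simplified from `extract_input`, for illustration only):

```python
def first_human_content(inputs):
    # Plain direct fields first, as in the full extractor.
    for key in ("input", "question", "query"):
        if isinstance(inputs.get(key), str):
            return inputs[key]
    # Then a LangChain-style messages list: take the first human/user message.
    for msg in inputs.get("messages", []):
        if isinstance(msg, dict) and (msg.get("type") == "human" or msg.get("role") == "user"):
            return msg.get("content", "")
    return None
```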