harness-evolver 2.7.0 → 2.8.0

This diff shows the changes between publicly released package versions as they appear in their respective public registries, and is provided for informational purposes only.
@@ -40,6 +40,24 @@ These insights are generated from LangSmith traces cross-referenced with per-tas
 
 If trace insights are not available, proceed with manual trace analysis as described in Phase 2.
 
+## Production Insights
+
+If `.harness-evolver/production_seed.json` exists in your `<files_to_read>`, it contains **real production data** from the app's LangSmith project:
+
+- `categories` — real traffic distribution (which domains/routes get the most queries)
+- `error_patterns` — actual production errors and their frequency
+- `negative_feedback_inputs` — queries where users gave thumbs-down
+- `slow_queries` — high-latency queries that may indicate bottlenecks
+- `sample_inputs` — real user inputs grouped by category
+
+Use this data to:
+1. **Prioritize changes that fix real production failures** over synthetic test failures
+2. **Match the real traffic distribution** — if 60% of production queries are domain A, optimize for domain A
+3. **Focus on negative feedback patterns** — these are confirmed bad user experiences
+4. **Address latency outliers** — slow queries may need different routing, caching, or model selection
+
+Production data complements `trace_insights.json`. Trace insights show what happened in *harness evaluation runs*. Production insights show what happens in *real-world usage*.
+
 ## Context7 — Enrich Your Knowledge
 
 You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
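Downstream tooling can consume the seed fields directly. A minimal sketch, assuming only the schema listed above — the field names come from this diff, while the sample values are invented:

```python
# Sketch: consume the production_seed.json schema described above.
# Field names are from the diff; the sample values below are invented.
seed = {
    "categories": {"domain_a": 60, "domain_b": 25, "domain_c": 15},
    "error_patterns": {"TimeoutError: agent exceeded 30s": 4},
    "negative_feedback_inputs": ["why is my invoice wrong??"],
    "slow_queries": [{"input": "summarize all docs", "latency_ms": 42000,
                      "category": "domain_a"}],
}

# Match the real traffic distribution: weight each category by its share.
total = sum(seed["categories"].values())
weights = {cat: count / total for cat, count in seed["categories"].items()}
print(max(weights, key=weights.get))  # highest-traffic category
```

In practice the dict would come from `json.load(open(".harness-evolver/production_seed.json"))` rather than an inline literal.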
@@ -36,6 +36,19 @@ Read the harness source code to understand:
 - What are its likely failure modes?
 - Are there any data files (knowledge bases, docs, etc.) that define the domain?
 
+### Phase 1.5: Use Production Traces (if available)
+
+If your prompt contains a `<production_traces>` block, this is **real data from production LangSmith traces**. This is the most valuable signal you have — real user inputs beat synthetic ones.
+
+When production traces are available:
+1. Read the traffic distribution — generate tasks proportional to real usage (if 60% of queries are domain A, 60% of tasks should cover domain A)
+2. Use actual user phrasing as inspiration — real inputs show abbreviations, typos, informal language
+3. Base edge cases on real error patterns — the errors listed are genuine failures, not imagined scenarios
+4. Prioritize negative feedback traces — these are confirmed bad experiences that MUST be covered
+5. Include slow queries as edge cases — high-latency traces may reveal timeout or complexity issues
+
+**Do NOT just copy production inputs verbatim.** Use them as inspiration to generate VARIATIONS that test the same capabilities.
+
 ### Phase 2: Design Test Distribution
 
 Plan 30 test cases with this distribution:
@@ -44,6 +57,8 @@ Plan 30 test cases with this distribution:
 - **20% Cross-Domain** (6 tasks): inputs spanning multiple categories or requiring nuanced judgment
 - **20% Adversarial** (6 tasks): misleading, ambiguous, or designed to expose weaknesses
 
+If production traces are available, adjust the distribution to match real traffic patterns instead of keeping it uniform.
+
 Ensure all categories/topics from the harness are covered.
 
 ### Phase 3: Generate Tasks
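The proportional re-allocation described above can be sketched as follows — the category shares are hypothetical, and 30 matches the task count in the plan:

```python
# Allocate the 30 tasks proportional to real traffic instead of uniformly.
traffic = {"domain_a": 60, "domain_b": 25, "domain_c": 15}  # hypothetical shares
n_tasks = 30
total = sum(traffic.values())
alloc = {cat: (count * n_tasks) // total for cat, count in traffic.items()}
# Hand out any rounding remainder to the highest-traffic categories first.
remainder = n_tasks - sum(alloc.values())
for cat in sorted(traffic, key=traffic.get, reverse=True)[:remainder]:
    alloc[cat] += 1
print(alloc)
```

Largest-share-first remainder handling keeps the total at exactly 30 while biasing any leftover tasks toward the busiest categories.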
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "harness-evolver",
-  "version": "2.7.0",
+  "version": "2.8.0",
   "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
   "author": "Raphael Valdetaro",
   "license": "MIT",
@@ -34,6 +34,27 @@ For each iteration:
 python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
 ```
 
+### 1.4. Gather Production Insights (first iteration only)
+
+On the **first iteration**, if the project has a production LangSmith project configured but no production seed yet, fetch it:
+
+```bash
+PROD_PROJECT=$(python3 -c "
+import json, os
+c = json.load(open('.harness-evolver/config.json'))
+print(c.get('eval', {}).get('production_project', ''))
+" 2>/dev/null)
+if [ -n "$PROD_PROJECT" ] && [ ! -f ".harness-evolver/production_seed.json" ] && [ -n "$LANGSMITH_API_KEY" ]; then
+  python3 $TOOLS/seed_from_traces.py \
+    --project "$PROD_PROJECT" \
+    --output-md .harness-evolver/production_seed.md \
+    --output-json .harness-evolver/production_seed.json \
+    --limit 100 2>/dev/null
+fi
+```
+
+The `production_seed.json` is included in all proposers' `<files_to_read>` so they have real-world context about how the agent is actually used in production.
+
 ### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
 
 **Run these commands unconditionally after EVERY evaluation** (including baseline). Do NOT guess project names — discover them.
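The gate in the snippet above (project configured, no seed yet, API key present) can equally be expressed in Python. A sketch only: the paths mirror the diff, and the helper name `should_fetch_seed` is invented here:

```python
import json
import os

def should_fetch_seed(config_path=".harness-evolver/config.json",
                      seed_path=".harness-evolver/production_seed.json"):
    """Mirror the shell gate: fetch only when a production project is
    configured, no seed file exists yet, and LANGSMITH_API_KEY is set."""
    try:
        with open(config_path) as f:
            project = json.load(f).get("eval", {}).get("production_project")
    except (OSError, ValueError):
        return False
    return bool(project) and not os.path.exists(seed_path) \
        and bool(os.environ.get("LANGSMITH_API_KEY"))
```

A missing or unreadable config simply means no fetch, matching the shell snippet's `2>/dev/null` behavior.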
@@ -255,6 +276,7 @@ Agent(
 - .harness-evolver/langsmith_stats.json (if exists)
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
@@ -295,6 +317,7 @@ Agent(
 - .harness-evolver/langsmith_diagnosis.json (if exists)
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
@@ -334,6 +357,7 @@ Agent(
 - .harness-evolver/langsmith_diagnosis.json (if exists)
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
@@ -377,6 +401,7 @@ Agent(
 - .harness-evolver/harnesses/{best_version}/scores.json
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
@@ -438,6 +463,7 @@ Agent(
 - .harness-evolver/harnesses/{best_version}/scores.json
 - .harness-evolver/langsmith_runs.json (if exists)
 - .harness-evolver/trace_insights.json (if exists)
+- .harness-evolver/production_seed.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
@@ -80,11 +80,21 @@ Agent(
 - /home/rp/Desktop/test-crewai/README.md
 </files_to_read>
 
+<production_traces>
+{IF .harness-evolver/production_seed.md EXISTS, paste its full contents here.
+This file contains real production inputs, traffic distribution, error patterns,
+and user feedback from LangSmith. Use it to generate REALISTIC test cases that
+match actual usage patterns instead of synthetic ones.
+
+If the file does not exist, omit this entire block.}
+</production_traces>
+
 <output>
 Create directory tasks/ (at project root) with 30 files: task_001.json through task_030.json.
 Format: {"id": "task_001", "input": "...", "metadata": {"difficulty": "easy|medium|hard", "type": "standard|edge|cross_domain|adversarial"}}
 No "expected" field needed — the judge subagent will score outputs.
 Distribution: 40% standard, 20% edge, 20% cross-domain, 20% adversarial.
+If production traces are available, match the real traffic distribution instead of a uniform split.
 </output>
 )
 ```
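The conditional `<production_traces>` block in the template above can be assembled mechanically. A sketch: the path comes from the diff, while the function name `production_traces_block` is hypothetical:

```python
import os

def production_traces_block(path=".harness-evolver/production_seed.md"):
    """Return the <production_traces> block with the seed file inlined,
    or an empty string so the block is omitted entirely when absent."""
    if not os.path.exists(path):
        return ""
    with open(path) as f:
        return f"<production_traces>\n{f.read()}\n</production_traces>"
```

Returning an empty string (rather than an empty tag pair) implements the template's "omit this entire block" instruction.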
@@ -93,12 +103,37 @@ Wait for `## TESTGEN COMPLETE`. If the subagent fails or returns with no tasks,
 
 Print: "Generated {N} test cases from code analysis."
 
+If `.harness-evolver/production_seed.md` exists, also print:
+"Tasks enriched with production trace data from LangSmith."
+
 ## Phase 3: Run Init
 
+First, check if the project has a LangSmith production project configured:
+
+```bash
+# Auto-detect from env vars or .env
+PROD_PROJECT=$(python3 -c "
+import os
+for v in ('LANGCHAIN_PROJECT', 'LANGSMITH_PROJECT'):
+    p = os.environ.get(v, '')
+    if p: print(p); exit()
+for f in ('.env', '.env.local'):
+    if os.path.exists(f):
+        for line in open(f):
+            line = line.strip()
+            if '=' in line and not line.startswith('#'):
+                k, _, val = line.partition('=')
+                if k.strip() in ('LANGCHAIN_PROJECT', 'LANGSMITH_PROJECT'):
+                    print(val.strip().strip('\"').strip(\"'\"))
+                    exit()
+" 2>/dev/null)
+```
+
 ```bash
 python3 $TOOLS/init.py [directory] \
   --harness harness.py --eval eval.py --tasks tasks/ \
-  --tools-dir $TOOLS
+  --tools-dir $TOOLS \
+  ${PROD_PROJECT:+--langsmith-project "$PROD_PROJECT"}
 ```
 
 Add `--harness-config config.json` if a config exists.
package/tools/init.py CHANGED
@@ -124,6 +124,40 @@ def _detect_langsmith():
     return {"enabled": False}
 
 
+def _detect_langsmith_project(search_dir="."):
+    """Auto-detect the app's existing LangSmith project name.
+
+    Checks (in order):
+    1. LANGCHAIN_PROJECT env var (standard LangChain convention)
+    2. LANGSMITH_PROJECT env var (alternative)
+    3. .env file in the project directory
+    """
+    for var in ("LANGCHAIN_PROJECT", "LANGSMITH_PROJECT"):
+        project = os.environ.get(var)
+        if project:
+            return project
+
+    # Parse .env file
+    for env_name in (".env", ".env.local"):
+        env_path = os.path.join(search_dir, env_name)
+        if os.path.exists(env_path):
+            try:
+                with open(env_path) as f:
+                    for line in f:
+                        line = line.strip()
+                        if line.startswith("#") or "=" not in line:
+                            continue
+                        key, _, val = line.partition("=")
+                        key = key.strip()
+                        val = val.strip().strip("'\"")
+                        if key in ("LANGCHAIN_PROJECT", "LANGSMITH_PROJECT") and val:
+                            return val
+            except OSError:
+                pass
+
+    return None
+
+
 def _check_langsmith_cli():
     """Check if langsmith-cli is installed."""
     try:
@@ -202,6 +236,9 @@ def main():
     parser.add_argument("--skip-validation", action="store_true",
                         help="Skip harness validation step. Use when you know the harness "
                              "works but validation times out (e.g. real LLM agent calls).")
+    parser.add_argument("--langsmith-project", default=None,
+                        help="Existing LangSmith project name with production traces. "
+                             "Auto-detected from LANGCHAIN_PROJECT / LANGSMITH_PROJECT env vars or .env file.")
     args = parser.parse_args()
 
     # Auto-detect missing args
@@ -280,6 +317,7 @@ def main():
         "args": ["--results-dir", "{results_dir}", "--tasks-dir", "{tasks_dir}",
                  "--scores", "{scores}"],
         "langsmith": _detect_langsmith(),
+        "production_project": args.langsmith_project or _detect_langsmith_project(search_dir),
     },
     "evolution": {
         "max_iterations": 10,
@@ -375,6 +413,31 @@ def main():
     except Exception:
         pass
 
+    # 4.5 Fetch production traces seed (if LangSmith production project detected)
+    prod_project = config["eval"].get("production_project")
+    if prod_project and os.environ.get("LANGSMITH_API_KEY"):
+        seed_py = os.path.join(tools, "seed_from_traces.py")
+        if os.path.exists(seed_py):
+            print(f"Fetching production traces from LangSmith project '{prod_project}'...")
+            try:
+                r = subprocess.run(
+                    [_resolve_python(), seed_py,
+                     "--project", prod_project,
+                     "--output-md", os.path.join(base, "production_seed.md"),
+                     "--output-json", os.path.join(base, "production_seed.json"),
+                     "--limit", "100"],
+                    capture_output=True, text=True, timeout=60,
+                )
+                if r.returncode == 0:
+                    print(r.stdout.strip())
+                else:
+                    print(f"  Could not fetch production traces: {r.stderr.strip()[:200]}")
+            except Exception as e:
+                print(f"  Production trace fetch failed: {e}")
+    elif prod_project:
+        print(f"Production LangSmith project detected: {prod_project}")
+        print("  Set LANGSMITH_API_KEY to auto-fetch production traces during init.")
+
     # 5. Validate baseline harness
     config_path = os.path.join(base, "baseline", "config.json")
@@ -0,0 +1,454 @@
+#!/usr/bin/env python3
+"""Fetch and summarize production LangSmith traces for Harness Evolver.
+
+Queries the LangSmith REST API directly (urllib, stdlib-only) to fetch
+production traces and produce:
+1. A markdown seed file for the testgen agent (production_seed.md)
+2. A JSON summary for programmatic use (production_seed.json)
+
+Usage:
+    python3 seed_from_traces.py \
+        --project ceppem-langgraph \
+        --output-md .harness-evolver/production_seed.md \
+        --output-json .harness-evolver/production_seed.json \
+        [--api-key-env LANGSMITH_API_KEY] \
+        [--limit 100]
+
+Stdlib-only. No external dependencies (no langsmith-cli needed).
+"""
+
+import argparse
+import json
+import os
+import sys
+import urllib.parse
+import urllib.request
+from collections import Counter
+from datetime import datetime, timezone
+
+LANGSMITH_API_BASE = "https://api.smith.langchain.com/api/v1"
+
+
+def langsmith_request(endpoint, api_key, method="GET", body=None, params=None):
+    """Make a request to the LangSmith REST API."""
+    url = f"{LANGSMITH_API_BASE}/{endpoint}"
+    if params:
+        url += "?" + urllib.parse.urlencode(params)
+
+    headers = {
+        "x-api-key": api_key,
+        "Accept": "application/json",
+    }
+
+    data = None
+    if body is not None:
+        headers["Content-Type"] = "application/json"
+        data = json.dumps(body).encode("utf-8")
+
+    req = urllib.request.Request(url, data=data, headers=headers, method=method)
+    try:
+        with urllib.request.urlopen(req, timeout=30) as resp:
+            return json.loads(resp.read())
+    except urllib.error.HTTPError as e:
+        body_text = ""
+        try:
+            body_text = e.read().decode("utf-8", errors="replace")[:500]
+        except Exception:
+            pass
+        print(f"LangSmith API error {e.code}: {body_text}", file=sys.stderr)
+        return None
+    except Exception as e:
+        print(f"LangSmith API request failed: {e}", file=sys.stderr)
+        return None
+
+
+def fetch_runs(project_name, api_key, limit=100):
+    """Fetch recent root runs from a LangSmith project."""
+    # Try POST /runs/query first (newer API)
+    body = {
+        "project_name": project_name,
+        "is_root": True,
+        "limit": limit,
+    }
+    result = langsmith_request("runs/query", api_key, method="POST", body=body)
+    if result and isinstance(result, dict):
+        return result.get("runs", result.get("results", []))
+    if result and isinstance(result, list):
+        return result
+
+    # Fallback: GET /runs with query params
+    params = {
+        "project_name": project_name,
+        "is_root": "true",
+        "limit": str(limit),
+    }
+    result = langsmith_request("runs", api_key, params=params)
+    if result and isinstance(result, list):
+        return result
+    if result and isinstance(result, dict):
+        return result.get("runs", result.get("results", []))
+
+    return []
+
+
+def extract_input(run):
+    """Extract user input from a run's inputs field."""
+    inputs = run.get("inputs", {})
+    if not inputs:
+        return None
+    if isinstance(inputs, str):
+        return inputs
+
+    # Direct field
+    for key in ("input", "question", "query", "prompt", "text", "user_input"):
+        if key in inputs and isinstance(inputs[key], str):
+            return inputs[key]
+
+    # LangChain messages format
+    messages = inputs.get("messages") or inputs.get("input")
+    if isinstance(messages, list):
+        if messages and isinstance(messages[0], list):
+            messages = messages[0]
+        for msg in messages:
+            if isinstance(msg, dict):
+                if msg.get("type") in ("human", "HumanMessage") or msg.get("role") == "user":
+                    content = msg.get("content", "")
+                    if isinstance(content, str) and content:
+                        return content
+                    if isinstance(content, list):
+                        for part in content:
+                            if isinstance(part, dict) and part.get("type") == "text":
+                                return part.get("text", "")
+            elif isinstance(msg, str) and msg:
+                return msg
+
+    return None
+
+
+def extract_output(run):
+    """Extract the output/response from a run."""
+    outputs = run.get("outputs", {})
+    if not outputs:
+        return None
+    if isinstance(outputs, str):
+        return outputs
+
+    for key in ("output", "answer", "result", "response", "text"):
+        if key in outputs and isinstance(outputs[key], str):
+            return outputs[key]
+
+    # LangChain messages format
+    messages = outputs.get("messages") or outputs.get("output")
+    if isinstance(messages, list):
+        if messages and isinstance(messages[0], list):
+            messages = messages[0]
+        for msg in reversed(messages):
+            if isinstance(msg, dict):
+                if msg.get("type") in ("ai", "AIMessage", "assistant") or msg.get("role") == "assistant":
+                    content = msg.get("content", "")
+                    if isinstance(content, str) and content:
+                        return content
+            elif isinstance(msg, str) and msg:
+                return msg
+
+    return None
+
+
+def get_feedback(run):
+    """Extract feedback from a run."""
+    fb = run.get("feedback_stats") or {}
+    if isinstance(fb, dict):
+        pos = fb.get("thumbs_up", 0) or fb.get("positive", 0) or 0
+        neg = fb.get("thumbs_down", 0) or fb.get("negative", 0) or 0
+        if neg > 0:
+            return "negative"
+        if pos > 0:
+            return "positive"
+    return None
+
+
+def categorize_run(run):
+    """Categorize a run by its name/type."""
+    name = run.get("name", "unknown")
+    # Use top-level run name as category
+    return name
+
+
+def analyze_runs(runs):
+    """Analyze a batch of runs and produce structured insights."""
+    if not runs:
+        return None
+
+    processed = []
+    categories = Counter()
+    errors = []
+    latencies = []
+    token_counts = []
+    feedbacks = {"positive": 0, "negative": 0, "none": 0}
+
+    for run in runs:
+        user_input = extract_input(run)
+        output = extract_output(run)
+        error = run.get("error")
+        tokens = run.get("total_tokens") or 0
+        latency_ms = None
+        feedback = get_feedback(run)
+
+        # Calculate latency from start/end times
+        start = run.get("start_time") or run.get("start_dt")
+        end = run.get("end_time") or run.get("end_dt")
+        if isinstance(start, str) and isinstance(end, str):
+            try:
+                from datetime import datetime as dt
+                s = dt.fromisoformat(start.replace("Z", "+00:00"))
+                e = dt.fromisoformat(end.replace("Z", "+00:00"))
+                latency_ms = int((e - s).total_seconds() * 1000)
+            except Exception:
+                pass
+        elif run.get("latency"):
+            latency_ms = int(run["latency"] * 1000) if isinstance(run["latency"], float) else run["latency"]
+
+        category = categorize_run(run)
+        categories[category] += 1
+
+        entry = {
+            "input": (user_input or "")[:500],
+            "output": (output or "")[:300],
+            "category": category,
+            "tokens": tokens,
+            "latency_ms": latency_ms,
+            "error": (error or "")[:200] if error else None,
+            "feedback": feedback,
+        }
+        processed.append(entry)
+
+        if error:
+            errors.append({"error": error[:200], "input": (user_input or "")[:200], "category": category})
+        if latency_ms:
+            latencies.append(latency_ms)
+        if tokens:
+            token_counts.append(tokens)
+
+        if feedback == "positive":
+            feedbacks["positive"] += 1
+        elif feedback == "negative":
+            feedbacks["negative"] += 1
+        else:
+            feedbacks["none"] += 1
+
+    # Compute statistics
+    stats = {
+        "total_traces": len(runs),
+        "with_input": sum(1 for p in processed if p["input"]),
+        "with_error": len(errors),
+        "error_rate": len(errors) / max(len(runs), 1),
+        "feedback": feedbacks,
+    }
+
+    if latencies:
+        latencies.sort()
+        stats["latency"] = {
+            "avg_ms": int(sum(latencies) / len(latencies)),
+            "p50_ms": latencies[len(latencies) // 2],
+            "p95_ms": latencies[int(len(latencies) * 0.95)] if len(latencies) >= 20 else latencies[-1],
+            "max_ms": latencies[-1],
+        }
+
+    if token_counts:
+        stats["tokens"] = {
+            "avg": int(sum(token_counts) / len(token_counts)),
+            "max": max(token_counts),
+            "total": sum(token_counts),
+        }
+
+    # Group by category
+    by_category = {}
+    for entry in processed:
+        cat = entry["category"]
+        by_category.setdefault(cat, []).append(entry)
+
+    # Error patterns
+    error_patterns = Counter()
+    for e in errors:
+        # Normalize error to first 60 chars
+        pattern = e["error"][:60]
+        error_patterns[pattern] += 1
+
+    return {
+        "stats": stats,
+        "categories": dict(categories.most_common()),
+        "by_category": by_category,
+        "error_patterns": dict(error_patterns.most_common(10)),
+        "errors": errors[:20],
+        "processed": processed,
+    }
+
+
+def generate_markdown_seed(analysis, project_name):
+    """Generate a markdown seed file for the testgen agent."""
+    stats = analysis["stats"]
+    lines = [
+        f"# Production Trace Analysis: {project_name}",
+        "",
+        f"*{stats['total_traces']} traces analyzed*",
+        "",
+        "## Key Metrics",
+        "",
+        f"- **Error rate**: {stats['error_rate']:.1%}",
+    ]
+
+    if "latency" in stats:
+        lat = stats["latency"]
+        lines.append(f"- **Latency**: {lat['avg_ms']}ms avg, {lat['p50_ms']}ms p50, {lat['p95_ms']}ms p95")
+
+    if "tokens" in stats:
+        tok = stats["tokens"]
+        lines.append(f"- **Tokens**: {tok['avg']} avg, {tok['max']} max")
+
+    fb = stats["feedback"]
+    total_fb = fb["positive"] + fb["negative"]
+    if total_fb > 0:
+        lines.append(f"- **User feedback**: {fb['positive']}/{total_fb} positive ({fb['positive']/total_fb:.0%})")
+
+    # Traffic distribution
+    lines.extend(["", "## Traffic Distribution", ""])
+    total = stats["total_traces"]
+    for cat, count in sorted(analysis["categories"].items(), key=lambda x: -x[1]):
+        pct = count / max(total, 1) * 100
+        lines.append(f"- **{cat}**: {count} traces ({pct:.0f}%)")
+
+    # Sample inputs by category
+    lines.extend(["", "## Sample Inputs by Category", ""])
+    for cat, entries in sorted(analysis["by_category"].items(), key=lambda x: -len(x[1])):
+        lines.append(f"### {cat} ({len(entries)} traces)")
+        lines.append("")
+        # Show up to 8 sample inputs per category
+        shown = 0
+        for entry in entries:
+            if not entry["input"] or shown >= 8:
+                break
+            status = "ERROR" if entry["error"] else "ok"
+            tok_str = f", {entry['tokens']}tok" if entry["tokens"] else ""
+            lat_str = f", {entry['latency_ms']}ms" if entry["latency_ms"] else ""
+            fb_str = ""
+            if entry["feedback"] == "negative":
+                fb_str = " [NEGATIVE FEEDBACK]"
+            elif entry["feedback"] == "positive":
+                fb_str = " [+]"
+            lines.append(f'- "{entry["input"][:150]}" ({status}{tok_str}{lat_str}){fb_str}')
+            shown += 1
+        lines.append("")
+
+    # Error patterns
+    if analysis["error_patterns"]:
+        lines.extend(["## Error Patterns", ""])
+        for pattern, count in analysis["error_patterns"].items():
+            lines.append(f"- **{pattern}**: {count} occurrences")
+        lines.append("")
+
+    # Negative feedback traces
+    neg_traces = [e for e in analysis["processed"] if e["feedback"] == "negative" and e["input"]]
+    if neg_traces:
+        lines.extend(["## Traces with Negative Feedback (high priority)", ""])
+        for entry in neg_traces[:10]:
+            lines.append(f'- "{entry["input"][:200]}" → category: {entry["category"]}')
+        lines.append("")
+
+    # Guidance for testgen
+    lines.extend([
+        "## Guidance for Test Generation",
+        "",
+        "Use the above data to generate test cases that:",
+        "1. **Match the real traffic distribution** — generate more tasks for high-traffic categories",
+        "2. **Include actual user phrasing** — real inputs show how users actually communicate (informal, abbreviations, typos)",
+        "3. **Cover real error patterns** — the errors above are genuine failure modes, not imagined scenarios",
+        "4. **Prioritize negative feedback traces** — these are confirmed bad experiences",
+        "5. **Include slow queries as edge cases** — high-latency traces may reveal timeout or complexity issues",
+    ])
+
+    return "\n".join(lines)
+
+
+def generate_json_summary(analysis, project_name):
+    """Generate a JSON summary for programmatic use."""
+    return {
+        "project": project_name,
+        "generated_at": datetime.now(timezone.utc).isoformat(),
+        "stats": analysis["stats"],
+        "categories": analysis["categories"],
+        "error_patterns": analysis["error_patterns"],
+        "sample_inputs": {
+            cat: [e["input"] for e in entries if e["input"]][:10]
+            for cat, entries in analysis["by_category"].items()
+        },
+        "negative_feedback_inputs": [
+            e["input"] for e in analysis["processed"]
+            if e["feedback"] == "negative" and e["input"]
+        ][:20],
+        "slow_queries": [
+            {"input": e["input"][:200], "latency_ms": e["latency_ms"], "category": e["category"]}
+            for e in sorted(analysis["processed"], key=lambda x: -(x["latency_ms"] or 0))
+            if e["latency_ms"] and e["input"]
+        ][:10],
+    }
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Fetch and summarize production LangSmith traces")
+    parser.add_argument("--project", required=True, help="LangSmith project name")
+    parser.add_argument("--api-key-env", default="LANGSMITH_API_KEY",
+                        help="Env var containing API key (default: LANGSMITH_API_KEY)")
+    parser.add_argument("--limit", type=int, default=100, help="Max traces to fetch (default: 100)")
+    parser.add_argument("--output-md", required=True, help="Output path for markdown seed")
+    parser.add_argument("--output-json", required=True, help="Output path for JSON summary")
+    args = parser.parse_args()
+
+    api_key = os.environ.get(args.api_key_env, "")
+    if not api_key:
+        print(f"No API key found in ${args.api_key_env} — cannot fetch production traces", file=sys.stderr)
+        sys.exit(1)
+
+    print(f"Fetching up to {args.limit} traces from LangSmith project '{args.project}'...")
+    runs = fetch_runs(args.project, api_key, args.limit)
+
+    if not runs:
+        print("No traces found. The project may be empty or the name may be wrong.")
+        # Write empty files so downstream doesn't break
+        for path in [args.output_md, args.output_json]:
+            os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
+        with open(args.output_md, "w") as f:
+            f.write(f"# Production Trace Analysis: {args.project}\n\nNo traces found.\n")
+        with open(args.output_json, "w") as f:
+            json.dump({"project": args.project, "stats": {"total_traces": 0}}, f, indent=2)
+        return
+
+    print(f"Fetched {len(runs)} traces. Analyzing...")
+    analysis = analyze_runs(runs)
+
+    if not analysis:
+        print("Analysis failed — no processable traces")
+        return
+
+    # Write markdown seed
+    os.makedirs(os.path.dirname(args.output_md) or ".", exist_ok=True)
+    md = generate_markdown_seed(analysis, args.project)
+    with open(args.output_md, "w") as f:
+        f.write(md)
+
+    # Write JSON summary
+    os.makedirs(os.path.dirname(args.output_json) or ".", exist_ok=True)
+    summary = generate_json_summary(analysis, args.project)
+    with open(args.output_json, "w") as f:
+        json.dump(summary, f, indent=2, ensure_ascii=False)
+
+    stats = analysis["stats"]
+    cats = len(analysis["categories"])
+    errs = stats["with_error"]
+    print("Production seed generated:")
+    print(f"  {stats['total_traces']} traces, {cats} categories, {errs} errors ({stats['error_rate']:.1%})")
+    print(f"  {args.output_md}")
+    print(f"  {args.output_json}")
+
+
+if __name__ == "__main__":
+    main()