harness-evolver 2.6.1 → 2.8.0

This diff shows the published contents of two package versions as they appear in their public registries, and is provided for informational purposes only.
@@ -40,6 +40,24 @@ These insights are generated from LangSmith traces cross-referenced with per-tas
 
  If trace insights are not available, proceed with manual trace analysis as described in Phase 2.
 
+ ## Production Insights
+
+ If `.harness-evolver/production_seed.json` exists in your `<files_to_read>`, it contains **real production data** from the app's LangSmith project:
+
+ - `categories` — real traffic distribution (which domains/routes get the most queries)
+ - `error_patterns` — actual production errors and their frequency
+ - `negative_feedback_inputs` — queries where users gave thumbs-down
+ - `slow_queries` — high-latency queries that may indicate bottlenecks
+ - `sample_inputs` — real user inputs grouped by category
+
+ Use this data to:
+ 1. **Prioritize changes that fix real production failures** over synthetic test failures
+ 2. **Match the real traffic distribution** — if 60% of production queries are domain A, optimize for domain A
+ 3. **Focus on negative feedback patterns** — these are confirmed bad user experiences
+ 4. **Address latency outliers** — slow queries may need different routing, caching, or model selection
+
+ Production data complements trace_insights.json. Trace insights show what happened in *harness evaluation runs*. Production insights show what happens in *real-world usage*.
+
  ## Context7 — Enrich Your Knowledge
 
  You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
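The field list in the Production Insights section above can be made concrete with a short sketch. This is a hypothetical example of consuming such a seed file: the top-level keys come from the documentation above, but every value (category names, counts, inputs) is invented for illustration.

```python
# Hypothetical production_seed.json contents -- keys are documented above,
# values are invented for illustration.
seed = {
    "categories": {"domain_a": 60, "domain_b": 25, "domain_c": 15},
    "error_patterns": [{"error": "ToolTimeout", "count": 7}],
    "negative_feedback_inputs": ["why is my invoice wrong??"],
    "slow_queries": [{"input": "summarize all 2023 reports", "latency_ms": 42000}],
    "sample_inputs": {"domain_a": ["reset my password", "pwd reset plz"]},
}

# A proposer would rank categories by real traffic so that changes target
# the busiest routes first (guideline 2 above).
ranked = sorted(seed["categories"].items(), key=lambda kv: kv[1], reverse=True)
busiest, share = ranked[0]
```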
@@ -36,6 +36,19 @@ Read the harness source code to understand:
  - What are its likely failure modes?
  - Are there any data files (knowledge bases, docs, etc.) that define the domain?
 
+ ### Phase 1.5: Use Production Traces (if available)
+
+ If your prompt contains a `<production_traces>` block, this is **real data from production LangSmith traces**. This is the most valuable signal you have — real user inputs beat synthetic ones.
+
+ When production traces are available:
+ 1. Read the traffic distribution — generate tasks proportional to real usage (if 60% of queries are domain A, 60% of tasks should cover domain A)
+ 2. Use actual user phrasing as inspiration — real inputs show abbreviations, typos, informal language
+ 3. Base edge cases on real error patterns — the errors listed are genuine failures, not imagined scenarios
+ 4. Prioritize negative feedback traces — these are confirmed bad experiences that MUST be covered
+ 5. Include slow queries as edge cases — high-latency traces may reveal timeout or complexity issues
+
+ **Do NOT just copy production inputs verbatim.** Use them as inspiration to generate VARIATIONS that test the same capabilities.
+
  ### Phase 2: Design Test Distribution
 
  Plan 30 test cases with this distribution:
@@ -44,6 +57,8 @@ Plan 30 test cases with this distribution:
  - **20% Cross-Domain** (6 tasks): inputs spanning multiple categories or requiring nuanced judgment
  - **20% Adversarial** (6 tasks): misleading, ambiguous, or designed to expose weaknesses
 
+ If production traces are available, adjust the distribution to match real traffic patterns instead of keeping it uniform.
+
  Ensure all categories/topics from the harness are covered.
 
  ### Phase 3: Generate Tasks
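Point 1 of Phase 1.5 above (tasks proportional to real usage) amounts to a small allocation routine. A minimal sketch, assuming integer traffic percentages; the category names and shares are invented:

```python
# Allocate a fixed task budget proportionally to observed traffic shares.
# Integer division first, then hand rounding leftovers to the largest
# categories so the total still adds up to the budget.
TOTAL_TASKS = 30
traffic_pct = {"domain_a": 60, "domain_b": 25, "domain_c": 15}  # invented shares

counts = {cat: pct * TOTAL_TASKS // 100 for cat, pct in traffic_pct.items()}
leftover = TOTAL_TASKS - sum(counts.values())
for cat, _ in sorted(traffic_pct.items(), key=lambda kv: kv[1], reverse=True)[:leftover]:
    counts[cat] += 1
```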
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
    "name": "harness-evolver",
-   "version": "2.6.1",
+   "version": "2.8.0",
    "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
    "author": "Raphael Valdetaro",
    "license": "MIT",
@@ -34,6 +34,27 @@ For each iteration:
  python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(f'v{s[\"iterations\"]+1:03d}')"
  ```
 
+ ### 1.4. Gather Production Insights (first iteration only)
+
+ On the **first iteration**, if the project has a production LangSmith project configured but no production seed yet, fetch it:
+
+ ```bash
+ PROD_PROJECT=$(python3 -c "
+ import json, os
+ c = json.load(open('.harness-evolver/config.json'))
+ print(c.get('eval', {}).get('production_project', ''))
+ " 2>/dev/null)
+ if [ -n "$PROD_PROJECT" ] && [ ! -f ".harness-evolver/production_seed.json" ] && [ -n "$LANGSMITH_API_KEY" ]; then
+   python3 $TOOLS/seed_from_traces.py \
+     --project "$PROD_PROJECT" \
+     --output-md .harness-evolver/production_seed.md \
+     --output-json .harness-evolver/production_seed.json \
+     --limit 100 2>/dev/null
+ fi
+ ```
+
+ The `production_seed.json` is included in all proposers' `<files_to_read>` so they have real-world context about how the agent is actually used in production.
+
  ### 1.5. Gather LangSmith Traces (MANDATORY after every evaluation)
 
  **Run these commands unconditionally after EVERY evaluation** (including baseline). Do NOT guess project names — discover them.
@@ -255,6 +276,7 @@ Agent(
  - .harness-evolver/langsmith_stats.json (if exists)
  - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/trace_insights.json (if exists)
+ - .harness-evolver/production_seed.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -295,6 +317,7 @@ Agent(
  - .harness-evolver/langsmith_diagnosis.json (if exists)
  - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/trace_insights.json (if exists)
+ - .harness-evolver/production_seed.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -334,6 +357,7 @@ Agent(
  - .harness-evolver/langsmith_diagnosis.json (if exists)
  - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/trace_insights.json (if exists)
+ - .harness-evolver/production_seed.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -377,6 +401,7 @@ Agent(
  - .harness-evolver/harnesses/{best_version}/scores.json
  - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/trace_insights.json (if exists)
+ - .harness-evolver/production_seed.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -438,6 +463,7 @@ Agent(
  - .harness-evolver/harnesses/{best_version}/scores.json
  - .harness-evolver/langsmith_runs.json (if exists)
  - .harness-evolver/trace_insights.json (if exists)
+ - .harness-evolver/production_seed.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -80,11 +80,21 @@ Agent(
  - /home/rp/Desktop/test-crewai/README.md
  </files_to_read>
 
+ <production_traces>
+ {IF .harness-evolver/production_seed.md EXISTS, paste its full contents here.
+ This file contains real production inputs, traffic distribution, error patterns,
+ and user feedback from LangSmith. Use it to generate REALISTIC test cases that
+ match actual usage patterns instead of synthetic ones.
+
+ If the file does not exist, omit this entire block.}
+ </production_traces>
+
  <output>
  Create directory tasks/ (at project root) with 30 files: task_001.json through task_030.json.
  Format: {"id": "task_001", "input": "...", "metadata": {"difficulty": "easy|medium|hard", "type": "standard|edge|cross_domain|adversarial"}}
  No "expected" field needed — the judge subagent will score outputs.
  Distribution: 40% standard, 20% edge, 20% cross-domain, 20% adversarial.
+ If production traces are available, match the real traffic distribution instead of the uniform split.
  </output>
  )
  ```
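The `{IF ... EXISTS}` placeholder in the template above is conditional logic the orchestrator has to perform when assembling the prompt. A minimal sketch of that splice; the helper name and default path are ours, not part of the package:

```python
from pathlib import Path

def production_traces_block(seed_path=".harness-evolver/production_seed.md"):
    """Return a filled <production_traces> block, or an empty string so the
    whole block is omitted when the seed file does not exist."""
    p = Path(seed_path)
    if not p.is_file():
        return ""
    return f"<production_traces>\n{p.read_text()}\n</production_traces>\n"
```

The testgen prompt is then just the base prompt concatenated with `production_traces_block()`.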
@@ -93,16 +103,60 @@ Wait for `## TESTGEN COMPLETE`. If the subagent fails or returns with no tasks,
 
  Print: "Generated {N} test cases from code analysis."
 
+ If `.harness-evolver/production_seed.md` exists, also print:
+ "Tasks enriched with production trace data from LangSmith."
+
  ## Phase 3: Run Init
 
+ First, check if the project has a LangSmith production project configured:
+
+ ```bash
+ # Auto-detect from env vars or .env
+ PROD_PROJECT=$(python3 -c "
+ import os
+ for v in ('LANGCHAIN_PROJECT', 'LANGSMITH_PROJECT'):
+     p = os.environ.get(v, '')
+     if p: print(p); exit()
+ for f in ('.env', '.env.local'):
+     if os.path.exists(f):
+         for line in open(f):
+             line = line.strip()
+             if '=' in line and not line.startswith('#'):
+                 k, _, val = line.partition('=')
+                 if k.strip() in ('LANGCHAIN_PROJECT', 'LANGSMITH_PROJECT'):
+                     print(val.strip().strip('\"').strip(\"'\"))
+                     exit()
+ " 2>/dev/null)
+ ```
+
  ```bash
  python3 $TOOLS/init.py [directory] \
    --harness harness.py --eval eval.py --tasks tasks/ \
-   --tools-dir $TOOLS
+   --tools-dir $TOOLS \
+   ${PROD_PROJECT:+--langsmith-project "$PROD_PROJECT"}
  ```
 
  Add `--harness-config config.json` if a config exists.
 
+ For **LLM-powered agents** that make real API calls (LangGraph, CrewAI, etc.) and take
+ more than 30 seconds per invocation, increase the validation timeout:
+
+ ```bash
+ python3 $TOOLS/init.py [directory] \
+   --harness harness.py --eval eval.py --tasks tasks/ \
+   --tools-dir $TOOLS \
+   --validation-timeout 120
+ ```
+
+ If validation keeps timing out but you've verified the harness works manually, skip it:
+
+ ```bash
+ python3 $TOOLS/init.py [directory] \
+   --harness harness.py --eval eval.py --tasks tasks/ \
+   --tools-dir $TOOLS \
+   --skip-validation
+ ```
+
  ## After Init — Report
 
  - What was detected vs created
@@ -132,3 +186,6 @@ This is advisory only — do not spawn the architect agent.
  - The `expected` field is never shown to the harness — only the eval script sees it.
  - If `.harness-evolver/` already exists, warn before overwriting.
  - If no Python files exist in CWD, the user is probably in the wrong directory.
+ - **Monorepo / venv mismatch**: In monorepos with dedicated venvs per app, the system `python3` may differ from the project's Python version. The harness wrapper should re-exec with the correct venv Python. The tools now use `sys.executable` instead of a hardcoded `python3`.
+ - **Stale site-packages**: If the project uses editable installs (`pip install -e .`), packages in `site-packages/` may have stale copies of data files (e.g. registry YAMLs). Run `uv pip install -e . --force-reinstall --no-deps` to sync.
+ - **Validation timeout**: LLM agents making real API calls typically take 15-60s per invocation. Use `--validation-timeout 120` or `--skip-validation` to handle this.
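The monorepo note above says the harness wrapper should re-exec with the correct venv Python. One way to sketch that pattern; the venv path below is an invented example, not something the package ships:

```python
import os
import sys

# Invented example path -- adjust to the app's actual venv.
VENV_PYTHON = os.path.join("apps", "my-agent", ".venv", "bin", "python")

def needs_reexec(current=None, venv=VENV_PYTHON):
    """True when the venv interpreter exists and we are not already running it."""
    current = current or sys.executable
    return os.path.isfile(venv) and os.path.realpath(current) != os.path.realpath(venv)

if needs_reexec():
    # Replace this process with the venv interpreter, keeping script and args.
    os.execv(VENV_PYTHON, [VENV_PYTHON, *sys.argv])
```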
@@ -472,12 +472,60 @@ def analyze_scores(summary_path):
 
  # --- Main ---
 
+ def analyze_multiple(file_paths):
+     """Analyze multiple Python files and merge their signals.
+
+     Useful in monorepo setups where the harness is a thin wrapper that
+     delegates to the actual agent code. Pass the harness AND the main
+     agent source files for a comprehensive topology classification.
+     """
+     merged = {
+         "llm_call_count": 0,
+         "has_loop_around_llm": False,
+         "has_tool_definitions": False,
+         "has_retrieval": False,
+         "has_graph_framework": False,
+         "has_parallel_execution": False,
+         "has_error_handling": False,
+         "code_lines": 0,
+         "function_count": 0,
+         "class_count": 0,
+         "files_analyzed": [],
+     }
+
+     for path in file_paths:
+         if not os.path.isfile(path):
+             continue
+         try:
+             signals = analyze_code(path)
+         except Exception:
+             continue
+
+         merged["llm_call_count"] += signals.get("llm_call_count", 0)
+         merged["code_lines"] += signals.get("code_lines", 0)
+         merged["function_count"] += signals.get("function_count", 0)
+         merged["class_count"] += signals.get("class_count", 0)
+         merged["files_analyzed"].append(os.path.basename(path))
+
+         for bool_key in ["has_loop_around_llm", "has_tool_definitions", "has_retrieval",
+                          "has_graph_framework", "has_parallel_execution", "has_error_handling"]:
+             if signals.get(bool_key):
+                 merged[bool_key] = True
+
+     merged["estimated_topology"] = _estimate_topology(merged)
+     return merged
+
+
  def main():
      parser = argparse.ArgumentParser(
          description="Analyze harness architecture and produce signals for the architect agent",
-         usage="analyze_architecture.py --harness PATH [--traces-dir PATH] [--summary PATH] [-o output.json]",
+         usage="analyze_architecture.py --harness PATH [--source-files PATH ...] "
+               "[--traces-dir PATH] [--summary PATH] [-o output.json]",
      )
      parser.add_argument("--harness", required=True, help="Path to harness Python file")
+     parser.add_argument("--source-files", nargs="*", default=None,
+                         help="Additional source files to analyze (e.g. the actual agent code). "
+                              "Useful when the harness is a thin wrapper around a larger system.")
      parser.add_argument("--traces-dir", default=None, help="Path to traces directory")
      parser.add_argument("--summary", default=None, help="Path to summary.json")
      parser.add_argument("-o", "--output", default=None, help="Output JSON path")
@@ -487,8 +535,14 @@ def main():
          print(json.dumps({"error": f"Harness file not found: {args.harness}"}))
          sys.exit(1)
 
+     if args.source_files:
+         all_files = [args.harness] + [f for f in args.source_files if os.path.isfile(f)]
+         code_signals = analyze_multiple(all_files)
+     else:
+         code_signals = analyze_code(args.harness)
+
      result = {
-         "code_signals": analyze_code(args.harness),
+         "code_signals": code_signals,
          "trace_signals": None,
          "score_signals": None,
      }
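The merge rule inside `analyze_multiple` (numeric signals add up across files, boolean capability flags OR together) can be seen in isolation; the per-file dicts below are invented stand-ins for `analyze_code()` output:

```python
signals_per_file = [
    {"llm_call_count": 2, "code_lines": 120, "has_retrieval": False, "has_tool_definitions": True},
    {"llm_call_count": 1, "code_lines": 300, "has_retrieval": True, "has_tool_definitions": False},
]

merged = {"llm_call_count": 0, "code_lines": 0, "has_retrieval": False, "has_tool_definitions": False}
for signals in signals_per_file:
    # Counters accumulate across files...
    merged["llm_call_count"] += signals.get("llm_call_count", 0)
    merged["code_lines"] += signals.get("code_lines", 0)
    # ...while capability flags only ever flip to True.
    for flag in ("has_retrieval", "has_tool_definitions"):
        if signals.get(flag):
            merged[flag] = True
```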
package/tools/evaluate.py CHANGED
@@ -2,7 +2,7 @@
  """Evaluation orchestrator for Harness Evolver.
 
  Commands:
-     validate --harness PATH [--config PATH]
+     validate --harness PATH [--config PATH] [--timeout SECONDS]
      run --harness PATH --tasks-dir PATH --eval PATH --traces-dir PATH --scores PATH
          [--config PATH] [--timeout SECONDS]
 
@@ -20,9 +20,23 @@ import tempfile
  import time
 
 
+ def _resolve_python():
+     """Resolve the Python interpreter to use for subprocesses.
+
+     Prefers the current interpreter (sys.executable) over a hardcoded 'python3'.
+     This is critical in monorepo setups where the harness may need a specific
+     venv Python (e.g. Python 3.12) while the system 'python3' is a different
+     version (e.g. 3.14) with incompatible site-packages.
+     """
+     exe = sys.executable
+     if exe and os.path.isfile(exe):
+         return exe
+     return "python3"
+
+
  def _run_harness_on_task(harness, config, task_input_path, output_path, task_traces_dir, timeout, env=None):
      """Run the harness on a single task. Returns (success, elapsed_ms, stdout, stderr)."""
-     cmd = ["python3", harness, "--input", task_input_path, "--output", output_path]
+     cmd = [_resolve_python(), harness, "--input", task_input_path, "--output", output_path]
      if task_traces_dir:
          extra_dir = os.path.join(task_traces_dir, "extra")
          os.makedirs(extra_dir, exist_ok=True)
@@ -48,6 +62,7 @@ def _run_harness_on_task(harness, config, task_input_path, output_path, task_tra
  def cmd_validate(args):
      harness = args.harness
      config = getattr(args, "config", None)
+     timeout = getattr(args, "timeout", 30) or 30
 
      if not os.path.exists(harness):
          print(f"FAIL: harness not found: {harness}", file=sys.stderr)
@@ -61,11 +76,17 @@ def cmd_validate(args):
          json.dump(dummy_task, f)
 
      success, elapsed, stdout, stderr = _run_harness_on_task(
-         harness, config, input_path, output_path, None, timeout=30,
+         harness, config, input_path, output_path, None, timeout=timeout,
      )
 
      if not success:
-         print(f"FAIL: harness exited with error.\nstderr: {stderr}", file=sys.stderr)
+         hint = ""
+         if "TIMEOUT" in stderr:
+             hint = (f"\nHint: validation timed out after {timeout}s. "
+                     "For LLM-powered agents that make real API calls, "
+                     "use --timeout to increase the limit: "
+                     f"evaluate.py validate --harness {harness} --timeout 120")
+         print(f"FAIL: harness exited with error.\nstderr: {stderr}{hint}", file=sys.stderr)
          sys.exit(1)
 
      if not os.path.exists(output_path):
@@ -171,7 +192,7 @@ def cmd_run(args):
          f.write("\n".join(all_stderr))
 
      eval_cmd = [
-         "python3", eval_script,
+         _resolve_python(), eval_script,
          "--results-dir", results_dir,
          "--tasks-dir", tasks_dir,
          "--scores", scores_path,
@@ -195,6 +216,9 @@ def main():
      p_val = sub.add_parser("validate")
      p_val.add_argument("--harness", required=True)
      p_val.add_argument("--config", default=None)
+     p_val.add_argument("--timeout", type=int, default=30,
+                        help="Validation timeout in seconds (default: 30). "
+                             "Increase for LLM-powered agents that make real API calls.")
 
      p_run = sub.add_parser("run")
      p_run.add_argument("--harness", required=True)
package/tools/init.py CHANGED
@@ -124,6 +124,40 @@ def _detect_langsmith():
      return {"enabled": False}
 
 
+ def _detect_langsmith_project(search_dir="."):
+     """Auto-detect the app's existing LangSmith project name.
+
+     Checks (in order):
+     1. LANGCHAIN_PROJECT env var (standard LangChain convention)
+     2. LANGSMITH_PROJECT env var (alternative)
+     3. .env file in the project directory
+     """
+     for var in ("LANGCHAIN_PROJECT", "LANGSMITH_PROJECT"):
+         project = os.environ.get(var)
+         if project:
+             return project
+
+     # Parse .env file
+     for env_name in (".env", ".env.local"):
+         env_path = os.path.join(search_dir, env_name)
+         if os.path.exists(env_path):
+             try:
+                 with open(env_path) as f:
+                     for line in f:
+                         line = line.strip()
+                         if line.startswith("#") or "=" not in line:
+                             continue
+                         key, _, val = line.partition("=")
+                         key = key.strip()
+                         val = val.strip().strip("'\"")
+                         if key in ("LANGCHAIN_PROJECT", "LANGSMITH_PROJECT") and val:
+                             return val
+             except OSError:
+                 pass
+
+     return None
+
+
  def _check_langsmith_cli():
      """Check if langsmith-cli is installed."""
      try:
@@ -134,6 +168,19 @@ def _check_langsmith_cli():
      return False
 
 
+ def _resolve_python():
+     """Resolve the Python interpreter for subprocesses.
+
+     Uses the current interpreter (sys.executable) instead of a hardcoded 'python3'.
+     This prevents version mismatches in monorepo setups where the harness may
+     need a specific venv Python different from the system python3.
+     """
+     exe = sys.executable
+     if exe and os.path.isfile(exe):
+         return exe
+     return "python3"
+
+
  def _detect_stack(harness_path):
      """Detect technology stack from harness imports."""
      detect_stack_py = os.path.join(os.path.dirname(__file__), "detect_stack.py")
@@ -141,7 +188,7 @@ def _detect_stack(harness_path):
          return {}
      try:
          r = subprocess.run(
-             ["python3", detect_stack_py, harness_path],
+             [_resolve_python(), detect_stack_py, harness_path],
              capture_output=True, text=True, timeout=30,
          )
          if r.returncode == 0 and r.stdout.strip():
@@ -183,6 +230,15 @@ def main():
      parser.add_argument("--base-dir", default=None, help="Path for .harness-evolver/")
      parser.add_argument("--harness-config", default=None, help="Path to harness config.json")
      parser.add_argument("--tools-dir", default=None, help="Path to tools directory")
+     parser.add_argument("--validation-timeout", type=int, default=30,
+                         help="Timeout for harness validation in seconds (default: 30). "
+                              "Increase for LLM-powered agents that make real API calls.")
+     parser.add_argument("--skip-validation", action="store_true",
+                         help="Skip harness validation step. Use when you know the harness "
+                              "works but validation times out (e.g. real LLM agent calls).")
+     parser.add_argument("--langsmith-project", default=None,
+                         help="Existing LangSmith project name with production traces. "
+                              "Auto-detected from LANGCHAIN_PROJECT / LANGSMITH_PROJECT env vars or .env file.")
      args = parser.parse_args()
 
      # Auto-detect missing args
@@ -261,6 +317,7 @@ def main():
              "args": ["--results-dir", "{results_dir}", "--tasks-dir", "{tasks_dir}",
                       "--scores", "{scores}"],
              "langsmith": _detect_langsmith(),
+             "production_project": args.langsmith_project or _detect_langsmith_project(search_dir),
          },
          "evolution": {
              "max_iterations": 10,
@@ -309,7 +366,7 @@ def main():
      if os.path.exists(detect_stack_py):
          try:
              r = subprocess.run(
-                 ["python3", detect_stack_py, harness_dir],
+                 [_resolve_python(), detect_stack_py, harness_dir],
                  capture_output=True, text=True, timeout=30,
              )
              if r.returncode == 0 and r.stdout.strip():
@@ -338,7 +395,7 @@ def main():
      if os.path.exists(analyze_py):
          try:
              r = subprocess.run(
-                 ["python3", analyze_py, "--harness", args.harness],
+                 [_resolve_python(), analyze_py, "--harness", args.harness],
                  capture_output=True, text=True, timeout=30,
              )
              if r.returncode == 0 and r.stdout.strip():
@@ -356,31 +413,65 @@ def main():
      except Exception:
          pass
 
+     # 4.5 Fetch production traces seed (if LangSmith production project detected)
+     prod_project = config["eval"].get("production_project")
+     if prod_project and os.environ.get("LANGSMITH_API_KEY"):
+         seed_py = os.path.join(tools, "seed_from_traces.py")
+         if os.path.exists(seed_py):
+             print(f"Fetching production traces from LangSmith project '{prod_project}'...")
+             try:
+                 r = subprocess.run(
+                     [_resolve_python(), seed_py,
+                      "--project", prod_project,
+                      "--output-md", os.path.join(base, "production_seed.md"),
+                      "--output-json", os.path.join(base, "production_seed.json"),
+                      "--limit", "100"],
+                     capture_output=True, text=True, timeout=60,
+                 )
+                 if r.returncode == 0:
+                     print(r.stdout.strip())
+                 else:
+                     print(f"  Could not fetch production traces: {r.stderr.strip()[:200]}")
+             except Exception as e:
+                 print(f"  Production trace fetch failed: {e}")
+     elif prod_project:
+         print(f"Production LangSmith project detected: {prod_project}")
+         print("  Set LANGSMITH_API_KEY to auto-fetch production traces during init.")
+
      # 5. Validate baseline harness
-     print("Validating baseline harness...")
-     val_args = ["python3", evaluate_py, "validate",
-                 "--harness", os.path.join(base, "baseline", "harness.py")]
      config_path = os.path.join(base, "baseline", "config.json")
-     if os.path.exists(config_path):
-         val_args.extend(["--config", config_path])
-     r = subprocess.run(val_args, capture_output=True, text=True)
-     if r.returncode != 0:
-         print(f"FAIL: baseline harness validation failed.\n{r.stderr}", file=sys.stderr)
-         sys.exit(1)
-     print(r.stdout.strip())
+     if args.skip_validation:
+         print("Skipping baseline validation (--skip-validation).")
+     else:
+         print(f"Validating baseline harness (timeout: {args.validation_timeout}s)...")
+         val_args = [_resolve_python(), evaluate_py, "validate",
+                     "--harness", os.path.join(base, "baseline", "harness.py"),
+                     "--timeout", str(args.validation_timeout)]
+         if os.path.exists(config_path):
+             val_args.extend(["--config", config_path])
+         r = subprocess.run(val_args, capture_output=True, text=True)
+         if r.returncode != 0:
+             hint = ""
+             if "TIMEOUT" in r.stderr:
+                 hint = (f"\n\nHint: The harness timed out after {args.validation_timeout}s. "
+                         "This is common for LLM-powered agents that make real API calls.\n"
+                         "Try: --validation-timeout 120 (or --skip-validation to bypass)")
+             print(f"FAIL: baseline harness validation failed.\n{r.stderr}{hint}", file=sys.stderr)
+             sys.exit(1)
+         print(r.stdout.strip())
 
      # 6. Evaluate baseline
      print("Evaluating baseline harness...")
      baseline_traces = tempfile.mkdtemp()
      baseline_scores = os.path.join(base, "baseline_scores.json")
      eval_args = [
-         "python3", evaluate_py, "run",
+         _resolve_python(), evaluate_py, "run",
          "--harness", os.path.join(base, "baseline", "harness.py"),
          "--tasks-dir", os.path.join(base, "eval", "tasks"),
          "--eval", os.path.join(base, "eval", "eval.py"),
          "--traces-dir", baseline_traces,
          "--scores", baseline_scores,
-         "--timeout", "60",
+         "--timeout", str(max(args.validation_timeout, 60)),
      ]
      if os.path.exists(config_path):
          eval_args.extend(["--config", config_path])
@@ -399,7 +490,7 @@ def main():
      # 7. Initialize state with baseline score
      print(f"Baseline score: {baseline_score:.2f}")
      r = subprocess.run(
-         ["python3", state_py, "init",
+         [_resolve_python(), state_py, "init",
           "--base-dir", base,
           "--baseline-score", str(baseline_score)],
          capture_output=True, text=True,
@@ -0,0 +1,454 @@
1
+ #!/usr/bin/env python3
2
+ """Fetch and summarize production LangSmith traces for Harness Evolver.
3
+
4
+ Queries the LangSmith REST API directly (urllib, stdlib-only) to fetch
5
+ production traces and produce:
6
+ 1. A markdown seed file for the testgen agent (production_seed.md)
7
+ 2. A JSON summary for programmatic use (production_seed.json)
8
+
9
+ Usage:
10
+ python3 seed_from_traces.py \
11
+ --project ceppem-langgraph \
12
+ --output-md .harness-evolver/production_seed.md \
13
+ --output-json .harness-evolver/production_seed.json \
14
+ [--api-key-env LANGSMITH_API_KEY] \
15
+ [--limit 100]
16
+
17
+ Stdlib-only. No external dependencies (no langsmith-cli needed).
18
+ """
19
+
20
+ import argparse
21
+ import json
22
+ import os
23
+ import sys
24
+ import urllib.parse
25
+ import urllib.request
26
+ from collections import Counter
27
+ from datetime import datetime, timezone
28
+
29
+ LANGSMITH_API_BASE = "https://api.smith.langchain.com/api/v1"
30
+
31
+
32
+ def langsmith_request(endpoint, api_key, method="GET", body=None, params=None):
33
+ """Make a request to the LangSmith REST API."""
34
+ url = f"{LANGSMITH_API_BASE}/{endpoint}"
35
+ if params:
36
+ url += "?" + urllib.parse.urlencode(params)
37
+
38
+ headers = {
39
+ "x-api-key": api_key,
40
+ "Accept": "application/json",
41
+ }
42
+
43
+ data = None
44
+ if body is not None:
45
+ headers["Content-Type"] = "application/json"
46
+ data = json.dumps(body).encode("utf-8")
47
+
48
+ req = urllib.request.Request(url, data=data, headers=headers, method=method)
49
+ try:
50
+ with urllib.request.urlopen(req, timeout=30) as resp:
51
+ return json.loads(resp.read())
52
+ except urllib.error.HTTPError as e:
53
+ body_text = ""
54
+ try:
55
+ body_text = e.read().decode("utf-8", errors="replace")[:500]
56
+ except Exception:
57
+ pass
58
+ print(f"LangSmith API error {e.code}: {body_text}", file=sys.stderr)
59
+ return None
60
+ except Exception as e:
61
+ print(f"LangSmith API request failed: {e}", file=sys.stderr)
62
+ return None
63
+
64
+
65
+ def fetch_runs(project_name, api_key, limit=100):
66
+ """Fetch recent root runs from a LangSmith project."""
67
+ # Try POST /runs/query first (newer API)
68
+ body = {
69
+ "project_name": project_name,
70
+ "is_root": True,
71
+ "limit": limit,
72
+ }
73
+ result = langsmith_request("runs/query", api_key, method="POST", body=body)
74
+ if result and isinstance(result, dict):
75
+ return result.get("runs", result.get("results", []))
76
+ if result and isinstance(result, list):
77
+ return result
78
+
79
+ # Fallback: GET /runs with query params
80
+ params = {
81
+ "project_name": project_name,
82
+ "is_root": "true",
83
+ "limit": str(limit),
84
+ }
85
+ result = langsmith_request("runs", api_key, params=params)
86
+ if result and isinstance(result, list):
87
+ return result
88
+ if result and isinstance(result, dict):
89
+ return result.get("runs", result.get("results", []))
90
+
91
+ return []
92
+
93
+
94
+ def extract_input(run):
95
+ """Extract user input from a run's inputs field."""
96
+ inputs = run.get("inputs", {})
97
+ if not inputs:
98
+ return None
99
+ if isinstance(inputs, str):
100
+ return inputs
101
+
102
+ # Direct field
103
+ for key in ("input", "question", "query", "prompt", "text", "user_input"):
104
+ if key in inputs and isinstance(inputs[key], str):
105
+ return inputs[key]
106
+
107
+ # LangChain messages format
108
+ messages = inputs.get("messages") or inputs.get("input")
109
+ if isinstance(messages, list):
110
+ if messages and isinstance(messages[0], list):
111
+ messages = messages[0]
112
+ for msg in messages:
113
+ if isinstance(msg, dict):
114
+ if msg.get("type") in ("human", "HumanMessage") or msg.get("role") == "user":
115
+ content = msg.get("content", "")
116
+ if isinstance(content, str) and content:
117
+ return content
118
+ if isinstance(content, list):
119
+ for part in content:
120
+ if isinstance(part, dict) and part.get("type") == "text":
121
+ return part.get("text", "")
122
+ elif isinstance(msg, str) and msg:
123
+ return msg
124
+
125
+ return None
126
+
127
+
128
+ def extract_output(run):
129
+ """Extract the output/response from a run."""
130
+ outputs = run.get("outputs", {})
131
+ if not outputs:
132
+ return None
133
+ if isinstance(outputs, str):
134
+ return outputs
135
+
136
+ for key in ("output", "answer", "result", "response", "text"):
137
+ if key in outputs and isinstance(outputs[key], str):
138
+ return outputs[key]
139
+
140
+ # LangChain messages format
141
+ messages = outputs.get("messages") or outputs.get("output")
142
+ if isinstance(messages, list):
143
+ if messages and isinstance(messages[0], list):
144
+ messages = messages[0]
145
+ for msg in reversed(messages):
146
+ if isinstance(msg, dict):
147
+ if msg.get("type") in ("ai", "AIMessage", "assistant") or msg.get("role") == "assistant":
148
+ content = msg.get("content", "")
149
+ if isinstance(content, str) and content:
150
+ return content
151
+ elif isinstance(msg, str) and msg:
152
+ return msg
153
+
154
+ return None
155
+
156
+
157
+ def get_feedback(run):
+     """Extract feedback from a run."""
+     fb = run.get("feedback_stats") or {}
+     if isinstance(fb, dict):
+         pos = fb.get("thumbs_up", 0) or fb.get("positive", 0) or 0
+         neg = fb.get("thumbs_down", 0) or fb.get("negative", 0) or 0
+         if neg > 0:
+             return "negative"
+         if pos > 0:
+             return "positive"
+     return None
+
+
+ def categorize_run(run):
+     """Categorize a run by its name/type."""
+     name = run.get("name", "unknown")
+     # Use the top-level run name as the category
+     return name
+
+
+ def analyze_runs(runs):
+     """Analyze a batch of runs and produce structured insights."""
+     if not runs:
+         return None
+
+     processed = []
+     categories = Counter()
+     errors = []
+     latencies = []
+     token_counts = []
+     feedbacks = {"positive": 0, "negative": 0, "none": 0}
+
+     for run in runs:
+         user_input = extract_input(run)
+         output = extract_output(run)
+         error = run.get("error")
+         tokens = run.get("total_tokens") or 0
+         latency_ms = None
+         feedback = get_feedback(run)
+
+         # Calculate latency from start/end times
+         start = run.get("start_time") or run.get("start_dt")
+         end = run.get("end_time") or run.get("end_dt")
+         if isinstance(start, str) and isinstance(end, str):
+             try:
+                 from datetime import datetime as dt
+                 s = dt.fromisoformat(start.replace("Z", "+00:00"))
+                 e = dt.fromisoformat(end.replace("Z", "+00:00"))
+                 latency_ms = int((e - s).total_seconds() * 1000)
+             except Exception:
+                 pass
+         elif run.get("latency"):
+             latency_ms = int(run["latency"] * 1000) if isinstance(run["latency"], float) else run["latency"]
+
+         category = categorize_run(run)
+         categories[category] += 1
+
+         entry = {
+             "input": (user_input or "")[:500],
+             "output": (output or "")[:300],
+             "category": category,
+             "tokens": tokens,
+             "latency_ms": latency_ms,
+             "error": (error or "")[:200] if error else None,
+             "feedback": feedback,
+         }
+         processed.append(entry)
+
+         if error:
+             errors.append({"error": error[:200], "input": (user_input or "")[:200], "category": category})
+         if latency_ms:
+             latencies.append(latency_ms)
+         if tokens:
+             token_counts.append(tokens)
+
+         if feedback == "positive":
+             feedbacks["positive"] += 1
+         elif feedback == "negative":
+             feedbacks["negative"] += 1
+         else:
+             feedbacks["none"] += 1
+
+     # Compute statistics
+     stats = {
+         "total_traces": len(runs),
+         "with_input": sum(1 for p in processed if p["input"]),
+         "with_error": len(errors),
+         "error_rate": len(errors) / max(len(runs), 1),
+         "feedback": feedbacks,
+     }
+
+     if latencies:
+         latencies.sort()
+         stats["latency"] = {
+             "avg_ms": int(sum(latencies) / len(latencies)),
+             "p50_ms": latencies[len(latencies) // 2],
+             "p95_ms": latencies[int(len(latencies) * 0.95)] if len(latencies) >= 20 else latencies[-1],
+             "max_ms": latencies[-1],
+         }
+
+     if token_counts:
+         stats["tokens"] = {
+             "avg": int(sum(token_counts) / len(token_counts)),
+             "max": max(token_counts),
+             "total": sum(token_counts),
+         }
+
+     # Group by category
+     by_category = {}
+     for entry in processed:
+         cat = entry["category"]
+         by_category.setdefault(cat, []).append(entry)
+
+     # Error patterns
+     error_patterns = Counter()
+     for e in errors:
+         # Normalize each error to its first 60 chars
+         pattern = e["error"][:60]
+         error_patterns[pattern] += 1
+
+     return {
+         "stats": stats,
+         "categories": dict(categories.most_common()),
+         "by_category": by_category,
+         "error_patterns": dict(error_patterns.most_common(10)),
+         "errors": errors[:20],
+         "processed": processed,
+     }
+
+
+ def generate_markdown_seed(analysis, project_name):
+     """Generate a markdown seed file for the testgen agent."""
+     stats = analysis["stats"]
+     lines = [
+         f"# Production Trace Analysis: {project_name}",
+         "",
+         f"*{stats['total_traces']} traces analyzed*",
+         "",
+         "## Key Metrics",
+         "",
+         f"- **Error rate**: {stats['error_rate']:.1%}",
+     ]
+
+     if "latency" in stats:
+         lat = stats["latency"]
+         lines.append(f"- **Latency**: {lat['avg_ms']}ms avg, {lat['p50_ms']}ms p50, {lat['p95_ms']}ms p95")
+
+     if "tokens" in stats:
+         tok = stats["tokens"]
+         lines.append(f"- **Tokens**: {tok['avg']} avg, {tok['max']} max")
+
+     fb = stats["feedback"]
+     total_fb = fb["positive"] + fb["negative"]
+     if total_fb > 0:
+         lines.append(f"- **User feedback**: {fb['positive']}/{total_fb} positive ({fb['positive']/total_fb:.0%})")
+
+     # Traffic distribution
+     lines.extend(["", "## Traffic Distribution", ""])
+     total = stats["total_traces"]
+     for cat, count in sorted(analysis["categories"].items(), key=lambda x: -x[1]):
+         pct = count / max(total, 1) * 100
+         lines.append(f"- **{cat}**: {count} traces ({pct:.0f}%)")
+
+     # Sample inputs by category
+     lines.extend(["", "## Sample Inputs by Category", ""])
+     for cat, entries in sorted(analysis["by_category"].items(), key=lambda x: -len(x[1])):
+         lines.append(f"### {cat} ({len(entries)} traces)")
+         lines.append("")
+         # Show up to 8 sample inputs per category, skipping entries with no input
+         shown = 0
+         for entry in entries:
+             if shown >= 8:
+                 break
+             if not entry["input"]:
+                 continue
+             status = "ERROR" if entry["error"] else "ok"
+             tok_str = f", {entry['tokens']}tok" if entry["tokens"] else ""
+             lat_str = f", {entry['latency_ms']}ms" if entry["latency_ms"] else ""
+             fb_str = ""
+             if entry["feedback"] == "negative":
+                 fb_str = " [NEGATIVE FEEDBACK]"
+             elif entry["feedback"] == "positive":
+                 fb_str = " [+]"
+             lines.append(f'- "{entry["input"][:150]}" ({status}{tok_str}{lat_str}){fb_str}')
+             shown += 1
+         lines.append("")
+
+     # Error patterns
+     if analysis["error_patterns"]:
+         lines.extend(["## Error Patterns", ""])
+         for pattern, count in analysis["error_patterns"].items():
+             lines.append(f"- **{pattern}**: {count} occurrences")
+         lines.append("")
+
+     # Negative feedback traces
+     neg_traces = [e for e in analysis["processed"] if e["feedback"] == "negative" and e["input"]]
+     if neg_traces:
+         lines.extend(["## Traces with Negative Feedback (high priority)", ""])
+         for entry in neg_traces[:10]:
+             lines.append(f'- "{entry["input"][:200]}" → category: {entry["category"]}')
+         lines.append("")
+
+     # Guidance for testgen
+     lines.extend([
+         "## Guidance for Test Generation",
+         "",
+         "Use the above data to generate test cases that:",
+         "1. **Match the real traffic distribution** — generate more tasks for high-traffic categories",
+         "2. **Include actual user phrasing** — real inputs show how users actually communicate (informal, abbreviations, typos)",
+         "3. **Cover real error patterns** — the errors above are genuine failure modes, not imagined scenarios",
+         "4. **Prioritize negative feedback traces** — these are confirmed bad experiences",
+         "5. **Include slow queries as edge cases** — high-latency traces may reveal timeout or complexity issues",
+     ])
+
+     return "\n".join(lines)
+
+
+ def generate_json_summary(analysis, project_name):
+     """Generate a JSON summary for programmatic use."""
+     return {
+         "project": project_name,
+         "generated_at": datetime.now(timezone.utc).isoformat(),
+         "stats": analysis["stats"],
+         "categories": analysis["categories"],
+         "error_patterns": analysis["error_patterns"],
+         "sample_inputs": {
+             cat: [e["input"] for e in entries if e["input"]][:10]
+             for cat, entries in analysis["by_category"].items()
+         },
+         "negative_feedback_inputs": [
+             e["input"] for e in analysis["processed"]
+             if e["feedback"] == "negative" and e["input"]
+         ][:20],
+         "slow_queries": [
+             {"input": e["input"][:200], "latency_ms": e["latency_ms"], "category": e["category"]}
+             for e in sorted(analysis["processed"], key=lambda x: -(x["latency_ms"] or 0))
+             if e["latency_ms"] and e["input"]
+         ][:10],
+     }
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Fetch and summarize production LangSmith traces")
+     parser.add_argument("--project", required=True, help="LangSmith project name")
+     parser.add_argument("--api-key-env", default="LANGSMITH_API_KEY",
+                         help="Env var containing API key (default: LANGSMITH_API_KEY)")
+     parser.add_argument("--limit", type=int, default=100, help="Max traces to fetch (default: 100)")
+     parser.add_argument("--output-md", required=True, help="Output path for markdown seed")
+     parser.add_argument("--output-json", required=True, help="Output path for JSON summary")
+     args = parser.parse_args()
+
+     api_key = os.environ.get(args.api_key_env, "")
+     if not api_key:
+         print(f"No API key found in ${args.api_key_env} — cannot fetch production traces", file=sys.stderr)
+         sys.exit(1)
+
+     print(f"Fetching up to {args.limit} traces from LangSmith project '{args.project}'...")
+     runs = fetch_runs(args.project, api_key, args.limit)
+
+     if not runs:
+         print("No traces found. The project may be empty or the name may be wrong.")
+         # Write empty files so downstream steps don't break
+         for path in (args.output_md, args.output_json):
+             os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
+         with open(args.output_md, "w") as f:
+             f.write(f"# Production Trace Analysis: {args.project}\n\nNo traces found.\n")
+         with open(args.output_json, "w") as f:
+             json.dump({"project": args.project, "stats": {"total_traces": 0}}, f, indent=2)
+         return
+
+     print(f"Fetched {len(runs)} traces. Analyzing...")
+     analysis = analyze_runs(runs)
+
+     if not analysis:
+         print("Analysis failed — no processable traces")
+         return
+
+     # Write markdown seed
+     os.makedirs(os.path.dirname(args.output_md) or ".", exist_ok=True)
+     md = generate_markdown_seed(analysis, args.project)
+     with open(args.output_md, "w") as f:
+         f.write(md)
+
+     # Write JSON summary
+     os.makedirs(os.path.dirname(args.output_json) or ".", exist_ok=True)
+     summary = generate_json_summary(analysis, args.project)
+     with open(args.output_json, "w") as f:
+         json.dump(summary, f, indent=2, ensure_ascii=False)
+
+     stats = analysis["stats"]
+     cats = len(analysis["categories"])
+     errs = stats["with_error"]
+     print("Production seed generated:")
+     print(f"  {stats['total_traces']} traces, {cats} categories, {errs} errors ({stats['error_rate']:.1%})")
+     print(f"  {args.output_md}")
+     print(f"  {args.output_json}")
+
+
+ if __name__ == "__main__":
+     main()
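
The start/end-time latency calculation used inside `analyze_runs` can be exercised in isolation. A minimal sketch with made-up ISO-8601 timestamps (the `Z` suffix is normalized to `+00:00` before parsing, exactly as the script does):

```python
from datetime import datetime

# Hypothetical run timestamps, 2.5 seconds apart
start = "2024-05-01T12:00:00Z"
end = "2024-05-01T12:00:02.500000Z"

# Normalize the trailing "Z" so datetime.fromisoformat accepts the string
s = datetime.fromisoformat(start.replace("Z", "+00:00"))
e = datetime.fromisoformat(end.replace("Z", "+00:00"))
latency_ms = int((e - s).total_seconds() * 1000)
print(latency_ms)  # 2500
```

The `.replace("Z", "+00:00")` step matters on Python versions before 3.11, where `datetime.fromisoformat` does not accept the `Z` suffix directly.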