harness-evolver 2.5.1 → 2.6.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -46,12 +46,20 @@ claude
 
  <table>
  <tr>
- <td><b>5 Proposers</b></td>
- <td>Each iteration spawns 5 parallel agents with different strategies: exploit (targeted fix), explore (bold rewrite), crossover (combine two parents), prompt specialist, retrieval specialist. Best candidate wins.</td>
+ <td><b>5 Adaptive Proposers</b></td>
+ <td>Each iteration spawns 5 parallel agents: exploit (targeted fix), explore (bold rewrite), crossover (combine two parents), and 2 failure-focused agents that target the weakest task clusters. Strategies adapt every iteration based on actual per-task scores — no fixed specialists.</td>
  </tr>
  <tr>
- <td><b>Full Traces</b></td>
- <td>Every harness run captures stdout, stderr, timing, and per-task I/O. LangSmith auto-tracing for LangChain/LangGraph agents. Proposers read actual LLM prompts and responses.</td>
+ <td><b>Trace Insights</b></td>
+ <td>Every harness run captures stdout, stderr, timing, and per-task I/O. LangSmith auto-tracing for LangChain/LangGraph agents. Traces are systematically clustered by error pattern, token usage, and response type — proposers receive structured diagnostic data, not raw logs.</td>
+ </tr>
+ <tr>
+ <td><b>Quality-Diversity Selection</b></td>
+ <td>Not winner-take-all. Tracks per-task champions — a candidate that loses overall but excels at specific tasks is preserved as the next crossover parent. The archive never discards variants.</td>
+ </tr>
+ <tr>
+ <td><b>Durable Test Gates</b></td>
+ <td>When the loop fixes a failure, regression tasks are automatically generated to lock in the improvement. The test suite grows over iterations — fixed bugs can never silently return.</td>
  </tr>
  <tr>
  <td><b>Critic</b></td>
@@ -74,13 +82,14 @@ claude
  | Command | What it does |
  |---|---|
  | `/harness-evolver:init` | Scan project, create harness/eval/tasks, run baseline |
- | `/harness-evolver:evolve` | Run the autonomous optimization loop (5 parallel proposers) |
+ | `/harness-evolver:evolve` | Run the autonomous optimization loop (5 adaptive proposers) |
  | `/harness-evolver:status` | Show progress, scores, stagnation detection |
  | `/harness-evolver:compare` | Diff two versions with per-task analysis |
  | `/harness-evolver:diagnose` | Deep trace analysis of a specific version |
  | `/harness-evolver:deploy` | Promote the best harness back to your project |
  | `/harness-evolver:architect` | Analyze and recommend optimal agent topology |
  | `/harness-evolver:critic` | Evaluate eval quality and detect gaming |
+ | `/harness-evolver:import-traces` | Pull production LangSmith traces as eval tasks |
 
  ---
 
@@ -139,16 +148,20 @@ Works with any language, any framework, any domain. If your project doesn't have
  ```
  /harness-evolver:evolve
 
- ├─ 1. Gather LangSmith traces (processed into readable format)
- ├─ 2. Spawn 5 proposers in parallel (exploit/explore/crossover/prompt/retrieval)
- ├─ 3. Validate all candidates
- ├─ 4. Evaluate all candidates
+ ├─ 1. Get next version
+ ├─ 1.5 Gather LangSmith traces (processed into readable format)
+ ├─ 1.6 Generate Trace Insights (cluster errors, analyze tokens, cross-ref scores)
+ ├─ 1.8 Analyze per-task failures (cluster by category for adaptive briefings)
+ ├─ 2. Spawn 5 proposers in parallel (exploit / explore / crossover / 2× failure-targeted)
+ ├─ 3. Validate all candidates
+ ├─ 4. Evaluate all candidates
  ├─ 4.5 Judge (if using LLM-as-judge eval)
- ├─ 5. Select winner (highest combined_score)
- ├─ 6. Report results
+ ├─ 5. Select winner + track per-task champion
+ ├─ 5.5 Test suite growth (generate regression tasks for fixed failures)
+ ├─ 6. Report results
  ├─ 6.5 Auto-trigger Critic (if score jumped >0.3 or reached 1.0 too fast)
- ├─ 7. Auto-trigger Architect (if regression or stagnation)
- └─ 8. Check stop conditions (target reached, N iterations, stagnation post-architect)
+ ├─ 7. Auto-trigger Architect (if regression or stagnation)
+ └─ 8. Check stop conditions (target reached, N iterations, stagnation post-architect)
  ```
 
  ---
@@ -175,9 +188,12 @@ The plugin auto-detects available keys. No key needed for the included example.
  |---|---|---|---|---|
  | **Format** | Paper artifact | Framework (Docker) | Plugin (passive) | **Plugin (active)** |
  | **Search** | Code-space | Code-space | Prompt-space | **Code-space** |
- | **Candidates/iter** | 1 | 1 | N/A | **5 parallel** |
+ | **Candidates/iter** | 1 | 1 | N/A | **5 parallel (adaptive)** |
+ | **Selection** | Single best | Single best | N/A | **Quality-diversity (per-task)** |
  | **Auto-critique** | No | No | No | **Yes (critic + judge)** |
  | **Architecture** | Fixed | Fixed | N/A | **Auto-recommended** |
+ | **Trace analysis** | Manual | No | No | **Systematic (clustering + insights)** |
+ | **Test growth** | No | No | No | **Yes (durable regression gates)** |
  | **LangSmith** | No | No | No | **Yes** |
  | **Context7** | No | No | No | **Yes** |
  | **Zero-config** | No | No | No | **Yes** |
@@ -26,6 +26,20 @@ Your prompt contains a `<strategy>` block defining your approach. Follow it:
 
  If no strategy block is present, default to exploitation (conservative improvement).
 
+ ## Trace Insights
+
+ If `.harness-evolver/trace_insights.json` exists in your `<files_to_read>`, use it to guide your diagnosis:
+
+ 1. Check `top_issues` first — these are the highest-impact problems sorted by severity
+ 2. Check `hypotheses` for data-driven theories about failure causes
+ 3. Use `error_clusters` to understand which error patterns affect which runs
+ 4. The `token_analysis` and `token_score_correlation` sections show if verbosity correlates with quality
+ 5. `score_cross_ref.failure_categories` maps failure patterns to task categories
+
+ These insights are generated from LangSmith traces cross-referenced with per-task scores — they are **data, not guesses**. Prioritize addressing issues marked severity `"high"` over `"medium"` or `"low"`.
+
+ If trace insights are not available, proceed with manual trace analysis as described in Phase 2.
+
  ## Context7 — Enrich Your Knowledge
 
  You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
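For consumers of `trace_insights.json`, the severity triage described in the Trace Insights section can be sketched like this. The field names (`top_issues`, `error_clusters`, `hypotheses`, `token_analysis`) and the `"high"`/`"medium"`/`"low"` severity values come from the docs above; the per-issue fields other than `severity` are illustrative assumptions, not the tool's actual schema:

```python
# Hypothetical trace_insights.json content. Top-level keys follow the docs;
# the individual issue entries are invented for illustration.
insights = {
    "error_clusters": [{"pattern": "Timeout after 30s", "count": 4}],
    "token_analysis": {"high": {"count": 2, "avg_tokens": 3100.0}},
    "hypotheses": ["Long prompts correlate with timeouts"],
    "top_issues": [
        {"issue": "verbose prompts", "severity": "medium"},
        {"issue": "timeouts on retrieval tasks", "severity": "high"},
        {"issue": "minor formatting drift", "severity": "low"},
    ],
}

# Address severity "high" before "medium" or "low", per the guidance above.
rank = {"high": 0, "medium": 1, "low": 2}
ordered = sorted(insights["top_issues"], key=lambda i: rank[i["severity"]])
print([i["severity"] for i in ordered])  # → ['high', 'medium', 'low']
```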
package/package.json CHANGED
@@ -1,6 +1,6 @@
  {
  "name": "harness-evolver",
- "version": "2.5.1",
+ "version": "2.6.0",
  "description": "Meta-Harness-style autonomous harness optimization for Claude Code",
  "author": "Raphael Valdetaro",
  "license": "MIT",
@@ -132,6 +132,32 @@ The resulting `langsmith_runs.json` has clean, readable entries:
 
  These files are included in the proposer's `<files_to_read>` so it has readable trace data for diagnosis.
 
+ ### 1.6. Generate Trace Insights (systematic analysis)
+
+ If LangSmith traces were gathered, run systematic analysis to cluster errors, analyze token usage, and cross-reference with scores:
+
+ ```bash
+ if [ -f ".harness-evolver/langsmith_runs.json" ]; then
+   BEST=$(python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(s['best']['version'])")
+   SCORES_PATH=".harness-evolver/harnesses/$BEST/scores.json"
+   [ ! -f "$SCORES_PATH" ] && SCORES_PATH=".harness-evolver/baseline/scores.json"
+   python3 $TOOLS/trace_insights.py \
+     --langsmith-runs .harness-evolver/langsmith_runs.json \
+     --langsmith-stats .harness-evolver/langsmith_stats.json \
+     --scores "$SCORES_PATH" \
+     --tasks-dir .harness-evolver/eval/tasks/ \
+     --output .harness-evolver/trace_insights.json 2>/dev/null
+ fi
+ ```
+
+ The resulting `trace_insights.json` contains:
+ - `error_clusters`: grouped error patterns with counts
+ - `token_analysis`: score distribution by token usage bucket (low/medium/high)
+ - `hypotheses`: data-driven theories about failure causes
+ - `top_issues`: highest-impact problems sorted by severity
+
+ This file is included in all proposers' `<files_to_read>` so they have structured diagnostic data.
+
  ### 1.8. Analyze Per-Task Failures (adaptive briefings for Candidates D and E)
 
  Before spawning proposers, analyze which tasks are failing and cluster them:
@@ -228,6 +254,7 @@ Agent(
  - .harness-evolver/langsmith_diagnosis.json (if exists)
  - .harness-evolver/langsmith_stats.json (if exists)
  - .harness-evolver/langsmith_runs.json (if exists)
+ - .harness-evolver/trace_insights.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -267,6 +294,7 @@ Agent(
  - .harness-evolver/harnesses/{explorer_parent}/scores.json
  - .harness-evolver/langsmith_diagnosis.json (if exists)
  - .harness-evolver/langsmith_runs.json (if exists)
+ - .harness-evolver/trace_insights.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -305,6 +333,7 @@ Agent(
  - .harness-evolver/harnesses/{parent_b}/scores.json
  - .harness-evolver/langsmith_diagnosis.json (if exists)
  - .harness-evolver/langsmith_runs.json (if exists)
+ - .harness-evolver/trace_insights.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -347,6 +376,7 @@ Agent(
  - .harness-evolver/harnesses/{best_version}/harness.py
  - .harness-evolver/harnesses/{best_version}/scores.json
  - .harness-evolver/langsmith_runs.json (if exists)
+ - .harness-evolver/trace_insights.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -407,6 +437,7 @@ Agent(
  - .harness-evolver/harnesses/{best_version}/harness.py
  - .harness-evolver/harnesses/{best_version}/scores.json
  - .harness-evolver/langsmith_runs.json (if exists)
+ - .harness-evolver/trace_insights.json (if exists)
  - .harness-evolver/architecture.json (if exists)
  </files_to_read>
 
@@ -580,6 +611,34 @@ Iteration {i}/{N} — {num_candidates} candidates evaluated:
 
  Keep losing candidates in their directories (they're part of the archive — never discard, per DGM).
 
+ ### 5.5. Test Suite Growth (Durable Regression Gates)
+
+ After the winner is promoted, check if any previously-failing tasks are now passing.
+ Generate regression tasks to lock in improvements and prevent future regressions:
+
+ ```bash
+ PREV_BEST=$(python3 -c "
+ import json
+ s = json.load(open('.harness-evolver/summary.json'))
+ versions = s.get('versions', [])
+ print(versions[-2]['version'] if len(versions) >= 2 else '')
+ " 2>/dev/null)
+ if [ -n "$PREV_BEST" ] && [ -f ".harness-evolver/harnesses/$PREV_BEST/scores.json" ]; then
+   python3 $TOOLS/test_growth.py \
+     --current-scores .harness-evolver/harnesses/{version}/scores.json \
+     --previous-scores ".harness-evolver/harnesses/$PREV_BEST/scores.json" \
+     --tasks-dir .harness-evolver/eval/tasks/ \
+     --output-dir .harness-evolver/eval/tasks/ \
+     --max-total-tasks 60 2>/dev/null
+ fi
+ ```
+
+ If new tasks were added, print: "Added {N} regression tasks to lock in improvements on: {task_ids}"
+
+ This is the "durable test gates" pattern: every fixed failure becomes a permanent regression test.
+ New tasks are tagged with `metadata.type: "regression"` and `metadata.source: "regression"` so they
+ can be distinguished from original tasks. The test suite only grows — regression tasks are never removed.
+
  ### 6. Report
 
  Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
@@ -0,0 +1,102 @@
+ ---
+ name: harness-evolver:import-traces
+ description: "Use when the user wants to import real production traces from LangSmith as test tasks, convert traces to eval tasks, enrich their eval set with real-world data, or pull production data into their harness evaluation."
+ argument-hint: "[--project NAME] [--limit N]"
+ allowed-tools: [Read, Write, Edit, Bash, Glob, Grep]
+ ---
+
+ # /harness-evolver:import-traces
+
+ Import production traces from LangSmith and convert them into eval tasks. This enriches the test suite with real-world inputs, prioritizing traces with negative user feedback.
+
+ ## Prerequisites
+
+ - `.harness-evolver/` must exist. If not, tell user to run `harness-evolver:init` first.
+ - `langsmith-cli` must be available. Check:
+
+ ```bash
+ which langsmith-cli 2>/dev/null
+ ```
+
+ If not found: "Install langsmith-cli first: `uv tool install langsmith-cli && langsmith-cli auth login`"
+
+ ## Resolve Tool Path
+
+ ```bash
+ TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
+ ```
+
+ ## Parse Arguments
+
+ - `--project NAME` — LangSmith project name (if not provided, discover interactively)
+ - `--limit N` — max traces to import (default: 20)
+
+ ## Phase 1: Discover Projects
+
+ If `--project` not provided, list available projects:
+
+ ```bash
+ langsmith-cli --json projects list --limit 20 2>/dev/null
+ ```
+
+ Show the user a list of projects with run counts. Let them pick one, or use the most recent.
+
+ If `--project` is provided, use it directly.
+
+ ## Phase 2: Fetch Traces
+
+ ```bash
+ langsmith-cli --json runs list \
+   --project "{project_name}" \
+   --limit {limit} \
+   --fields id,name,inputs,outputs,error,feedback_stats,total_tokens \
+   > /tmp/harness_import_traces.json 2>/dev/null
+ ```
+
+ Check the output has data:
+
+ ```bash
+ python3 -c "import json; data=json.load(open('/tmp/harness_import_traces.json')); print(f'{len(data)} traces fetched')"
+ ```
+
+ If no traces found, tell user the project may be empty or the name may be wrong.
+
+ ## Phase 3: Convert to Tasks
+
+ ```bash
+ python3 $TOOLS/import_traces.py \
+   --traces-json /tmp/harness_import_traces.json \
+   --output-dir .harness-evolver/eval/tasks/ \
+   --prefix imported \
+   --max-tasks {limit}
+ ```
+
+ ## Phase 4: Report
+
+ Read the tool output and report:
+ - How many traces were imported
+ - How many had negative feedback (high priority)
+ - How many were skipped (no extractable input, duplicates)
+ - Total tasks now in eval set
+
+ ```bash
+ ls .harness-evolver/eval/tasks/*.json | wc -l
+ ```
+
+ Print:
+ ```
+ Imported {N} production traces as eval tasks.
+ {M} with negative user feedback (high priority)
+ {K} skipped (no input or duplicates)
+ Total eval tasks: {total}
+
+ Next: run `harness-evolver:evolve` to optimize against real-world inputs.
+ ```
+
+ ## Gotchas
+
+ - Traces with no extractable user input are skipped (e.g., system-only runs)
+ - Duplicate traces (same run ID) are automatically skipped
+ - Imported tasks are tagged with `metadata.source: "imported"` and `metadata.type: "production"`
+ - Tasks with negative feedback get `metadata.user_feedback: "negative"` — the proposer should prioritize these
+ - The `metadata.langsmith_run_id` field links back to the original trace for debugging
+ - Cleanup: `rm /tmp/harness_import_traces.json` after import
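For orientation, a hypothetical example of one imported task file, using the metadata tags from the Gotchas list (the input text, category, and run ID are invented; only the key names are documented):

```python
import json

# Illustrative shape of one imported task file. The metadata keys
# (source, type, langsmith_run_id, had_error, user_feedback) follow the
# docs above; all values here are made up.
task = {
    "id": "imported_a1b2c3d4",
    "input": "Why was my refund rejected?",
    "metadata": {
        "difficulty": "easy",
        "category": "support-agent",
        "type": "production",
        "source": "imported",
        "langsmith_run_id": "00000000-0000-0000-0000-000000000000",
        "had_error": False,
        "user_feedback": "negative",
    },
}
print(json.dumps(task, indent=2))
```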
@@ -0,0 +1,229 @@
+ #!/usr/bin/env python3
+ """Import LangSmith Traces as Eval Tasks for Harness Evolver.
+
+ Transforms LangSmith trace JSON (from langsmith-cli) into task JSON files
+ for the evaluation set. Prioritizes traces with negative feedback.
+
+ Usage:
+     python3 import_traces.py \
+         --traces-json /tmp/langsmith_traces.json \
+         --output-dir .harness-evolver/eval/tasks/ \
+         --prefix imported \
+         [--max-tasks 30]
+
+ Stdlib-only. No external dependencies.
+ """
+
+ import argparse
+ import hashlib
+ import json
+ import os
+ import re
+
+
+ def load_json(path):
+     """Load JSON file, return None if missing or invalid."""
+     if not path or not os.path.exists(path):
+         return None
+     try:
+         with open(path) as f:
+             return json.load(f)
+     except (json.JSONDecodeError, OSError):
+         return None
+
+
+ def extract_input_from_trace(run):
+     """Extract the user input from a LangSmith run's inputs field.
+
+     Handles multiple LangChain serialization formats:
+     - Direct {"input": "..."} field
+     - {"messages": [[HumanMessage, ...]]} format
+     - {"question": "..."} or {"query": "..."} fields
+     """
+     inputs = run.get("inputs", {})
+     if not inputs:
+         return None
+
+     if isinstance(inputs, str):
+         return inputs
+
+     # Direct input field
+     for key in ("input", "question", "query", "prompt", "text", "user_input"):
+         if key in inputs and isinstance(inputs[key], str):
+             return inputs[key]
+
+     # LangChain messages format
+     messages = inputs.get("messages") or inputs.get("input")
+     if isinstance(messages, list):
+         # Might be [[msg1, msg2]] (batched) or [msg1, msg2]
+         if messages and isinstance(messages[0], list):
+             messages = messages[0]
+         for msg in messages:
+             if isinstance(msg, dict):
+                 # {"type": "human", "content": "..."}
+                 if msg.get("type") in ("human", "HumanMessage") or msg.get("role") == "user":
+                     content = msg.get("content", "")
+                     if isinstance(content, str) and content:
+                         return content
+                     if isinstance(content, list):
+                         # Multi-modal: [{"type": "text", "text": "..."}]
+                         for part in content:
+                             if isinstance(part, dict) and part.get("type") == "text":
+                                 return part.get("text", "")
+             elif isinstance(msg, str) and msg:
+                 return msg
+
+     # Fallback: stringify the whole inputs
+     flat = json.dumps(inputs)
+     if len(flat) > 20:  # Only if there's meaningful content
+         return flat[:2000]
+
+     return None
+
+
+ def extract_feedback(run):
+     """Extract user feedback from a LangSmith run."""
+     feedback = run.get("feedback_stats") or run.get("feedback") or {}
+     if not feedback:
+         return None
+
+     # feedback_stats format: {"thumbs_up": N, "thumbs_down": N}
+     if isinstance(feedback, dict):
+         up = feedback.get("thumbs_up", 0) or feedback.get("positive", 0)
+         down = feedback.get("thumbs_down", 0) or feedback.get("negative", 0)
+         if down > 0:
+             return "negative"
+         if up > 0:
+             return "positive"
+     return None
+
+
+ def infer_difficulty(text):
+     """Infer difficulty from input characteristics."""
+     if not text:
+         return "medium"
+     length = len(text)
+     # Count question marks, clauses, etc.
+     questions = text.count("?")
+     sentences = len(re.split(r"[.!?]+", text))
+
+     if length < 50 and questions <= 1:
+         return "easy"
+     if length > 500 or questions > 2 or sentences > 5:
+         return "hard"
+     return "medium"
+
+
+ def short_id(run_id):
+     """Create a short deterministic ID from a full run ID."""
+     return hashlib.md5(str(run_id).encode()).hexdigest()[:8]
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Import LangSmith traces as eval tasks")
+     parser.add_argument("--traces-json", required=True, help="Path to langsmith-cli JSON output")
+     parser.add_argument("--output-dir", required=True, help="Directory to write task JSON files")
+     parser.add_argument("--prefix", default="imported", help="Prefix for task IDs (default: imported)")
+     parser.add_argument("--max-tasks", type=int, default=30, help="Max tasks to import (default: 30)")
+     parser.add_argument("--prioritize-negative", dest="prioritize_negative", action="store_true",
+                         default=True, help="Import negative-feedback traces first (default: true)")
+     parser.add_argument("--no-prioritize-negative", dest="prioritize_negative", action="store_false",
+                         help="Disable negative-feedback prioritization")
+     args = parser.parse_args()
+
+     traces = load_json(args.traces_json)
+     if not traces:
+         print("No traces found or invalid JSON — nothing to import")
+         return
+
+     if isinstance(traces, dict):
+         # Might be wrapped in {"runs": [...]}
+         traces = traces.get("runs", traces.get("data", [traces]))
+
+     if not isinstance(traces, list):
+         print("Unexpected traces format — expected a JSON array")
+         return
+
+     # Sort: negative feedback first, then errors, then the rest
+     if args.prioritize_negative:
+         def priority(run):
+             fb = extract_feedback(run)
+             has_error = bool(run.get("error"))
+             if fb == "negative":
+                 return 0
+             if has_error:
+                 return 1
+             return 2
+         traces.sort(key=priority)
+
+     os.makedirs(args.output_dir, exist_ok=True)
+
+     # Check for existing imported tasks to avoid duplicates
+     existing_run_ids = set()
+     for fname in os.listdir(args.output_dir):
+         if fname.endswith(".json"):
+             task = load_json(os.path.join(args.output_dir, fname))
+             if task and task.get("metadata", {}).get("langsmith_run_id"):
+                 existing_run_ids.add(task["metadata"]["langsmith_run_id"])
+
+     imported = 0
+     skipped_no_input = 0
+     skipped_duplicate = 0
+     negative_count = 0
+
+     for run in traces:
+         if imported >= args.max_tasks:
+             break
+
+         run_id = str(run.get("id", ""))
+         if run_id in existing_run_ids:
+             skipped_duplicate += 1
+             continue
+
+         user_input = extract_input_from_trace(run)
+         if not user_input or len(user_input.strip()) < 5:
+             skipped_no_input += 1
+             continue
+
+         feedback = extract_feedback(run)
+         has_error = bool(run.get("error"))
+         task_id = f"{args.prefix}_{short_id(run_id)}"
+
+         task = {
+             "id": task_id,
+             "input": user_input.strip(),
+             "metadata": {
+                 "difficulty": infer_difficulty(user_input),
+                 "category": run.get("name", "unknown"),
+                 "type": "production",
+                 "source": "imported",
+                 "langsmith_run_id": run_id,
+                 "had_error": has_error,
+                 "user_feedback": feedback,
+             },
+         }
+
+         out_path = os.path.join(args.output_dir, f"{task_id}.json")
+         with open(out_path, "w") as f:
+             json.dump(task, f, indent=2)
+
+         imported += 1
+         if feedback == "negative":
+             negative_count += 1
+
+     summary = {
+         "imported": imported,
+         "negative_feedback": negative_count,
+         "skipped_no_input": skipped_no_input,
+         "skipped_duplicate": skipped_duplicate,
+         "total_traces": len(traces),
+     }
+     print(json.dumps(summary))
+     print(f"Imported {imported} production traces as tasks ({negative_count} with negative feedback)")
+     if skipped_duplicate:
+         print(f"  Skipped {skipped_duplicate} already-imported traces")
+     if skipped_no_input:
+         print(f"  Skipped {skipped_no_input} traces with no extractable input")
+
+
+ if __name__ == "__main__":
+     main()
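A condensed illustration of the messages-format branch of `extract_input_from_trace` above, showing how the batched `{"messages": [[...]]}` shape resolves to the human turn. The helper name is ours, not part of the tool:

```python
# Condensed sketch of the messages-format handling: LangChain runs often
# serialize inputs as {"messages": [[...]]}, where the human turn carries
# the user input.
def first_human_content(inputs):
    messages = inputs.get("messages") or []
    if messages and isinstance(messages[0], list):  # batched form [[msg, ...]]
        messages = messages[0]
    for msg in messages:
        if isinstance(msg, dict) and (msg.get("type") == "human" or msg.get("role") == "user"):
            content = msg.get("content", "")
            if isinstance(content, str) and content:
                return content
    return None

run_inputs = {"messages": [[{"type": "human", "content": "Summarize this ticket"}]]}
print(first_human_content(run_inputs))  # → Summarize this ticket
```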
@@ -0,0 +1,230 @@
+ #!/usr/bin/env python3
+ """Test Suite Growth for Harness Evolver.
+
+ Generates regression test tasks when previously-failing tasks are now passing.
+ Creates mechanical variations of fixed tasks to prevent future regressions.
+
+ Usage:
+     python3 test_growth.py \
+         --current-scores .harness-evolver/harnesses/v003/scores.json \
+         --previous-scores .harness-evolver/harnesses/v002/scores.json \
+         --tasks-dir .harness-evolver/eval/tasks/ \
+         --output-dir .harness-evolver/eval/tasks/ \
+         --max-total-tasks 60
+
+ Stdlib-only. No external dependencies.
+ """
+
+ import argparse
+ import hashlib
+ import json
+ import os
+ import re
+
+
+ def load_json(path):
+     """Load JSON file, return None if missing or invalid."""
+     if not path or not os.path.exists(path):
+         return None
+     try:
+         with open(path) as f:
+             return json.load(f)
+     except (json.JSONDecodeError, OSError):
+         return None
+
+
+ def find_fixed_tasks(current_scores, previous_scores, fix_threshold_before=0.5, fix_threshold_after=0.8):
+     """Find tasks that improved significantly: score < before_threshold → > after_threshold."""
+     current_per_task = current_scores.get("per_task", {})
+     previous_per_task = previous_scores.get("per_task", {})
+
+     fixed = []
+     for tid, curr_data in current_per_task.items():
+         if not isinstance(curr_data, dict):
+             continue
+         curr_score = curr_data.get("score", 0)
+         prev_data = previous_per_task.get(tid, {})
+         prev_score = prev_data.get("score", 0) if isinstance(prev_data, dict) else 0
+
+         if prev_score < fix_threshold_before and curr_score > fix_threshold_after:
+             fixed.append({
+                 "task_id": tid,
+                 "previous_score": prev_score,
+                 "current_score": curr_score,
+                 "improvement": curr_score - prev_score,
+             })
+
+     # Sort by improvement (biggest fixes first)
+     fixed.sort(key=lambda x: -x["improvement"])
+     return fixed
+
+
+ def count_existing_tasks(directory):
+     """Count existing task JSON files in directory."""
+     if not os.path.isdir(directory):
+         return 0
+     return sum(1 for f in os.listdir(directory) if f.endswith(".json"))
+
+
+ def next_regression_id(output_dir):
+     """Find the next available regression task ID."""
+     existing = set()
+     if os.path.isdir(output_dir):
+         for fname in os.listdir(output_dir):
+             m = re.match(r"regression_(\d+)\.json", fname)
+             if m:
+                 existing.add(int(m.group(1)))
+     n = 1
+     while n in existing:
+         n += 1
+     return n
+
+
+ def generate_variations(original_input, task_id):
+     """Generate 2-3 mechanical variations of an input string.
+
+     Uses simple string transforms — no LLM needed:
+     - Rephrase by reordering
+     - Add qualifying clause
+     - Simplify to minimal form
+     """
+     variations = []
+     text = original_input.strip()
+
+     # Variation 1: Add a qualifying clause
+     qualifiers = [
+         "Please be specific and detailed in your response.",
+         "Consider edge cases in your answer.",
+         "Provide a concise but thorough response.",
+         "Think step by step before answering.",
+     ]
+     # Pick qualifier from a hash of task_id for determinism (built-in
+     # hash() is salted per process, so use md5 instead)
+     qi = int(hashlib.md5(task_id.encode()).hexdigest(), 16) % len(qualifiers)
+     v1 = f"{text}\n\n{qualifiers[qi]}"
+     variations.append(("qualified", v1))
+
+     # Variation 2: Reorder sentences if multiple exist
+     sentences = re.split(r"(?<=[.!?])\s+", text)
+     if len(sentences) >= 2:
+         # Rotate the first sentence to the end
+         reordered = sentences[1:] + sentences[:1]
+         v2 = " ".join(reordered)
+         variations.append(("reordered", v2))
+     else:
+         # If single sentence, prepend "Given the context: "
+         v2 = f"Given the following context, {text[0].lower()}{text[1:]}" if len(text) > 1 else text
+         variations.append(("rephrased", v2))
+
+     # Variation 3: Minimal version — strip to core question
+     # Remove qualifiers, keep just the main ask
+     minimal = text
+     # Strip common padding phrases
+     for prefix in ["Please ", "Can you ", "Could you ", "I would like you to ", "I need you to "]:
+         if minimal.startswith(prefix):
+             minimal = minimal[len(prefix):]
+             minimal = minimal[0].upper() + minimal[1:] if minimal else minimal
+             break
+     if minimal != text:
+         variations.append(("minimal", minimal))
+
+     return variations
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="Generate regression test tasks from score improvements")
+     parser.add_argument("--current-scores", required=True, help="Path to current version's scores.json")
+     parser.add_argument("--previous-scores", required=True, help="Path to previous version's scores.json")
+     parser.add_argument("--tasks-dir", required=True, help="Path to eval/tasks/ (to read originals)")
+     parser.add_argument("--output-dir", required=True, help="Directory to write regression tasks")
+     parser.add_argument("--max-total-tasks", type=int, default=60, help="Cap total tasks in output-dir (default 60)")
+     args = parser.parse_args()
+
+     current = load_json(args.current_scores)
+     previous = load_json(args.previous_scores)
+
+     if not current or not previous:
+         print("Missing scores files — skipping test growth")
+         return
+
+     # Find tasks that were fixed
+     fixed = find_fixed_tasks(current, previous)
+     if not fixed:
+         print("No tasks improved significantly — no regression tasks needed")
+         return
+
+     # Check capacity
+     existing_count = count_existing_tasks(args.output_dir)
+     available_slots = args.max_total_tasks - existing_count
+     if available_slots <= 0:
+         print(f"Task suite already at capacity ({existing_count}/{args.max_total_tasks}) — skipping growth")
+         return
+
+     os.makedirs(args.output_dir, exist_ok=True)
+     regression_id = next_regression_id(args.output_dir)
+     tasks_added = 0
+     fixed_ids = []
+
+     for fix_info in fixed:
+         if tasks_added >= available_slots:
+             break
+
+         tid = fix_info["task_id"]
+         # Load original task
+         task_path = os.path.join(args.tasks_dir, f"{tid}.json")
+         original = load_json(task_path)
+         if not original:
+             continue
+
+         original_input = original.get("input", "")
+         if not original_input:
+             continue
+
+         original_meta = original.get("metadata", {})
+         variations = generate_variations(original_input, tid)
+
+         for var_type, var_input in variations:
+             if tasks_added >= available_slots:
+                 break
+
+             reg_id = f"regression_{regression_id:03d}"
+             task = {
+                 "id": reg_id,
+                 "input": var_input,
+                 "metadata": {
+                     "difficulty": original_meta.get("difficulty", "medium"),
+                     "category": original_meta.get("category", "unknown"),
+                     "type": "regression",
+                     "source": "regression",
+                     "regression_for": tid,
+                     "variation": var_type,
+                     "previous_score": fix_info["previous_score"],
+                     "fixed_at_score": fix_info["current_score"],
+                 },
+             }
+
+             # Include expected if original had it
+             if "expected" in original:
+                 task["expected"] = original["expected"]
+
+             out_path = os.path.join(args.output_dir, f"{reg_id}.json")
+             with open(out_path, "w") as f:
+                 json.dump(task, f, indent=2)
+
+             tasks_added += 1
+             regression_id += 1
+
+         fixed_ids.append(tid)
+
+     # Output summary
+     summary = {
+         "tasks_added": tasks_added,
+         "fixed_tasks": fixed_ids,
+         "total_tasks_now": existing_count + tasks_added,
+         "max_total_tasks": args.max_total_tasks,
+     }
+     print(json.dumps(summary))
+     print(f"Added {tasks_added} regression tasks to lock in improvements on: {', '.join(fixed_ids)}")
+
+
+ if __name__ == "__main__":
+     main()
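The fix-detection rule in `find_fixed_tasks` above boils down to a two-threshold predicate: a task counts as "fixed" when its score rises from below 0.5 to above 0.8. A minimal sketch:

```python
# Two-threshold fix detection, as used by find_fixed_tasks above:
# the task must have been failing badly (score < 0.5) AND now be
# clearly passing (score > 0.8).
def is_fixed(prev_score, curr_score, before=0.5, after=0.8):
    return prev_score < before and curr_score > after

assert is_fixed(0.2, 0.9)            # failing -> passing: fixed
assert not is_fixed(0.6, 0.9)        # was not failing badly enough before
assert not is_fixed(0.2, 0.7)        # did not clear the after-threshold
```

The two thresholds avoid generating regression tasks for noisy mid-range score wobbles; only decisive failing-to-passing transitions grow the suite.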
@@ -0,0 +1,350 @@
+ #!/usr/bin/env python3
+ """Trace Insights Generator for Harness Evolver.
+
+ Analyzes LangSmith traces + per-task scores to produce structured insights.
+ Clusters errors, analyzes token usage, cross-references with scores,
+ and generates data-driven hypotheses.
+
+ Usage:
+     python3 trace_insights.py \
+         --langsmith-runs .harness-evolver/langsmith_runs.json \
+         --scores .harness-evolver/harnesses/v002/scores.json \
+         --tasks-dir .harness-evolver/eval/tasks/ \
+         --output .harness-evolver/trace_insights.json \
+         [--langsmith-stats .harness-evolver/langsmith_stats.json]
+
+ Stdlib-only. No external dependencies.
+ """
+
+ import argparse
+ import json
+ import os
+ from datetime import datetime, timezone
+
+
+ def load_json(path):
+     """Load a JSON file, returning None if it is missing or invalid."""
+     if not path or not os.path.exists(path):
+         return None
+     try:
+         with open(path) as f:
+             return json.load(f)
+     except (json.JSONDecodeError, OSError):
+         return None
+
+
+ def cluster_errors(runs):
+     """Group runs by error pattern (first 80 chars of the error message)."""
+     clusters = {}
+     for run in runs:
+         error = run.get("error")
+         if not error:
+             continue
+         # Normalize: strip whitespace, then take the first 80 chars
+         pattern = error.strip()[:80]
+         clusters.setdefault(pattern, []).append(run)
+     return [
+         {"pattern": pattern, "count": len(runs_list), "run_names": [r.get("name", "?") for r in runs_list[:5]]}
+         for pattern, runs_list in sorted(clusters.items(), key=lambda x: -len(x[1]))
+     ]
+
+
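A quick standalone sketch of the clustering behavior, inlining the same prefix-and-sort logic as `cluster_errors` (the run names and error strings here are made up):

```python
# Synthetic runs: two share an error prefix, one differs, one has no error.
runs = [
    {"name": "task_001", "error": "TimeoutError: request timed out"},
    {"name": "task_002", "error": "TimeoutError: request timed out"},
    {"name": "task_003", "error": "KeyError: 'tool_calls'"},
    {"name": "task_004"},  # no error field, so it is skipped
]

clusters = {}
for run in runs:
    error = run.get("error")
    if not error:
        continue
    pattern = error.strip()[:80]  # same 80-char prefix key as cluster_errors
    clusters.setdefault(pattern, []).append(run)

# Most frequent pattern first, mirroring cluster_errors' sort order
ranked = sorted(clusters.items(), key=lambda x: -len(x[1]))
print(ranked[0][0], len(ranked[0][1]))  # TimeoutError: request timed out 2
```

Truncating to 80 characters is a coarse normalization: errors that differ only after the prefix collapse into one cluster, which is usually what you want for stack-trace-style messages.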
+ def analyze_tokens(runs):
+     """Bucket runs by token usage: low (<500), medium (500-1999), high (2000+)."""
+     buckets = {"low": [], "medium": [], "high": []}
+     for run in runs:
+         tokens = run.get("tokens") or run.get("total_tokens") or 0
+         if tokens < 500:
+             buckets["low"].append(run)
+         elif tokens < 2000:
+             buckets["medium"].append(run)
+         else:
+             buckets["high"].append(run)
+     return {
+         name: {"count": len(items), "avg_tokens": sum((r.get("tokens") or r.get("total_tokens") or 0) for r in items) / max(len(items), 1)}
+         for name, items in buckets.items()
+     }
+
+
+ def analyze_responses(runs):
+     """Bucket runs by response length: empty, short (<100), normal (100-999), long (1000+)."""
+     buckets = {"empty": [], "short": [], "normal": [], "long": []}
+     for run in runs:
+         resp = run.get("llm_response") or run.get("output") or ""
+         length = len(resp)
+         if length == 0:
+             buckets["empty"].append(run)
+         elif length < 100:
+             buckets["short"].append(run)
+         elif length < 1000:
+             buckets["normal"].append(run)
+         else:
+             buckets["long"].append(run)
+     return {
+         name: {"count": len(items)}
+         for name, items in buckets.items()
+         if items
+     }
+
+
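Both bucketing helpers use fixed thresholds, and the boundary values matter (2000 tokens is "high", 1000 characters is "long"). A small sketch of the boundary behavior, with the threshold logic copied out of the two functions above:

```python
def bucket_tokens(tokens):
    # Thresholds as in analyze_tokens: <500 low, <2000 medium, else high
    if tokens < 500:
        return "low"
    elif tokens < 2000:
        return "medium"
    return "high"

def bucket_response(length):
    # Thresholds as in analyze_responses: 0 empty, <100 short, <1000 normal, else long
    if length == 0:
        return "empty"
    elif length < 100:
        return "short"
    elif length < 1000:
        return "normal"
    return "long"

print([bucket_tokens(t) for t in (0, 499, 500, 1999, 2000)])
print([bucket_response(n) for n in (0, 99, 100, 999, 1000)])
```

Runs with no token field fall into "low" (the `or 0` default), so a spike in the low bucket can also mean missing instrumentation rather than short prompts.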
+ def cross_reference_scores(runs, scores_data, tasks_dir):
+     """Cross-reference trace patterns with per-task scores."""
+     per_task = scores_data.get("per_task", {}) if scores_data else {}
+     if not per_task:
+         return {}
+
+     # Load task metadata for category mapping
+     task_meta = {}
+     if tasks_dir and os.path.isdir(tasks_dir):
+         for fname in os.listdir(tasks_dir):
+             if fname.endswith(".json"):
+                 path = os.path.join(tasks_dir, fname)
+                 try:
+                     with open(path) as f:
+                         t = json.load(f)
+                     tid = t.get("id", fname.replace(".json", ""))
+                     task_meta[tid] = t.get("metadata", {})
+                 except (json.JSONDecodeError, OSError):
+                     pass
+
+     # Score statistics
+     scores = [v.get("score", 0) for v in per_task.values() if isinstance(v, dict)]
+     if not scores:
+         return {}
+
+     failing = {tid: v for tid, v in per_task.items() if isinstance(v, dict) and v.get("score", 0) < 0.5}
+     passing = {tid: v for tid, v in per_task.items() if isinstance(v, dict) and v.get("score", 0) >= 0.8}
+
+     # Group failures by category
+     failure_categories = {}
+     for tid in failing:
+         meta = task_meta.get(tid, {})
+         cat = meta.get("category", meta.get("type", "unknown"))
+         failure_categories.setdefault(cat, []).append(tid)
+
+     return {
+         "total_tasks": len(per_task),
+         "avg_score": sum(scores) / len(scores),
+         "failing_count": len(failing),
+         "passing_count": len(passing),
+         "failing_task_ids": list(failing.keys()),
+         "failure_categories": {cat: tids for cat, tids in sorted(failure_categories.items(), key=lambda x: -len(x[1]))},
+     }
+
+
+ def correlate_tokens_scores(runs, scores_data):
+     """Check if token usage correlates with task scores."""
+     per_task = scores_data.get("per_task", {}) if scores_data else {}
+     if not per_task or not runs:
+         return None
+
+     # Simple correlation: avg score for high-token vs low-token runs
+     token_scores = {"low": [], "medium": [], "high": []}
+     for run in runs:
+         tokens = run.get("tokens") or run.get("total_tokens") or 0
+         # Try to match run to task by name
+         name = run.get("name", "")
+         for tid, tdata in per_task.items():
+             if isinstance(tdata, dict) and tid in name:
+                 score = tdata.get("score", 0)
+                 if tokens < 500:
+                     token_scores["low"].append(score)
+                 elif tokens < 2000:
+                     token_scores["medium"].append(score)
+                 else:
+                     token_scores["high"].append(score)
+                 break
+
+     result = {}
+     for bucket, scores in token_scores.items():
+         if scores:
+             result[bucket] = {"count": len(scores), "avg_score": sum(scores) / len(scores)}
+     return result if result else None
+
+
+ def generate_hypotheses(error_clusters, token_analysis, response_analysis, score_cross_ref, token_score_corr):
+     """Generate data-driven hypotheses about failure patterns."""
+     hypotheses = []
+
+     # Hypothesis: errors cause failures
+     if error_clusters:
+         total_errors = sum(c["count"] for c in error_clusters)
+         top_error = error_clusters[0]
+         hypotheses.append(
+             f"{total_errors} runs had errors. Most common: \"{top_error['pattern']}\" ({top_error['count']} occurrences)"
+         )
+
+     # Hypothesis: empty responses
+     if response_analysis and response_analysis.get("empty", {}).get("count", 0) > 0:
+         n = response_analysis["empty"]["count"]
+         hypotheses.append(
+             f"{n} runs returned empty responses — possible API timeout, rate limiting, or invalid prompt"
+         )
+
+     # Hypothesis: high token usage correlates with low scores
+     if token_score_corr:
+         high = token_score_corr.get("high", {})
+         low = token_score_corr.get("low", {})
+         if high.get("avg_score", 1) < low.get("avg_score", 0) - 0.15:
+             hypotheses.append(
+                 f"High-token runs avg score {high['avg_score']:.2f} vs low-token {low['avg_score']:.2f} — model may be verbose but inaccurate"
+             )
+
+     # Hypothesis: specific category failures
+     if score_cross_ref and score_cross_ref.get("failure_categories"):
+         cats = score_cross_ref["failure_categories"]
+         top_cat = next(iter(cats))
+         count = len(cats[top_cat])
+         hypotheses.append(
+             f"Category \"{top_cat}\" has {count} failing tasks — may need targeted prompt or tool improvement"
+         )
+
+     # Hypothesis: many failing
+     if score_cross_ref:
+         fail_count = score_cross_ref.get("failing_count", 0)
+         total = score_cross_ref.get("total_tasks", 1)
+         if fail_count > total * 0.5:
+             hypotheses.append(
+                 f"{fail_count}/{total} tasks failing (>50%) — fundamental approach issue, not edge cases"
+             )
+
+     return hypotheses
+
+
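The token/score hypothesis only fires when the bucket averages differ by more than a fixed 0.15 gap. A tiny sketch of that comparison rule with synthetic numbers, shaped like `correlate_tokens_scores`' output:

```python
# Synthetic bucket averages (illustrative values, not real trace data)
token_score_corr = {
    "low":  {"count": 4, "avg_score": 0.82},
    "high": {"count": 5, "avg_score": 0.41},
}

high = token_score_corr.get("high", {})
low = token_score_corr.get("low", {})

# Same rule as in generate_hypotheses: flag only when the gap exceeds 0.15
verbose_but_wrong = high.get("avg_score", 1) < low.get("avg_score", 0) - 0.15
print(verbose_but_wrong)  # True for these numbers (0.41 < 0.82 - 0.15)
```

The defaults (`1` for a missing high average, `0` for a missing low one) are chosen so an absent bucket can never trigger the hypothesis.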
+ def identify_top_issues(error_clusters, response_analysis, score_cross_ref):
+     """Identify the most impactful issues, sorted by severity."""
+     issues = []
+
+     # Empty responses = high severity
+     if response_analysis and response_analysis.get("empty", {}).get("count", 0) > 0:
+         issues.append({
+             "type": "empty_response",
+             "severity": "high",
+             "count": response_analysis["empty"]["count"],
+             "description": "Runs returning empty responses",
+         })
+
+     # Errors = high severity
+     if error_clusters:
+         for cluster in error_clusters[:3]:
+             issues.append({
+                 "type": "error",
+                 "severity": "high" if cluster["count"] > 2 else "medium",
+                 "count": cluster["count"],
+                 "pattern": cluster["pattern"],
+                 "description": f"Error: {cluster['pattern'][:60]}",
+             })
+
+     # Category-concentrated failures = medium severity
+     if score_cross_ref and score_cross_ref.get("failure_categories"):
+         for cat, tids in list(score_cross_ref["failure_categories"].items())[:3]:
+             issues.append({
+                 "type": "category_failure",
+                 "severity": "medium" if len(tids) >= 3 else "low",
+                 "category": cat,
+                 "tasks": tids,
+                 "description": f"Category \"{cat}\" has {len(tids)} failing tasks",
+             })
+
+     # Sort by severity
+     severity_order = {"high": 0, "medium": 1, "low": 2}
+     issues.sort(key=lambda x: severity_order.get(x.get("severity", "low"), 3))
+     return issues
+
+
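The final sort maps severity labels to ranks so high-severity issues surface first; unknown labels fall back to rank 3 and sink to the end. A minimal sketch of that ordering on synthetic issues:

```python
# Synthetic issues in mixed order, shaped like identify_top_issues' entries
issues = [
    {"type": "category_failure", "severity": "low"},
    {"type": "error", "severity": "high"},
    {"type": "category_failure", "severity": "medium"},
]

# Same rank mapping as above; list.sort is stable, so ties keep insertion order
severity_order = {"high": 0, "medium": 1, "low": 2}
issues.sort(key=lambda x: severity_order.get(x.get("severity", "low"), 3))
print([i["severity"] for i in issues])  # ['high', 'medium', 'low']
```

Because the sort is stable, two issues with the same severity keep the order they were appended in (empty responses before error clusters before category failures).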
+ def main():
+     parser = argparse.ArgumentParser(description="Generate trace insights from LangSmith data + scores")
+     parser.add_argument("--langsmith-runs", required=True, help="Path to langsmith_runs.json")
+     parser.add_argument("--langsmith-stats", help="Path to langsmith_stats.json (optional)")
+     parser.add_argument("--scores", required=True, help="Path to best version's scores.json")
+     parser.add_argument("--tasks-dir", required=True, help="Path to eval/tasks/ directory")
+     parser.add_argument("--output", required=True, help="Output path for trace_insights.json")
+     args = parser.parse_args()
+
+     runs = load_json(args.langsmith_runs)
+     stats = load_json(args.langsmith_stats)
+     scores_data = load_json(args.scores)
+
+     if not runs and not scores_data:
+         # Nothing to analyze — write minimal insights
+         insights = {
+             "generated_at": datetime.now(timezone.utc).isoformat(),
+             "summary": "No trace data or scores available for analysis",
+             "error_clusters": [],
+             "token_analysis": {},
+             "response_analysis": {},
+             "hypotheses": [],
+             "top_issues": [],
+         }
+         os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True)
+         with open(args.output, "w") as f:
+             json.dump(insights, f, indent=2)
+         print("No data available — wrote empty insights")
+         return
+
+     runs = runs or []
+
+     # Phase 1: Cluster traces
+     error_clusters = cluster_errors(runs)
+     token_analysis = analyze_tokens(runs)
+     response_analysis = analyze_responses(runs)
+
+     # Phase 2: Cross-reference with scores
+     score_cross_ref = cross_reference_scores(runs, scores_data, args.tasks_dir)
+     token_score_corr = correlate_tokens_scores(runs, scores_data)
+
+     # Phase 3: Generate hypotheses
+     hypotheses = generate_hypotheses(error_clusters, token_analysis, response_analysis, score_cross_ref, token_score_corr)
+
+     # Phase 4: Identify top issues
+     top_issues = identify_top_issues(error_clusters, response_analysis, score_cross_ref)
+
+     # Build summary line
+     parts = []
+     if error_clusters:
+         parts.append(f"{len(error_clusters)} error pattern(s)")
+     if score_cross_ref:
+         parts.append(f"{score_cross_ref.get('failing_count', 0)}/{score_cross_ref.get('total_tasks', 0)} tasks failing")
+         parts.append(f"avg score {score_cross_ref.get('avg_score', 0):.2f}")
+     summary = "; ".join(parts) if parts else "Analysis complete, no major issues found"
+
+     # Merge stats if available
+     stats_summary = {}
+     if stats:
+         stats_summary = {
+             "total_runs": stats.get("total_runs") or stats.get("run_count"),
+             "error_rate": stats.get("error_rate"),
+             "avg_latency_ms": stats.get("avg_latency_ms") or stats.get("latency_p50"),
+             "p95_latency_ms": stats.get("latency_p95"),
+             "avg_tokens": stats.get("avg_tokens") or stats.get("avg_total_tokens"),
+         }
+         # Remove None values
+         stats_summary = {k: v for k, v in stats_summary.items() if v is not None}
+
+     insights = {
+         "generated_at": datetime.now(timezone.utc).isoformat(),
+         "summary": summary,
+         "langsmith_stats": stats_summary if stats_summary else None,
+         "error_clusters": error_clusters,
+         "token_analysis": token_analysis,
+         "response_analysis": response_analysis,
+         "score_cross_ref": score_cross_ref if score_cross_ref else None,
+         "token_score_correlation": token_score_corr,
+         "hypotheses": hypotheses,
+         "top_issues": top_issues,
+     }
+
+     # Remove None values at top level
+     insights = {k: v for k, v in insights.items() if v is not None}
+
+     os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True)
+     with open(args.output, "w") as f:
+         json.dump(insights, f, indent=2)
+
+     print(f"Trace insights generated: {summary}")
+     print(f"  {len(error_clusters)} error cluster(s), {len(hypotheses)} hypothesis(es), {len(top_issues)} issue(s)")
+
+
+ if __name__ == "__main__":
+     main()