harness-evolver 2.5.1 → 2.6.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- package/README.md +30 -14
- package/agents/harness-evolver-proposer.md +14 -0
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +59 -0
- package/skills/import-traces/SKILL.md +102 -0
- package/tools/import_traces.py +229 -0
- package/tools/test_growth.py +230 -0
- package/tools/trace_insights.py +350 -0
package/README.md
CHANGED
@@ -46,12 +46,20 @@ claude
 
 <table>
 <tr>
-<td><b>5 Proposers</b></td>
-<td>Each iteration spawns 5 parallel agents
+<td><b>5 Adaptive Proposers</b></td>
+<td>Each iteration spawns 5 parallel agents: exploit (targeted fix), explore (bold rewrite), crossover (combine two parents), and 2 failure-focused agents that target the weakest task clusters. Strategies adapt every iteration based on actual per-task scores — no fixed specialists.</td>
 </tr>
 <tr>
-<td><b>
-<td>Every harness run captures stdout, stderr, timing, and per-task I/O. LangSmith auto-tracing for LangChain/LangGraph agents.
+<td><b>Trace Insights</b></td>
+<td>Every harness run captures stdout, stderr, timing, and per-task I/O. LangSmith auto-tracing for LangChain/LangGraph agents. Traces are systematically clustered by error pattern, token usage, and response type — proposers receive structured diagnostic data, not raw logs.</td>
+</tr>
+<tr>
+<td><b>Quality-Diversity Selection</b></td>
+<td>Not winner-take-all. Tracks per-task champions — a candidate that loses overall but excels at specific tasks is preserved as the next crossover parent. The archive never discards variants.</td>
+</tr>
+<tr>
+<td><b>Durable Test Gates</b></td>
+<td>When the loop fixes a failure, regression tasks are automatically generated to lock in the improvement. The test suite grows over iterations — fixed bugs can never silently return.</td>
 </tr>
 <tr>
 <td><b>Critic</b></td>
@@ -74,13 +82,14 @@ claude
 | Command | What it does |
 |---|---|
 | `/harness-evolver:init` | Scan project, create harness/eval/tasks, run baseline |
-| `/harness-evolver:evolve` | Run the autonomous optimization loop (5
+| `/harness-evolver:evolve` | Run the autonomous optimization loop (5 adaptive proposers) |
 | `/harness-evolver:status` | Show progress, scores, stagnation detection |
 | `/harness-evolver:compare` | Diff two versions with per-task analysis |
 | `/harness-evolver:diagnose` | Deep trace analysis of a specific version |
 | `/harness-evolver:deploy` | Promote the best harness back to your project |
 | `/harness-evolver:architect` | Analyze and recommend optimal agent topology |
 | `/harness-evolver:critic` | Evaluate eval quality and detect gaming |
+| `/harness-evolver:import-traces` | Pull production LangSmith traces as eval tasks |
 
 ---
 
@@ -139,16 +148,20 @@ Works with any language, any framework, any domain. If your project doesn't have
 ```
 /harness-evolver:evolve
 │
-├─ 1.
-├─
-├─
-├─
+├─ 1. Get next version
+├─ 1.5 Gather LangSmith traces (processed into readable format)
+├─ 1.6 Generate Trace Insights (cluster errors, analyze tokens, cross-ref scores)
+├─ 1.8 Analyze per-task failures (cluster by category for adaptive briefings)
+├─ 2. Spawn 5 proposers in parallel (exploit / explore / crossover / 2× failure-targeted)
+├─ 3. Validate all candidates
+├─ 4. Evaluate all candidates
 ├─ 4.5 Judge (if using LLM-as-judge eval)
-├─ 5.
-├─
+├─ 5. Select winner + track per-task champion
+├─ 5.5 Test suite growth (generate regression tasks for fixed failures)
+├─ 6. Report results
 ├─ 6.5 Auto-trigger Critic (if score jumped >0.3 or reached 1.0 too fast)
-├─ 7.
-└─ 8.
+├─ 7. Auto-trigger Architect (if regression or stagnation)
+└─ 8. Check stop conditions (target reached, N iterations, stagnation post-architect)
 ```
 
 ---
 
@@ -175,9 +188,12 @@ The plugin auto-detects available keys. No key needed for the included example.
 |---|---|---|---|---|
 | **Format** | Paper artifact | Framework (Docker) | Plugin (passive) | **Plugin (active)** |
 | **Search** | Code-space | Code-space | Prompt-space | **Code-space** |
-| **Candidates/iter** | 1 | 1 | N/A | **5 parallel** |
+| **Candidates/iter** | 1 | 1 | N/A | **5 parallel (adaptive)** |
+| **Selection** | Single best | Single best | N/A | **Quality-diversity (per-task)** |
 | **Auto-critique** | No | No | No | **Yes (critic + judge)** |
 | **Architecture** | Fixed | Fixed | N/A | **Auto-recommended** |
+| **Trace analysis** | Manual | No | No | **Systematic (clustering + insights)** |
+| **Test growth** | No | No | No | **Yes (durable regression gates)** |
 | **LangSmith** | No | No | No | **Yes** |
 | **Context7** | No | No | No | **Yes** |
 | **Zero-config** | No | No | No | **Yes** |
package/agents/harness-evolver-proposer.md
CHANGED

@@ -26,6 +26,20 @@ Your prompt contains a `<strategy>` block defining your approach. Follow it:
 
 If no strategy block is present, default to exploitation (conservative improvement).
 
+## Trace Insights
+
+If `.harness-evolver/trace_insights.json` exists in your `<files_to_read>`, use it to guide your diagnosis:
+
+1. Check `top_issues` first — these are the highest-impact problems sorted by severity
+2. Check `hypotheses` for data-driven theories about failure causes
+3. Use `error_clusters` to understand which error patterns affect which runs
+4. The `token_analysis` and `token_score_correlation` sections show if verbosity correlates with quality
+5. `score_cross_ref.failure_categories` maps failure patterns to task categories
+
+These insights are generated from LangSmith traces cross-referenced with per-task scores — they are **data, not guesses**. Prioritize addressing issues marked severity `"high"` over `"medium"` or `"low"`.
+
+If trace insights are not available, proceed with manual trace analysis as described in Phase 2.
+
 ## Context7 — Enrich Your Knowledge
 
 You have access to Context7 MCP tools (`resolve-library-id` and `get-library-docs`) for looking up **current, version-specific documentation** of any library.
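As a quick illustration of the checklist above, the snippet below sorts `top_issues` by the stated severity order before acting on them. The field names (`top_issues`, `hypotheses`, `error_clusters`) come from the list above; the example values are hypothetical, not real tool output.

```python
# Hypothetical trace_insights.json contents. Field names follow the
# checklist above; the issues and clusters here are illustrative only.
insights = {
    "top_issues": [
        {"issue": "timeout on long inputs", "severity": "medium"},
        {"issue": "tool-call JSON malformed", "severity": "high"},
        {"issue": "verbose preamble wastes tokens", "severity": "low"},
    ],
    "hypotheses": ["Malformed tool calls correlate with multi-step tasks"],
    "error_clusters": {"json_decode_error": {"count": 7, "runs": ["r1", "r2"]}},
}

# Address "high" before "medium" before "low", as instructed above.
rank = {"high": 0, "medium": 1, "low": 2}
ordered = sorted(insights["top_issues"], key=lambda i: rank[i["severity"]])
print([i["severity"] for i in ordered])  # ['high', 'medium', 'low']
```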
package/package.json
CHANGED
package/skills/evolve/SKILL.md
CHANGED
@@ -132,6 +132,32 @@ The resulting `langsmith_runs.json` has clean, readable entries:
 
 These files are included in the proposer's `<files_to_read>` so it has readable trace data for diagnosis.
 
+### 1.6. Generate Trace Insights (systematic analysis)
+
+If LangSmith traces were gathered, run systematic analysis to cluster errors, analyze token usage, and cross-reference with scores:
+
+```bash
+if [ -f ".harness-evolver/langsmith_runs.json" ]; then
+  BEST=$(python3 -c "import json; s=json.load(open('.harness-evolver/summary.json')); print(s['best']['version'])")
+  SCORES_PATH=".harness-evolver/harnesses/$BEST/scores.json"
+  [ ! -f "$SCORES_PATH" ] && SCORES_PATH=".harness-evolver/baseline/scores.json"
+  python3 $TOOLS/trace_insights.py \
+    --langsmith-runs .harness-evolver/langsmith_runs.json \
+    --langsmith-stats .harness-evolver/langsmith_stats.json \
+    --scores "$SCORES_PATH" \
+    --tasks-dir .harness-evolver/eval/tasks/ \
+    --output .harness-evolver/trace_insights.json 2>/dev/null
+fi
+```
+
+The resulting `trace_insights.json` contains:
+- `error_clusters`: grouped error patterns with counts
+- `token_analysis`: score distribution by token usage bucket (low/medium/high)
+- `hypotheses`: data-driven theories about failure causes
+- `top_issues`: highest-impact problems sorted by severity
+
+This file is included in all proposers' `<files_to_read>` so they have structured diagnostic data.
+
 ### 1.8. Analyze Per-Task Failures (adaptive briefings for Candidates D and E)
 
 Before spawning proposers, analyze which tasks are failing and cluster them:
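The `BEST=$(python3 -c ...)` one-liner above assumes `summary.json` exposes a `best.version` path (and section 5.5 later reads a `versions` list from the same file). A minimal sketch of that assumed shape and the extraction, with made-up version names:

```python
import json
import os
import tempfile

# Assumed minimal summary.json shape: the one-liner above needs
# s["best"]["version"]; the versions list is what section 5.5 reads.
# Values here are illustrative.
summary = {
    "best": {"version": "v003", "score": 0.82},
    "versions": [
        {"version": "v001", "score": 0.55},
        {"version": "v002", "score": 0.70},
        {"version": "v003", "score": 0.82},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "summary.json")
with open(path, "w") as f:
    json.dump(summary, f)

# The same extraction the bash snippet performs inline:
best = json.load(open(path))["best"]["version"]
print(best)  # v003
```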
@@ -228,6 +254,7 @@ Agent(
 - .harness-evolver/langsmith_diagnosis.json (if exists)
 - .harness-evolver/langsmith_stats.json (if exists)
 - .harness-evolver/langsmith_runs.json (if exists)
+- .harness-evolver/trace_insights.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
@@ -267,6 +294,7 @@ Agent(
 - .harness-evolver/harnesses/{explorer_parent}/scores.json
 - .harness-evolver/langsmith_diagnosis.json (if exists)
 - .harness-evolver/langsmith_runs.json (if exists)
+- .harness-evolver/trace_insights.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
@@ -305,6 +333,7 @@ Agent(
 - .harness-evolver/harnesses/{parent_b}/scores.json
 - .harness-evolver/langsmith_diagnosis.json (if exists)
 - .harness-evolver/langsmith_runs.json (if exists)
+- .harness-evolver/trace_insights.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
@@ -347,6 +376,7 @@ Agent(
 - .harness-evolver/harnesses/{best_version}/harness.py
 - .harness-evolver/harnesses/{best_version}/scores.json
 - .harness-evolver/langsmith_runs.json (if exists)
+- .harness-evolver/trace_insights.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
@@ -407,6 +437,7 @@ Agent(
 - .harness-evolver/harnesses/{best_version}/harness.py
 - .harness-evolver/harnesses/{best_version}/scores.json
 - .harness-evolver/langsmith_runs.json (if exists)
+- .harness-evolver/trace_insights.json (if exists)
 - .harness-evolver/architecture.json (if exists)
 </files_to_read>
 
@@ -580,6 +611,34 @@ Iteration {i}/{N} — {num_candidates} candidates evaluated:
 
 Keep losing candidates in their directories (they're part of the archive — never discard, per DGM).
 
+### 5.5. Test Suite Growth (Durable Regression Gates)
+
+After the winner is promoted, check if any previously-failing tasks are now passing.
+Generate regression tasks to lock in improvements and prevent future regressions:
+
+```bash
+PREV_BEST=$(python3 -c "
+import json
+s = json.load(open('.harness-evolver/summary.json'))
+versions = s.get('versions', [])
+print(versions[-2]['version'] if len(versions) >= 2 else '')
+" 2>/dev/null)
+if [ -n "$PREV_BEST" ] && [ -f ".harness-evolver/harnesses/$PREV_BEST/scores.json" ]; then
+  python3 $TOOLS/test_growth.py \
+    --current-scores .harness-evolver/harnesses/{version}/scores.json \
+    --previous-scores ".harness-evolver/harnesses/$PREV_BEST/scores.json" \
+    --tasks-dir .harness-evolver/eval/tasks/ \
+    --output-dir .harness-evolver/eval/tasks/ \
+    --max-total-tasks 60 2>/dev/null
+fi
+```
+
+If new tasks were added, print: "Added {N} regression tasks to lock in improvements on: {task_ids}"
+
+This is the "durable test gates" pattern: every fixed failure becomes a permanent regression test.
+New tasks are tagged with `metadata.type: "regression"` and `metadata.source: "regression"` so they
+can be distinguished from original tasks. The test suite only grows — regression tasks are never removed.
+
 ### 6. Report
 
 Read `summary.json`. Print: `Iteration {i}/{N}: {version} scored {score} (best: {best} at {best_score})`
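The "fixed task" rule that `test_growth.py` applies (see `find_fixed_tasks` in the tool source later in this diff) can be checked in isolation: a task counts as fixed only when its score crosses from below 0.5 to above 0.8 between versions. The task IDs and scores below are made up for illustration:

```python
# Per-task scores for two consecutive versions (illustrative values).
prev = {"task_03": {"score": 0.2}, "task_07": {"score": 0.6}, "task_09": {"score": 0.4}}
curr = {"task_03": {"score": 0.9}, "task_07": {"score": 0.9}, "task_09": {"score": 0.7}}

# A task is "fixed" when it crosses both thresholds: was < 0.5, now > 0.8.
fixed = [
    tid for tid, c in curr.items()
    if prev.get(tid, {}).get("score", 0) < 0.5 and c["score"] > 0.8
]
print(fixed)  # ['task_03']
```

Only `task_03` qualifies: `task_07` was not failing before (0.6 is above the 0.5 cutoff), and `task_09` is still not passing (0.7 is below the 0.8 cutoff). Only tasks that genuinely flipped from failing to passing earn regression variants.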
package/skills/import-traces/SKILL.md
ADDED

@@ -0,0 +1,102 @@
---
name: harness-evolver:import-traces
description: "Use when the user wants to import real production traces from LangSmith as test tasks, convert traces to eval tasks, enrich their eval set with real-world data, or pull production data into their harness evaluation."
argument-hint: "[--project NAME] [--limit N]"
allowed-tools: [Read, Write, Edit, Bash, Glob, Grep]
---

# /harness-evolver:import-traces

Import production traces from LangSmith and convert them into eval tasks. This enriches the test suite with real-world inputs, prioritizing traces with negative user feedback.

## Prerequisites

- `.harness-evolver/` must exist. If not, tell user to run `harness-evolver:init` first.
- `langsmith-cli` must be available. Check:

```bash
which langsmith-cli 2>/dev/null
```

If not found: "Install langsmith-cli first: `uv tool install langsmith-cli && langsmith-cli auth login`"

## Resolve Tool Path

```bash
TOOLS=$([ -d ".harness-evolver/tools" ] && echo ".harness-evolver/tools" || echo "$HOME/.harness-evolver/tools")
```

## Parse Arguments

- `--project NAME` — LangSmith project name (if not provided, discover interactively)
- `--limit N` — max traces to import (default: 20)

## Phase 1: Discover Projects

If `--project` not provided, list available projects:

```bash
langsmith-cli --json projects list --limit 20 2>/dev/null
```

Show the user a list of projects with run counts. Let them pick one, or use the most recent.

If `--project` is provided, use it directly.

## Phase 2: Fetch Traces

```bash
langsmith-cli --json runs list \
  --project "{project_name}" \
  --limit {limit} \
  --fields id,name,inputs,outputs,error,feedback_stats,total_tokens \
  > /tmp/harness_import_traces.json 2>/dev/null
```

Check the output has data:
```bash
python3 -c "import json; data=json.load(open('/tmp/harness_import_traces.json')); print(f'{len(data)} traces fetched')"
```

If no traces found, tell user the project may be empty or the name may be wrong.

## Phase 3: Convert to Tasks

```bash
python3 $TOOLS/import_traces.py \
  --traces-json /tmp/harness_import_traces.json \
  --output-dir .harness-evolver/eval/tasks/ \
  --prefix imported \
  --max-tasks {limit}
```

## Phase 4: Report

Read the tool output and report:
- How many traces were imported
- How many had negative feedback (high priority)
- How many were skipped (no extractable input, duplicates)
- Total tasks now in eval set

```bash
ls .harness-evolver/eval/tasks/*.json | wc -l
```

Print:
```
Imported {N} production traces as eval tasks.
{M} with negative user feedback (high priority)
{K} skipped (no input or duplicates)
Total eval tasks: {total}

Next: run `harness-evolver:evolve` to optimize against real-world inputs.
```

## Gotchas

- Traces with no extractable user input are skipped (e.g., system-only runs)
- Duplicate traces (same run ID) are automatically skipped
- Imported tasks are tagged with `metadata.source: "imported"` and `metadata.type: "production"`
- Tasks with negative feedback get `metadata.user_feedback: "negative"` — the proposer should prioritize these
- The `metadata.langsmith_run_id` field links back to the original trace for debugging
- Cleanup: `rm /tmp/harness_import_traces.json` after import
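To make the Gotchas concrete, here is a sketch of one fetched run record becoming a task file. The record's field names follow the `--fields` list in Phase 2, but its values (run ID, project name, question) are invented; the ID scheme (`imported_` plus an 8-character MD5 prefix) mirrors the `short_id` helper in `import_traces.py` below.

```python
import hashlib
import json

# Illustrative langsmith-cli run record -- field names from the --fields
# list above, values made up for this example.
run = {
    "id": "a1b2c3d4-0000-0000-0000-000000000000",
    "name": "support-agent",
    "inputs": {"input": "How do I reset my API key?"},
    "error": None,
    "feedback_stats": {"thumbs_down": 1},
}

# Deterministic short ID, same scheme as short_id() in the import tool.
task_id = "imported_" + hashlib.md5(run["id"].encode()).hexdigest()[:8]

# Resulting task, tagged as described in the Gotchas section.
task = {
    "id": task_id,
    "input": run["inputs"]["input"],
    "metadata": {
        "source": "imported",
        "type": "production",
        "user_feedback": "negative",  # thumbs_down > 0
        "langsmith_run_id": run["id"],
    },
}
print(json.dumps(task, indent=2))
```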
package/tools/import_traces.py
ADDED

@@ -0,0 +1,229 @@
#!/usr/bin/env python3
"""Import LangSmith Traces as Eval Tasks for Harness Evolver.

Transforms LangSmith trace JSON (from langsmith-cli) into task JSON files
for the evaluation set. Prioritizes traces with negative feedback.

Usage:
    python3 import_traces.py \
        --traces-json /tmp/langsmith_traces.json \
        --output-dir .harness-evolver/eval/tasks/ \
        --prefix imported \
        [--max-tasks 30]

Stdlib-only. No external dependencies.
"""

import argparse
import hashlib
import json
import os
import re
import sys


def load_json(path):
    """Load JSON file, return None if missing or invalid."""
    if not path or not os.path.exists(path):
        return None
    try:
        with open(path) as f:
            return json.load(f)
    except (json.JSONDecodeError, OSError):
        return None


def extract_input_from_trace(run):
    """Extract the user input from a LangSmith run's inputs field.

    Handles multiple LangChain serialization formats:
    - Direct {"input": "..."} field
    - {"messages": [[HumanMessage, ...]]} format
    - {"question": "..."} or {"query": "..."} fields
    """
    inputs = run.get("inputs", {})
    if not inputs:
        return None

    if isinstance(inputs, str):
        return inputs

    # Direct input field
    for key in ("input", "question", "query", "prompt", "text", "user_input"):
        if key in inputs and isinstance(inputs[key], str):
            return inputs[key]

    # LangChain messages format
    messages = inputs.get("messages") or inputs.get("input")
    if isinstance(messages, list):
        # Might be [[msg1, msg2]] (batched) or [msg1, msg2]
        if messages and isinstance(messages[0], list):
            messages = messages[0]
        for msg in messages:
            if isinstance(msg, dict):
                # {"type": "human", "content": "..."}
                if msg.get("type") in ("human", "HumanMessage") or msg.get("role") == "user":
                    content = msg.get("content", "")
                    if isinstance(content, str) and content:
                        return content
                    if isinstance(content, list):
                        # Multi-modal: [{"type": "text", "text": "..."}]
                        for part in content:
                            if isinstance(part, dict) and part.get("type") == "text":
                                return part.get("text", "")
            elif isinstance(msg, str) and msg:
                return msg

    # Fallback: stringify the whole inputs
    flat = json.dumps(inputs)
    if len(flat) > 20:  # Only if there's meaningful content
        return flat[:2000]

    return None


def extract_feedback(run):
    """Extract user feedback from a LangSmith run."""
    feedback = run.get("feedback_stats") or run.get("feedback") or {}
    if not feedback:
        return None

    # feedback_stats format: {"thumbs_up": N, "thumbs_down": N}
    if isinstance(feedback, dict):
        up = feedback.get("thumbs_up", 0) or feedback.get("positive", 0)
        down = feedback.get("thumbs_down", 0) or feedback.get("negative", 0)
        if down > 0:
            return "negative"
        if up > 0:
            return "positive"
    return None


def infer_difficulty(text):
    """Infer difficulty from input characteristics."""
    if not text:
        return "medium"
    length = len(text)
    # Count question marks, clauses, etc.
    questions = text.count("?")
    sentences = len(re.split(r"[.!?]+", text))

    if length < 50 and questions <= 1:
        return "easy"
    if length > 500 or questions > 2 or sentences > 5:
        return "hard"
    return "medium"


def short_id(run_id):
    """Create a short deterministic ID from a full run ID."""
    return hashlib.md5(str(run_id).encode()).hexdigest()[:8]


def main():
    parser = argparse.ArgumentParser(description="Import LangSmith traces as eval tasks")
    parser.add_argument("--traces-json", required=True, help="Path to langsmith-cli JSON output")
    parser.add_argument("--output-dir", required=True, help="Directory to write task JSON files")
    parser.add_argument("--prefix", default="imported", help="Prefix for task IDs (default: imported)")
    parser.add_argument("--max-tasks", type=int, default=30, help="Max tasks to import (default: 30)")
    parser.add_argument("--prioritize-negative", action="store_true", default=True,
                        help="Import negative-feedback traces first (default: true)")
    args = parser.parse_args()

    traces = load_json(args.traces_json)
    if not traces:
        print("No traces found or invalid JSON — nothing to import")
        return

    if isinstance(traces, dict):
        # Might be wrapped in {"runs": [...]}
        traces = traces.get("runs", traces.get("data", [traces]))

    if not isinstance(traces, list):
        print("Unexpected traces format — expected a JSON array")
        return

    # Sort: negative feedback first, then errors, then the rest
    if args.prioritize_negative:
        def priority(run):
            fb = extract_feedback(run)
            has_error = bool(run.get("error"))
            if fb == "negative":
                return 0
            if has_error:
                return 1
            return 2
        traces.sort(key=priority)

    os.makedirs(args.output_dir, exist_ok=True)

    # Check for existing imported tasks to avoid duplicates
    existing_run_ids = set()
    for fname in os.listdir(args.output_dir):
        if fname.endswith(".json"):
            task = load_json(os.path.join(args.output_dir, fname))
            if task and task.get("metadata", {}).get("langsmith_run_id"):
                existing_run_ids.add(task["metadata"]["langsmith_run_id"])

    imported = 0
    skipped_no_input = 0
    skipped_duplicate = 0
    negative_count = 0

    for run in traces:
        if imported >= args.max_tasks:
            break

        run_id = str(run.get("id", ""))
        if run_id in existing_run_ids:
            skipped_duplicate += 1
            continue

        user_input = extract_input_from_trace(run)
        if not user_input or len(user_input.strip()) < 5:
            skipped_no_input += 1
            continue

        feedback = extract_feedback(run)
        has_error = bool(run.get("error"))
        task_id = f"{args.prefix}_{short_id(run_id)}"

        task = {
            "id": task_id,
            "input": user_input.strip(),
            "metadata": {
                "difficulty": infer_difficulty(user_input),
                "category": run.get("name", "unknown"),
                "type": "production",
                "source": "imported",
                "langsmith_run_id": run_id,
                "had_error": has_error,
                "user_feedback": feedback,
            },
        }

        out_path = os.path.join(args.output_dir, f"{task_id}.json")
        with open(out_path, "w") as f:
            json.dump(task, f, indent=2)

        imported += 1
        if feedback == "negative":
            negative_count += 1

    summary = {
        "imported": imported,
        "negative_feedback": negative_count,
        "skipped_no_input": skipped_no_input,
        "skipped_duplicate": skipped_duplicate,
        "total_traces": len(traces),
    }
    print(json.dumps(summary))
    print(f"Imported {imported} production traces as tasks ({negative_count} with negative feedback)")
    if skipped_duplicate:
        print(f"Skipped {skipped_duplicate} already-imported traces")
    if skipped_no_input:
        print(f"Skipped {skipped_no_input} traces with no extractable input")


if __name__ == "__main__":
    main()
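The trickiest branch in `extract_input_from_trace` above is the batched-messages format, where `inputs` arrives as `{"messages": [[...]]}` and the first human message's content is the user input. A standalone check of just that path (the message contents here are invented):

```python
# Batched LangChain serialization: a list of message lists.
inputs = {"messages": [[
    {"type": "system", "content": "You are helpful."},
    {"type": "human", "content": "Summarize this ticket."},
]]}

messages = inputs["messages"]
if messages and isinstance(messages[0], list):  # unwrap the batch wrapper
    messages = messages[0]

# First human/user message wins, same rule as the extractor above.
extracted = next(
    (m["content"] for m in messages
     if m.get("type") in ("human", "HumanMessage") or m.get("role") == "user"),
    None,
)
print(extracted)  # Summarize this ticket.
```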
@@ -0,0 +1,230 @@
|
|
|
1
|
+
#!/usr/bin/env python3
|
|
2
|
+
"""Test Suite Growth for Harness Evolver.
|
|
3
|
+
|
|
4
|
+
Generates regression test tasks when previously-failing tasks are now passing.
|
|
5
|
+
Creates mechanical variations of fixed tasks to prevent future regressions.
|
|
6
|
+
|
|
7
|
+
Usage:
|
|
8
|
+
python3 test_growth.py \
|
|
9
|
+
--current-scores .harness-evolver/harnesses/v003/scores.json \
|
|
10
|
+
--previous-scores .harness-evolver/harnesses/v002/scores.json \
|
|
11
|
+
--tasks-dir .harness-evolver/eval/tasks/ \
|
|
12
|
+
--output-dir .harness-evolver/eval/tasks/ \
|
|
13
|
+
--max-total-tasks 60
|
|
14
|
+
|
|
15
|
+
Stdlib-only. No external dependencies.
|
|
16
|
+
"""
|
|
17
|
+
|
|
18
|
+
import argparse
|
|
19
|
+
import json
|
|
20
|
+
import os
|
|
21
|
+
import re
|
|
22
|
+
import sys
|
|
23
|
+
|
|
24
|
+
|
|
25
|
+
def load_json(path):
|
|
26
|
+
"""Load JSON file, return None if missing or invalid."""
|
|
27
|
+
if not path or not os.path.exists(path):
|
|
28
|
+
return None
|
|
29
|
+
try:
|
|
30
|
+
with open(path) as f:
|
|
31
|
+
return json.load(f)
|
|
32
|
+
except (json.JSONDecodeError, OSError):
|
|
33
|
+
return None
|
|
34
|
+
|
|
35
|
+
|
|
36
|
+
def find_fixed_tasks(current_scores, previous_scores, fix_threshold_before=0.5, fix_threshold_after=0.8):
|
|
37
|
+
"""Find tasks that improved significantly: score < before_threshold → > after_threshold."""
|
|
38
|
+
current_per_task = current_scores.get("per_task", {})
|
|
39
|
+
previous_per_task = previous_scores.get("per_task", {})
|
|
40
|
+
|
|
41
|
+
fixed = []
|
|
42
|
+
for tid, curr_data in current_per_task.items():
|
|
43
|
+
if not isinstance(curr_data, dict):
|
|
44
|
+
continue
|
|
45
|
+
curr_score = curr_data.get("score", 0)
|
|
46
|
+
prev_data = previous_per_task.get(tid, {})
|
|
47
|
+
prev_score = prev_data.get("score", 0) if isinstance(prev_data, dict) else 0
|
|
48
|
+
|
|
49
|
+
if prev_score < fix_threshold_before and curr_score > fix_threshold_after:
|
|
50
|
+
fixed.append({
|
|
51
|
+
"task_id": tid,
|
|
52
|
+
"previous_score": prev_score,
|
|
53
|
+
"current_score": curr_score,
|
|
54
|
+
"improvement": curr_score - prev_score,
|
|
55
|
+
})
|
|
56
|
+
|
|
57
|
+
# Sort by improvement (biggest fixes first)
|
|
58
|
+
fixed.sort(key=lambda x: -x["improvement"])
|
|
59
|
+
return fixed
|
|
60
|
+
|
|
61
|
+
|
|
62
|
+
def count_existing_tasks(directory):
|
|
63
|
+
"""Count existing task JSON files in directory."""
|
|
64
|
+
if not os.path.isdir(directory):
|
|
65
|
+
return 0
|
|
66
|
+
return sum(1 for f in os.listdir(directory) if f.endswith(".json"))
|
|
67
|
+
|
|
68
|
+
|
|
69
|
+
def next_regression_id(output_dir):
|
|
70
|
+
"""Find the next available regression task ID."""
|
|
71
|
+
existing = set()
|
|
72
|
+
if os.path.isdir(output_dir):
|
|
73
|
+
for fname in os.listdir(output_dir):
|
|
74
|
+
m = re.match(r"regression_(\d+)\.json", fname)
|
|
75
|
+
if m:
|
|
76
|
+
existing.add(int(m.group(1)))
|
|
77
|
+
n = 1
|
|
78
|
+
while n in existing:
|
|
79
|
+
n += 1
|
|
80
|
+
return n
|
|
81
|
+
|
|
82
|
+
|
|
83
|
+
def generate_variations(original_input, task_id):
|
|
84
|
+
"""Generate 2-3 mechanical variations of an input string.
|
|
85
|
+
|
|
86
|
+
Uses simple string transforms — no LLM needed:
|
|
87
|
+
- Rephrase by reordering
|
|
88
|
+
- Add qualifying clause
|
|
89
|
+
- Simplify to minimal form
|
|
90
|
+
"""
|
|
91
|
+
variations = []
|
|
92
|
+
text = original_input.strip()
|
|
93
|
+
|
|
94
|
+
# Variation 1: Add a qualifying clause
|
|
95
|
+
qualifiers = [
|
|
96
|
+
"Please be specific and detailed in your response.",
|
|
97
|
+
"Consider edge cases in your answer.",
|
|
98
|
+
"Provide a concise but thorough response.",
|
|
99
|
+
"Think step by step before answering.",
|
|
100
|
+
]
|
|
101
|
+
# Pick qualifier based on hash of task_id for determinism
|
|
102
|
+
qi = hash(task_id) % len(qualifiers)
|
|
103
|
+
v1 = f"{text}\n\n{qualifiers[qi]}"
|
|
104
|
+
variations.append(("qualified", v1))
|
|
105
|
+
|
|
106
|
+
# Variation 2: Reorder sentences if multiple exist
|
|
107
|
+
sentences = re.split(r"(?<=[.!?])\s+", text)
|
|
108
|
+
if len(sentences) >= 2:
|
|
109
|
+
# Swap first two sentences
|
|
110
|
+
reordered = sentences[1:] + sentences[:1]
|
|
111
|
+
v2 = " ".join(reordered)
|
|
112
|
+
variations.append(("reordered", v2))
|
|
113
|
+
else:
|
|
114
|
+
# If single sentence, prepend "Given the context: "
|
|
115
|
+
v2 = f"Given the following context, {text[0].lower()}{text[1:]}" if len(text) > 1 else text
|
|
116
|
+
variations.append(("rephrased", v2))
|
|
117
|
+
|
|
118
|
+
# Variation 3: Minimal version — strip to core question
|
|
119
|
+
# Remove qualifiers, keep just the main ask
|
|
120
|
+
minimal = text
|
|
121
|
+
# Strip common padding phrases
|
|
122
|
+
for prefix in ["Please ", "Can you ", "Could you ", "I would like you to ", "I need you to "]:
|
|
123
|
+
if minimal.startswith(prefix):
|
|
124
|
+
minimal = minimal[len(prefix):]
|
|
125
|
+
minimal = minimal[0].upper() + minimal[1:] if minimal else minimal
|
|
126
|
+
break
|
|
127
|
+
if minimal != text:
|
|
128
|
+
variations.append(("minimal", minimal))
|
|
129
|
+
|
|
130
|
+
return variations
|
|
131
|
+
|
|
132
|
+
|
|
133
|
+
def main():
    parser = argparse.ArgumentParser(description="Generate regression test tasks from score improvements")
    parser.add_argument("--current-scores", required=True, help="Path to current version's scores.json")
    parser.add_argument("--previous-scores", required=True, help="Path to previous version's scores.json")
    parser.add_argument("--tasks-dir", required=True, help="Path to eval/tasks/ (to read originals)")
    parser.add_argument("--output-dir", required=True, help="Directory to write regression tasks")
    parser.add_argument("--max-total-tasks", type=int, default=60, help="Cap total tasks in output-dir (default 60)")
    args = parser.parse_args()

    current = load_json(args.current_scores)
    previous = load_json(args.previous_scores)

    if not current or not previous:
        print("Missing scores files — skipping test growth")
        return

    # Find tasks that were fixed
    fixed = find_fixed_tasks(current, previous)
    if not fixed:
        print("No tasks improved significantly — no regression tasks needed")
        return

    # Check capacity
    existing_count = count_existing_tasks(args.output_dir)
    available_slots = args.max_total_tasks - existing_count
    if available_slots <= 0:
        print(f"Task suite already at capacity ({existing_count}/{args.max_total_tasks}) — skipping growth")
        return

    os.makedirs(args.output_dir, exist_ok=True)
    regression_id = next_regression_id(args.output_dir)
    tasks_added = 0
    fixed_ids = []

    for fix_info in fixed:
        if tasks_added >= available_slots:
            break

        tid = fix_info["task_id"]
        # Load original task
        task_path = os.path.join(args.tasks_dir, f"{tid}.json")
        original = load_json(task_path)
        if not original:
            continue

        original_input = original.get("input", "")
        if not original_input:
            continue

        original_meta = original.get("metadata", {})
        variations = generate_variations(original_input, tid)

        for var_type, var_input in variations:
            if tasks_added >= available_slots:
                break

            reg_id = f"regression_{regression_id:03d}"
            task = {
                "id": reg_id,
                "input": var_input,
                "metadata": {
                    "difficulty": original_meta.get("difficulty", "medium"),
                    "category": original_meta.get("category", "unknown"),
                    "type": "regression",
                    "source": "regression",
                    "regression_for": tid,
                    "variation": var_type,
                    "previous_score": fix_info["previous_score"],
                    "fixed_at_score": fix_info["current_score"],
                },
            }

            # Include expected if original had it
            if "expected" in original:
                task["expected"] = original["expected"]

            out_path = os.path.join(args.output_dir, f"{reg_id}.json")
            with open(out_path, "w") as f:
                json.dump(task, f, indent=2)

            tasks_added += 1
            regression_id += 1

        fixed_ids.append(tid)

    # Output summary
    summary = {
        "tasks_added": tasks_added,
        "fixed_tasks": fixed_ids,
        "total_tasks_now": existing_count + tasks_added,
        "max_total_tasks": args.max_total_tasks,
    }
    print(json.dumps(summary))
    print(f"Added {tasks_added} regression tasks to lock in improvements on: {', '.join(fixed_ids)}")


if __name__ == "__main__":
    main()
@@ -0,0 +1,350 @@
#!/usr/bin/env python3
"""Trace Insights Generator for Harness Evolver.

Analyzes LangSmith traces + per-task scores to produce structured insights.
Clusters errors, analyzes token usage, cross-references with scores,
and generates data-driven hypotheses.

Usage:
    python3 trace_insights.py \
        --langsmith-runs .harness-evolver/langsmith_runs.json \
        --scores .harness-evolver/harnesses/v002/scores.json \
        --tasks-dir .harness-evolver/eval/tasks/ \
        --output .harness-evolver/trace_insights.json \
        [--langsmith-stats .harness-evolver/langsmith_stats.json]

Stdlib-only. No external dependencies.
"""

import argparse
import json
import os
import sys
from datetime import datetime, timezone


def load_json(path):
    """Load JSON file, return None if missing or invalid."""
    if not path or not os.path.exists(path):
        return None
    try:
        with open(path) as f:
            return json.load(f)
    except (json.JSONDecodeError, OSError):
        return None


def cluster_errors(runs):
    """Group runs by error pattern (first 80 chars of error message)."""
    clusters = {}
    for run in runs:
        error = run.get("error")
        if not error:
            continue
        # Normalize: take first 80 chars, strip whitespace
        pattern = error.strip()[:80]
        clusters.setdefault(pattern, []).append(run)
    return [
        {"pattern": pattern, "count": len(runs_list), "run_names": [r.get("name", "?") for r in runs_list[:5]]}
        for pattern, runs_list in sorted(clusters.items(), key=lambda x: -len(x[1]))
    ]


def analyze_tokens(runs):
    """Bucket runs by token usage: low (<500), medium (500-2000), high (>2000)."""
    buckets = {"low": [], "medium": [], "high": []}
    for run in runs:
        tokens = run.get("tokens") or run.get("total_tokens") or 0
        if tokens < 500:
            buckets["low"].append(run)
        elif tokens < 2000:
            buckets["medium"].append(run)
        else:
            buckets["high"].append(run)
    return {
        name: {"count": len(items), "avg_tokens": sum((r.get("tokens") or r.get("total_tokens") or 0) for r in items) / max(len(items), 1)}
        for name, items in buckets.items()
    }


def analyze_responses(runs):
    """Bucket runs by response length: empty, short (<100), normal (100-1000), long (>1000)."""
    buckets = {"empty": [], "short": [], "normal": [], "long": []}
    for run in runs:
        resp = run.get("llm_response") or run.get("output") or ""
        length = len(resp)
        if length == 0:
            buckets["empty"].append(run)
        elif length < 100:
            buckets["short"].append(run)
        elif length < 1000:
            buckets["normal"].append(run)
        else:
            buckets["long"].append(run)
    return {
        name: {"count": len(items)}
        for name, items in buckets.items()
        if items
    }


def cross_reference_scores(runs, scores_data, tasks_dir):
    """Cross-reference trace patterns with per-task scores."""
    per_task = scores_data.get("per_task", {}) if scores_data else {}
    if not per_task:
        return {}

    # Load task metadata for category mapping
    task_meta = {}
    if tasks_dir and os.path.isdir(tasks_dir):
        for fname in os.listdir(tasks_dir):
            if fname.endswith(".json"):
                path = os.path.join(tasks_dir, fname)
                try:
                    with open(path) as f:
                        t = json.load(f)
                    tid = t.get("id", fname.replace(".json", ""))
                    task_meta[tid] = t.get("metadata", {})
                except (json.JSONDecodeError, OSError):
                    pass

    # Score statistics
    scores = [v.get("score", 0) for v in per_task.values() if isinstance(v, dict)]
    if not scores:
        return {}

    failing = {tid: v for tid, v in per_task.items() if isinstance(v, dict) and v.get("score", 0) < 0.5}
    passing = {tid: v for tid, v in per_task.items() if isinstance(v, dict) and v.get("score", 0) >= 0.8}

    # Group failures by category
    failure_categories = {}
    for tid in failing:
        meta = task_meta.get(tid, {})
        cat = meta.get("category", meta.get("type", "unknown"))
        failure_categories.setdefault(cat, []).append(tid)

    return {
        "total_tasks": len(per_task),
        "avg_score": sum(scores) / len(scores),
        "failing_count": len(failing),
        "passing_count": len(passing),
        "failing_task_ids": list(failing.keys()),
        "failure_categories": {cat: tids for cat, tids in sorted(failure_categories.items(), key=lambda x: -len(x[1]))},
    }


def correlate_tokens_scores(runs, scores_data):
    """Check if token usage correlates with task scores."""
    per_task = scores_data.get("per_task", {}) if scores_data else {}
    if not per_task or not runs:
        return None

    # Simple correlation: avg score for high-token vs low-token runs
    token_scores = {"low": [], "medium": [], "high": []}
    for run in runs:
        tokens = run.get("tokens") or run.get("total_tokens") or 0
        # Try to match run to task by name
        name = run.get("name", "")
        for tid, tdata in per_task.items():
            if isinstance(tdata, dict) and tid in name:
                score = tdata.get("score", 0)
                if tokens < 500:
                    token_scores["low"].append(score)
                elif tokens < 2000:
                    token_scores["medium"].append(score)
                else:
                    token_scores["high"].append(score)
                break

    result = {}
    for bucket, scores in token_scores.items():
        if scores:
            result[bucket] = {"count": len(scores), "avg_score": sum(scores) / len(scores)}
    return result if result else None


def generate_hypotheses(error_clusters, token_analysis, response_analysis, score_cross_ref, token_score_corr):
    """Generate data-driven hypotheses about failure patterns."""
    hypotheses = []

    # Hypothesis: errors cause failures
    if error_clusters:
        total_errors = sum(c["count"] for c in error_clusters)
        top_error = error_clusters[0]
        hypotheses.append(
            f"{total_errors} runs had errors. Most common: \"{top_error['pattern']}\" ({top_error['count']} occurrences)"
        )

    # Hypothesis: empty responses
    if response_analysis and response_analysis.get("empty", {}).get("count", 0) > 0:
        n = response_analysis["empty"]["count"]
        hypotheses.append(
            f"{n} runs returned empty responses — possible API timeout, rate limiting, or invalid prompt"
        )

    # Hypothesis: high token usage correlates with low scores
    if token_score_corr:
        high = token_score_corr.get("high", {})
        low = token_score_corr.get("low", {})
        if high.get("avg_score", 1) < low.get("avg_score", 0) - 0.15:
            hypotheses.append(
                f"High-token runs avg score {high['avg_score']:.2f} vs low-token {low['avg_score']:.2f} — model may be verbose but inaccurate"
            )

    # Hypothesis: specific category failures
    if score_cross_ref and score_cross_ref.get("failure_categories"):
        cats = score_cross_ref["failure_categories"]
        top_cat = next(iter(cats))
        count = len(cats[top_cat])
        hypotheses.append(
            f"Category \"{top_cat}\" has {count} failing tasks — may need targeted prompt or tool improvement"
        )

    # Hypothesis: many failing
    if score_cross_ref:
        fail_count = score_cross_ref.get("failing_count", 0)
        total = score_cross_ref.get("total_tasks", 1)
        if fail_count > total * 0.5:
            hypotheses.append(
                f"{fail_count}/{total} tasks failing (>50%) — fundamental approach issue, not edge cases"
            )

    return hypotheses


def identify_top_issues(error_clusters, response_analysis, score_cross_ref):
    """Identify the most impactful issues sorted by severity."""
    issues = []

    # Empty responses = high severity
    if response_analysis and response_analysis.get("empty", {}).get("count", 0) > 0:
        issues.append({
            "type": "empty_response",
            "severity": "high",
            "count": response_analysis["empty"]["count"],
            "description": "Runs returning empty responses",
        })

    # Errors = high severity
    if error_clusters:
        for cluster in error_clusters[:3]:
            issues.append({
                "type": "error",
                "severity": "high" if cluster["count"] > 2 else "medium",
                "count": cluster["count"],
                "pattern": cluster["pattern"],
                "description": f"Error: {cluster['pattern'][:60]}",
            })

    # Category-concentrated failures = medium severity
    if score_cross_ref and score_cross_ref.get("failure_categories"):
        for cat, tids in list(score_cross_ref["failure_categories"].items())[:3]:
            issues.append({
                "type": "category_failure",
                "severity": "medium" if len(tids) >= 3 else "low",
                "category": cat,
                "tasks": tids,
                "description": f"Category \"{cat}\" has {len(tids)} failing tasks",
            })

    # Sort by severity
    severity_order = {"high": 0, "medium": 1, "low": 2}
    issues.sort(key=lambda x: severity_order.get(x.get("severity", "low"), 3))
    return issues


def main():
    parser = argparse.ArgumentParser(description="Generate trace insights from LangSmith data + scores")
    parser.add_argument("--langsmith-runs", required=True, help="Path to langsmith_runs.json")
    parser.add_argument("--langsmith-stats", help="Path to langsmith_stats.json (optional)")
    parser.add_argument("--scores", required=True, help="Path to best version's scores.json")
    parser.add_argument("--tasks-dir", required=True, help="Path to eval/tasks/ directory")
    parser.add_argument("--output", required=True, help="Output path for trace_insights.json")
    args = parser.parse_args()

    runs = load_json(args.langsmith_runs)
    stats = load_json(args.langsmith_stats)
    scores_data = load_json(args.scores)

    if not runs and not scores_data:
        # Nothing to analyze — write minimal insights
        insights = {
            "generated_at": datetime.now(timezone.utc).isoformat(),
            "summary": "No trace data or scores available for analysis",
            "error_clusters": [],
            "token_analysis": {},
            "response_analysis": {},
            "hypotheses": [],
            "top_issues": [],
        }
        os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True)
        with open(args.output, "w") as f:
            json.dump(insights, f, indent=2)
        print("No data available — wrote empty insights")
        return

    runs = runs or []

    # Phase 1: Cluster traces
    error_clusters = cluster_errors(runs)
    token_analysis = analyze_tokens(runs)
    response_analysis = analyze_responses(runs)

    # Phase 2: Cross-reference with scores
    score_cross_ref = cross_reference_scores(runs, scores_data, args.tasks_dir)
    token_score_corr = correlate_tokens_scores(runs, scores_data)

    # Phase 3: Generate hypotheses
    hypotheses = generate_hypotheses(error_clusters, token_analysis, response_analysis, score_cross_ref, token_score_corr)

    # Phase 4: Identify top issues
    top_issues = identify_top_issues(error_clusters, response_analysis, score_cross_ref)

    # Build summary line
    parts = []
    if error_clusters:
        parts.append(f"{len(error_clusters)} error pattern(s)")
    if score_cross_ref:
        parts.append(f"{score_cross_ref.get('failing_count', 0)}/{score_cross_ref.get('total_tasks', 0)} tasks failing")
        parts.append(f"avg score {score_cross_ref.get('avg_score', 0):.2f}")
    summary = "; ".join(parts) if parts else "Analysis complete, no major issues found"

    # Merge stats if available
    stats_summary = {}
    if stats:
        stats_summary = {
            "total_runs": stats.get("total_runs") or stats.get("run_count"),
            "error_rate": stats.get("error_rate"),
            "avg_latency_ms": stats.get("avg_latency_ms") or stats.get("latency_p50"),
            "p95_latency_ms": stats.get("latency_p95"),
            "avg_tokens": stats.get("avg_tokens") or stats.get("avg_total_tokens"),
        }
        # Remove None values
        stats_summary = {k: v for k, v in stats_summary.items() if v is not None}

    insights = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "summary": summary,
        "langsmith_stats": stats_summary if stats_summary else None,
        "error_clusters": error_clusters,
        "token_analysis": token_analysis,
        "response_analysis": response_analysis,
        "score_cross_ref": score_cross_ref if score_cross_ref else None,
        "token_score_correlation": token_score_corr,
        "hypotheses": hypotheses,
        "top_issues": top_issues,
    }

    # Remove None values at top level
    insights = {k: v for k, v in insights.items() if v is not None}

    os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True)
    with open(args.output, "w") as f:
        json.dump(insights, f, indent=2)

    print(f"Trace insights generated: {summary}")
    print(f" {len(error_clusters)} error cluster(s), {len(hypotheses)} hypothesis(es), {len(top_issues)} issue(s)")


if __name__ == "__main__":
    main()