harness-evolver 4.4.0 → 4.5.1
- package/.claude-plugin/plugin.json +1 -1
- package/agents/evolver-evaluator.md +18 -1
- package/agents/evolver-testgen.md +5 -3
- package/package.json +1 -1
- package/skills/evolve/SKILL.md +117 -6
- package/tools/constraint_check.py +154 -0
- package/tools/dataset_health.py +31 -0
- package/tools/evolution_chart.py +16 -2
- package/tools/mine_sessions.py +150 -0
- package/tools/read_results.py +110 -4
- package/tools/run_eval.py +41 -1
- package/tools/secret_filter.py +97 -0
- package/tools/seed_from_traces.py +15 -0
- package/tools/setup.py +17 -1
package/.claude-plugin/plugin.json CHANGED

````diff
@@ -1,7 +1,7 @@
 {
   "name": "harness-evolver",
   "description": "LangSmith-native autonomous agent optimization — evolves LLM agent code using multi-agent proposers, LangSmith experiments, and git worktrees",
-  "version": "4.4.0",
+  "version": "4.5.1",
   "author": {
     "name": "Raphael Valdetaro"
   },
````
package/agents/evolver-evaluator.md CHANGED

````diff
@@ -85,7 +85,24 @@ For each run, apply the requested evaluators. The evaluators you may be asked to
 #### correctness
 Judge: **Is the output a correct, accurate, and complete response to the input?**
 
-
+**Rubric-aware scoring:** Some dataset examples have an `expected_behavior` rubric in their metadata. Before scoring, fetch example metadata:
+
+```bash
+langsmith-cli --json examples list \
+  --dataset "{dataset_name}" \
+  --fields id,metadata \
+  --limit 200 \
+  --output example_metadata.jsonl
+```
+
+Build a map of `reference_example_id → expected_behavior`. When scoring a run whose example has a rubric, evaluate against the rubric criteria specifically.
+
+**With rubric:**
+- `1.0` — Response satisfies all criteria in the rubric
+- `0.5` — Response partially satisfies the rubric (some criteria met, others missing)
+- `0.0` — Response fails to meet the rubric criteria
+
+**Without rubric** (generic scoring):
 - `1.0` — Correct and complete. The response accurately addresses the input.
 - `0.0` — Incorrect, incomplete, or off-topic.
````
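The metadata fetch above feeds a lookup table for the judge. A minimal sketch of building that map, assuming each JSONL row carries `id` and `metadata` fields as the diff describes (the demo rows are hypothetical):

```python
import json

def build_rubric_map(jsonl_lines):
    # Map example id -> expected_behavior rubric; skip examples without one
    rubrics = {}
    for line in jsonl_lines:
        ex = json.loads(line)
        rubric = (ex.get("metadata") or {}).get("expected_behavior")
        if rubric:
            rubrics[ex["id"]] = rubric
    return rubrics

lines = [
    '{"id": "e1", "metadata": {"expected_behavior": "Should mention null safety"}}',
    '{"id": "e2", "metadata": {}}',
]
print(build_rubric_map(lines))  # {'e1': 'Should mention null safety'}
```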
package/agents/evolver-testgen.md CHANGED

````diff
@@ -37,16 +37,18 @@ Do NOT copy production inputs verbatim — generate VARIATIONS.
 
 ### Phase 3: Generate Inputs
 
-Generate 30 test inputs as a JSON file:
+Generate 30 test inputs as a JSON file. Each example MUST include an `expected_behavior` rubric — a description of what a correct response should cover (NOT exact expected text):
 
 ```json
 [
-  {"input": "
-  {"input": "
+  {"input": "What is Kotlin?", "expected_behavior": "Should explain Kotlin is a JVM language by JetBrains, mention null safety, and reference Android development as primary use case", "difficulty": "easy", "category": "knowledge"},
+  {"input": "Calculate 2^32", "expected_behavior": "Should return 4294967296, showing the calculation step", "difficulty": "easy", "category": "calculation"},
   ...
 ]
 ```
 
+The `expected_behavior` is a **rubric**, not exact text. The LLM judge uses it to score responses. Write 1-3 specific, verifiable criteria per example.
+
 Distribution:
 - **40% Standard** (12): typical, well-formed inputs
 - **20% Edge Cases** (6): boundary conditions, minimal inputs
````
package/package.json CHANGED

package/skills/evolve/SKILL.md CHANGED
````diff
@@ -133,6 +133,61 @@ If critical issues found, ask user whether to continue or fix first via AskUserQ
 
 Invoke `/evolver:health` to check and auto-correct dataset issues. If health_report.json shows critical issues that couldn't be auto-corrected, ask user whether to proceed via AskUserQuestion.
 
+### 0.7. Ensure Baseline Has LLM-Judge Scores
+
+The baseline experiment (from setup) only runs code-based evaluators (has_output, token_efficiency). Without LLM-judge scores, the baseline score is inflated — any agent that produces text gets 1.0, making gate checks stop evolution prematurely.
+
+Check if LLM evaluators are configured and the baseline needs scoring:
+
+```bash
+LLM_EVALS=$(python3 -c "import json; c=json.load(open('.evolver.json')); llm=[k for k in c['evaluators'] if k in ('correctness','conciseness')]; print(','.join(llm) if llm else '')")
+BASELINE=$(python3 -c "import json; print(json.load(open('.evolver.json')).get('baseline_experiment', ''))")
+```
+
+If `LLM_EVALS` is non-empty and `BASELINE` exists, check if LLM scores already exist:
+
+```bash
+HAS_LLM_SCORES=$($EVOLVER_PY $TOOLS/read_results.py --experiment "$BASELINE" --config .evolver.json 2>/dev/null | python3 -c "
+import sys, json
+try:
+    r = json.load(sys.stdin)
+    scored_keys = set()
+    for ex in r.get('per_example', {}).values():
+        scored_keys.update(ex.get('scores', {}).keys())
+    llm_keys = set('correctness,conciseness'.split(','))
+    configured = set(k for k in llm_keys if k in '$LLM_EVALS'.split(','))
+    print('yes' if configured.issubset(scored_keys) else 'no')
+except: print('no')
+")
+```
+
+If `HAS_LLM_SCORES` is "no", trigger the evaluator agent on the baseline:
+
+```
+Agent(
+  subagent_type: "evolver-evaluator",
+  description: "Score baseline with LLM-judge",
+  prompt: "Experiments to evaluate: {baseline_experiment}. Evaluators: {llm_evaluator_list}. Framework: {framework}. Entry point: {entry_point}. Dataset: {dataset_name}. NOTE: This is the baseline — score it fairly so evolution has a meaningful starting point. Some examples have expected_behavior rubrics in their metadata — fetch example metadata and use rubrics for scoring when available."
+)
+```
+
+After the evaluator completes, re-read the baseline score and update `.evolver.json`:
+
+```bash
+$EVOLVER_PY $TOOLS/read_results.py --experiment "$BASELINE" --config .evolver.json --output best_results.json 2>/dev/null
+python3 -c "
+import json
+br = json.load(open('best_results.json'))
+c = json.load(open('.evolver.json'))
+new_score = br.get('combined_score', c['best_score'])
+c['best_score'] = new_score
+if c.get('history'):
+    c['history'][0]['score'] = new_score
+json.dump(c, open('.evolver.json', 'w'), indent=2)
+print(f'Baseline re-scored with LLM-judge: {new_score:.3f}')
+"
+```
+
 ### 0.8. Resolve Project Directory
 
 If the project is in a subdirectory of the git repo (e.g., `playground/react-agent/`), worktrees replicate the full repo structure. Read `project_dir` from `.evolver.json` to resolve paths correctly:
````
````diff
@@ -189,6 +244,14 @@ wait # Wait for all data gathering to complete
 ```
 
 If `best_results.json` exists, parse it to find failing examples (score < 0.7). Group by metadata or error pattern.
+**For each failing example, include the judge's feedback comment** (from the `feedback` field) in the strategy. This gives proposers specific, actionable information about WHY examples fail:
+
+```
+## Failing Examples (with judge feedback)
+- "What is Kotlin?" (score: 0.3) — Judge: "Response was factually correct but missed null safety and Android development use cases"
+- "Calculate 2^32" (score: 0.0) — Judge: "Run failed with timeout error"
+```
+
 This failure data feeds into the strategy and lens generation step (1.8a).
 If no best_results.json (first iteration without baseline), all proposers work from code analysis only — no failure data available.
 
````
````diff
@@ -332,10 +395,22 @@ Only run evaluation (Step 3) for proposers that committed changes (not abstained
 
 ### 3. Run Target for Each Candidate (Parallel)
 
-
+First, copy config files into each worktree (untracked files aren't replicated by git — this was the #1 bug in all real-world runs):
+
+```bash
+for WORKTREE in {worktree_paths_with_commits}; do
+  WORKTREE_PROJECT="$WORKTREE"
+  [ -n "$PROJECT_DIR" ] && WORKTREE_PROJECT="$WORKTREE/$PROJECT_DIR"
+
+  # Copy untracked config files needed by run_eval.py and the agent
+  cp .evolver.json "$WORKTREE_PROJECT/.evolver.json" 2>/dev/null
+  [ -f .env ] && cp .env "$WORKTREE_PROJECT/.env" 2>/dev/null
+done
+```
+
+Then run evaluations for ALL candidates simultaneously:
 
 ```bash
-# Launch all evaluations in parallel
 for WORKTREE in {worktree_paths_with_commits}; do
   WORKTREE_PROJECT="$WORKTREE"
   [ -n "$PROJECT_DIR" ] && WORKTREE_PROJECT="$WORKTREE/$PROJECT_DIR"
````
````diff
@@ -373,7 +448,7 @@ Then spawn ONE evaluator agent that scores ALL candidates in a single pass. This
 Agent(
   subagent_type: "evolver-evaluator",
   description: "Evaluate all candidates for iteration v{NNN}",
-  prompt: "Experiments to evaluate: {comma-separated experiment names from non-abstained proposers}. Evaluators: {llm_evaluator_list}. Framework: {framework}. Entry point: {entry_point}."
+  prompt: "Experiments to evaluate: {comma-separated experiment names from non-abstained proposers}. Evaluators: {llm_evaluator_list}. Framework: {framework}. Entry point: {entry_point}. Dataset: {dataset_name}. NOTE: Some examples have expected_behavior rubrics in their metadata — fetch example metadata and use rubrics for scoring when available."
 )
 ```
 
````
````diff
@@ -385,17 +460,47 @@ Wait for the evaluator agent to complete before proceeding.
 $EVOLVER_PY $TOOLS/read_results.py \
   --experiments "{comma-separated list of experiment names from non-abstained proposers}" \
   --config .evolver.json \
+  --split held_out \
   --output comparison.json
 ```
 
 Parse `comparison.json`:
-- `comparison.winner` — highest combined score
+- `comparison.winner` — highest combined score **on held-out data** (never seen during optimization)
 - `comparison.champion` — per-task champion (for next iteration's context)
+- `comparison.pareto_front` — non-dominated candidates across evaluators (if >1, report tradeoffs)
 - `comparison.all_candidates` — all scores for reporting
 
+If `comparison.pareto_front` has more than 1 entry, report it:
+```
+Pareto front ({N} non-dominated candidates):
+  v{NNN}-1: {evaluator_scores} (winner by combined score)
+  v{NNN}-3: {evaluator_scores} (different tradeoff)
+```
+
+### 4.5. Constraint Gate
+
+Before merging, validate the winner passes hard constraints:
+
+```bash
+$EVOLVER_PY $TOOLS/constraint_check.py \
+  --config .evolver.json \
+  --worktree-path "{winner_worktree_path}" \
+  --baseline-path "." \
+  --output constraint_result.json
+```
+
+If `all_pass` is false, skip this candidate and try the next-best from `comparison.all_candidates`. If NO candidates pass constraints, log a warning and proceed to next iteration without merging:
+
+```
+WARNING: No candidates passed constraint gates. Skipping merge.
+  growth: {growth_pct}% (limit: 30%)
+  entry_point: {pass/fail}
+  tests: {pass/fail}
+```
+
 ### 5. Merge Winner
 
-If the winner scored higher than the current best:
+If the winner scored higher than the current best AND passed constraint gates:
 
 ```bash
 # Get the winning worktree's branch
````
````diff
@@ -413,6 +518,11 @@ Extract winner metrics for the chart:
 - `per_evaluator` → average each evaluator's scores across per_example from best_results.json
 - `approach` → first line of `## Approach` section from winner's proposal.md
 - `lens` → the `source` field from the winning proposer's lens in lenses.json
+- `code_loc` → count lines of code after merge for growth tracking:
+
+```bash
+CODE_LOC=$(find . -name "*.py" -not -path "./.venv/*" -not -path "./venv/*" -not -path "./__pycache__/*" | xargs wc -l 2>/dev/null | tail -1 | awk '{print $1}')
+```
 
 ```python
 import json
````
````diff
@@ -431,7 +541,8 @@ c['history'].append({
   'total': {winner_total},
   'per_evaluator': {winner_per_evaluator_dict},
   'approach': '{approach_from_proposal_md}',
-  'lens': '{lens_source}'
+  'lens': '{lens_source}',
+  'code_loc': {code_loc}
 })
 json.dump(c, open('.evolver.json', 'w'), indent=2)
 ```
````
@@ -0,0 +1,154 @@
|
|
|
1
|
+
#!/usr/bin/env python3
|
|
2
|
+
"""Constraint checker for evolution proposals.
|
|
3
|
+
|
|
4
|
+
Validates that a candidate proposal doesn't violate hard constraints
|
|
5
|
+
before it's merged. Inspired by Hermes Agent Self-Evolution.
|
|
6
|
+
|
|
7
|
+
Usage:
|
|
8
|
+
python3 constraint_check.py \
|
|
9
|
+
--config .evolver.json \
|
|
10
|
+
--worktree-path /tmp/worktree \
|
|
11
|
+
--baseline-path /path/to/main \
|
|
12
|
+
--output constraint_result.json
|
|
13
|
+
|
|
14
|
+
Stdlib-only — no langsmith dependency.
|
|
15
|
+
"""
|
|
16
|
+
|
|
17
|
+
import argparse
|
|
18
|
+
import json
|
|
19
|
+
import os
|
|
20
|
+
import subprocess
|
|
21
|
+
import sys
|
|
22
|
+
|
|
23
|
+
|
|
24
|
+
def count_loc(directory, extensions=(".py", ".js", ".ts", ".jsx", ".tsx")):
|
|
25
|
+
"""Count lines of code in a directory, excluding venvs and node_modules."""
|
|
26
|
+
total = 0
|
|
27
|
+
skip_dirs = {".venv", "venv", "node_modules", "__pycache__", ".git"}
|
|
28
|
+
for root, dirs, files in os.walk(directory):
|
|
29
|
+
dirs[:] = [d for d in dirs if d not in skip_dirs]
|
|
30
|
+
for f in files:
|
|
31
|
+
if any(f.endswith(ext) for ext in extensions):
|
|
32
|
+
try:
|
|
33
|
+
with open(os.path.join(root, f)) as fh:
|
|
34
|
+
total += sum(1 for _ in fh)
|
|
35
|
+
except (OSError, UnicodeDecodeError):
|
|
36
|
+
pass
|
|
37
|
+
return total
|
|
38
|
+
|
|
39
|
+
|
|
40
|
+
def check_growth(baseline_loc, candidate_loc, max_growth_pct=30):
|
|
41
|
+
"""Check code didn't grow beyond threshold."""
|
|
42
|
+
if baseline_loc == 0:
|
|
43
|
+
return {"pass": True, "reason": "no baseline LOC"}
|
|
44
|
+
growth = ((candidate_loc - baseline_loc) / baseline_loc) * 100
|
|
45
|
+
passed = growth <= max_growth_pct
|
|
46
|
+
return {
|
|
47
|
+
"pass": passed,
|
|
48
|
+
"baseline_loc": baseline_loc,
|
|
49
|
+
"candidate_loc": candidate_loc,
|
|
50
|
+
"growth_pct": round(growth, 1),
|
|
51
|
+
"max_growth_pct": max_growth_pct,
|
|
52
|
+
"reason": f"Code growth {growth:.1f}% {'<=' if passed else '>'} {max_growth_pct}% limit",
|
|
53
|
+
}
|
|
54
|
+
|
|
55
|
+
|
|
56
|
+
def check_entry_point(worktree_path, entry_point):
|
|
57
|
+
"""Check that the entry point is still runnable (syntax check)."""
|
|
58
|
+
parts = entry_point.split()
|
|
59
|
+
script_file = None
|
|
60
|
+
for part in parts:
|
|
61
|
+
if part.endswith((".py", ".js", ".ts", ".sh")):
|
|
62
|
+
script_file = part
|
|
63
|
+
break
|
|
64
|
+
|
|
65
|
+
if not script_file:
|
|
66
|
+
return {"pass": True, "reason": "no script file detected in entry_point"}
|
|
67
|
+
|
|
68
|
+
full_path = os.path.join(worktree_path, script_file)
|
|
69
|
+
if not os.path.exists(full_path):
|
|
70
|
+
return {"pass": False, "reason": f"entry point file missing: {script_file}"}
|
|
71
|
+
|
|
72
|
+
if script_file.endswith(".py"):
|
|
73
|
+
result = subprocess.run(
|
|
74
|
+
["python3", "-m", "py_compile", full_path],
|
|
75
|
+
capture_output=True, text=True,
|
|
76
|
+
)
|
|
77
|
+
if result.returncode != 0:
|
|
78
|
+
return {"pass": False, "reason": f"syntax error: {result.stderr[:200]}"}
|
|
79
|
+
|
|
80
|
+
return {"pass": True, "reason": "entry point exists and has valid syntax"}
|
|
81
|
+
|
|
82
|
+
|
|
83
|
+
def check_tests(worktree_path):
|
|
84
|
+
"""Run test suite if it exists. Returns pass if no tests found."""
|
|
85
|
+
test_dirs = ["tests", "test"]
|
|
86
|
+
has_tests = False
|
|
87
|
+
for td in test_dirs:
|
|
88
|
+
test_path = os.path.join(worktree_path, td)
|
|
89
|
+
if os.path.isdir(test_path):
|
|
90
|
+
for f in os.listdir(test_path):
|
|
91
|
+
if f.startswith("test_") and f.endswith(".py"):
|
|
92
|
+
has_tests = True
|
|
93
|
+
break
|
|
94
|
+
|
|
95
|
+
if not has_tests:
|
|
96
|
+
return {"pass": True, "reason": "no test suite found (skipped)", "skipped": True}
|
|
97
|
+
|
|
98
|
+
try:
|
|
99
|
+
result = subprocess.run(
|
|
100
|
+
["python3", "-m", "pytest", "-q", "--tb=no"],
|
|
101
|
+
capture_output=True, text=True,
|
|
102
|
+
cwd=worktree_path, timeout=120,
|
|
103
|
+
)
|
|
104
|
+
passed = result.returncode == 0
|
|
105
|
+
return {
|
|
106
|
+
"pass": passed,
|
|
107
|
+
"reason": result.stdout.strip()[:200] if passed else result.stderr.strip()[:200],
|
|
108
|
+
"skipped": False,
|
|
109
|
+
}
|
|
110
|
+
except FileNotFoundError:
|
|
111
|
+
return {"pass": True, "reason": "pytest not available (skipped)", "skipped": True}
|
|
112
|
+
except subprocess.TimeoutExpired:
|
|
113
|
+
return {"pass": False, "reason": "test suite timed out after 120s", "skipped": False}
|
|
114
|
+
|
|
115
|
+
|
|
116
|
+
def main():
|
|
117
|
+
parser = argparse.ArgumentParser(description="Check constraints on a proposal")
|
|
118
|
+
parser.add_argument("--config", default=".evolver.json")
|
|
119
|
+
parser.add_argument("--worktree-path", required=True, help="Candidate worktree path")
|
|
120
|
+
parser.add_argument("--baseline-path", default=".", help="Baseline (main) path")
|
|
121
|
+
parser.add_argument("--max-growth", type=int, default=30, help="Max code growth %% (default 30)")
|
|
122
|
+
parser.add_argument("--output", default=None)
|
|
123
|
+
args = parser.parse_args()
|
|
124
|
+
|
|
125
|
+
with open(args.config) as f:
|
|
126
|
+
config = json.load(f)
|
|
127
|
+
|
|
128
|
+
entry_point = config.get("entry_point", "")
|
|
129
|
+
ep_for_check = entry_point.split("python ")[-1].split("python3 ")[-1]
|
|
130
|
+
|
|
131
|
+
results = {
|
|
132
|
+
"growth": check_growth(
|
|
133
|
+
count_loc(args.baseline_path),
|
|
134
|
+
count_loc(args.worktree_path),
|
|
135
|
+
args.max_growth,
|
|
136
|
+
),
|
|
137
|
+
"entry_point": check_entry_point(args.worktree_path, ep_for_check),
|
|
138
|
+
"tests": check_tests(args.worktree_path),
|
|
139
|
+
}
|
|
140
|
+
|
|
141
|
+
all_pass = all(r["pass"] for r in results.values())
|
|
142
|
+
output = {"all_pass": all_pass, "constraints": results}
|
|
143
|
+
|
|
144
|
+
out_str = json.dumps(output, indent=2)
|
|
145
|
+
if args.output:
|
|
146
|
+
with open(args.output, "w") as f:
|
|
147
|
+
f.write(out_str)
|
|
148
|
+
print(out_str)
|
|
149
|
+
|
|
150
|
+
sys.exit(0 if all_pass else 1)
|
|
151
|
+
|
|
152
|
+
|
|
153
|
+
if __name__ == "__main__":
|
|
154
|
+
main()
|
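The growth gate in the new tool is plain percentage math. A standalone sketch of `check_growth` as it appears in the diff, exercised with hypothetical LOC counts:

```python
def check_growth(baseline_loc, candidate_loc, max_growth_pct=30):
    """Check code didn't grow beyond threshold (mirrors constraint_check.py)."""
    if baseline_loc == 0:
        return {"pass": True, "reason": "no baseline LOC"}
    growth = ((candidate_loc - baseline_loc) / baseline_loc) * 100
    passed = growth <= max_growth_pct
    return {
        "pass": passed,
        "growth_pct": round(growth, 1),
        "reason": f"Code growth {growth:.1f}% {'<=' if passed else '>'} {max_growth_pct}% limit",
    }

print(check_growth(1000, 1250)["pass"], check_growth(1000, 1250)["growth_pct"])  # True 25.0
print(check_growth(1000, 1400)["pass"], check_growth(1000, 1400)["growth_pct"])  # False 40.0
```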
package/tools/dataset_health.py CHANGED
````diff
@@ -15,6 +15,14 @@ import os
 import sys
 from datetime import datetime, timezone
 
+# Secret detection (local import from same directory)
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+try:
+    from secret_filter import has_secrets
+except ImportError:
+    def has_secrets(text):
+        return False
+
 
 def ensure_langsmith_api_key():
     """Load API key from langsmith-cli credentials if not in env."""
````
````diff
@@ -344,10 +352,32 @@ def main():
    except Exception:
        pass

+    # Check for secrets in dataset examples
+    secrets_check = {"checked": True, "flagged_count": 0, "flagged_ids": [], "clean": True}
+    for ex in examples:
+        text = str(getattr(ex, 'inputs', '') or '') + str(getattr(ex, 'outputs', '') or '')
+        if has_secrets(text):
+            secrets_check["flagged_count"] += 1
+            secrets_check["flagged_ids"].append(str(ex.id))
+            secrets_check["clean"] = False
+    secrets_check["flagged_ids"] = secrets_check["flagged_ids"][:10]  # Cap at 10
+
     # Compute health score and build report
     health_score = compute_health_score(size_info, difficulty, dead, coverage, splits)
     issues, corrections = build_issues_and_corrections(size_info, difficulty, dead, coverage, splits)
 
+    # Add secret issues
+    if not secrets_check["clean"]:
+        issues.append({
+            "severity": "critical",
+            "message": f"{secrets_check['flagged_count']} example(s) contain potential secrets (API keys, tokens)",
+        })
+        corrections.append({
+            "action": "remove_secrets",
+            "description": f"Remove or redact {secrets_check['flagged_count']} examples with detected secrets",
+            "example_ids": secrets_check["flagged_ids"],
+        })
+
     report = {
         "generated_at": datetime.now(timezone.utc).isoformat(),
         "health_score": health_score,
````
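The check above delegates to `secret_filter.py`, which is not shown in this diff (listed as +97 lines). The stand-in below only illustrates how a `has_secrets(text)` predicate might flag common credential formats; the patterns are assumptions, not the package's actual rules:

```python
import re

# Hypothetical stand-in for secret_filter.has_secrets
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),   # OpenAI-style API keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),   # GitHub personal access tokens
    re.compile(r"AKIA[0-9A-Z]{16}"),      # AWS access key IDs
]

def has_secrets(text):
    # True if any known credential pattern appears in the text
    return any(p.search(text) for p in SECRET_PATTERNS)

print(has_secrets("my key is sk-" + "a" * 24))  # True
print(has_secrets("What is Kotlin?"))           # False
```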
````diff
@@ -357,6 +387,7 @@ def main():
         "dead_examples": dead,
         "coverage": coverage,
         "splits": splits,
+        "secrets": secrets_check,
         "issues": issues,
         "corrections": corrections,
     }
````
package/tools/evolution_chart.py CHANGED
````diff
@@ -101,7 +101,9 @@ def render_score_table(history, scores, c):
     lines = []
     lines.append(f' {c.B}SCORE PROGRESSION{c.RST}')
     lines.append(f' {c.D}{"─" * W}{c.RST}')
-
+    has_loc = any(h.get('code_loc') for h in history)
+    loc_hdr = f'{"LOC":>6}' if has_loc else ''
+    lines.append(f' {c.D}{"Version":<10}{"Score":>6}{"Δ":>8}{"vs Base":>9}{"Pass":>7}{"Err":>5}{"Tokens":>8}{"Latency":>9}{loc_hdr}{c.RST}')
     lines.append(f' {c.D}{"─" * W}{c.RST}')
 
     for i, h in enumerate(history):
````
````diff
@@ -140,7 +142,19 @@ def render_score_table(history, scores, c):
         tok_str = fmt_tokens(tokens)
         lat_str = f'{latency}ms' if latency else '—'
 
-
+        loc_str = ''
+        if has_loc:
+            loc = h.get('code_loc')
+            if loc:
+                base_loc = history[0].get('code_loc', 0)
+                if base_loc and loc > base_loc * 1.3:
+                    loc_str = f' {c.R}{loc}{c.RST}⚠'
+                else:
+                    loc_str = f' {loc:>5}'
+            else:
+                loc_str = ' —'
+
+        lines.append(f' {v:<10}{s_str:>6} {d_str} {p_str} {pass_str:>5} {e_str:>3} {tok_str:>6} {lat_str:>6}{loc_str} {icon}')
 
     return '\n'.join(lines)
 
````
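The LOC column above warns once a candidate exceeds 1.3x the baseline line count, matching the 30% constraint gate. A reduced sketch of that threshold check (function name and demo values are hypothetical; color codes dropped):

```python
def flag_loc(loc, base_loc):
    # Mirrors the chart logic: warn when LOC grew past 1.3x the baseline
    if not loc:
        return "—"
    if base_loc and loc > base_loc * 1.3:
        return f"{loc}⚠"
    return str(loc)

print(flag_loc(1400, 1000))  # 1400⚠
print(flag_loc(1200, 1000))  # 1200
```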
@@ -0,0 +1,150 @@
|
|
|
1
|
+
#!/usr/bin/env python3
|
|
2
|
+
"""Mine Claude Code session history for eval dataset examples.
|
|
3
|
+
|
|
4
|
+
Reads ~/.claude/ session files to extract real user interactions
|
|
5
|
+
that can be used as evaluation data. Filters for relevance to the
|
|
6
|
+
agent being optimized, detects and skips secrets.
|
|
7
|
+
|
|
8
|
+
Usage:
|
|
9
|
+
python3 mine_sessions.py \
|
|
10
|
+
--agent-description "A ReAct agent that answers questions using tools" \
|
|
11
|
+
--output session_examples.json \
|
|
12
|
+
[--max-examples 50]
|
|
13
|
+
|
|
14
|
+
Stdlib-only except for secret_filter (local import).
|
|
15
|
+
"""
|
|
16
|
+
|
|
17
|
+
import argparse
|
|
18
|
+
import glob
|
|
19
|
+
import json
|
|
20
|
+
import os
|
|
21
|
+
import sys
|
|
22
|
+
|
|
23
|
+
|
|
24
|
+
def find_session_files():
|
|
25
|
+
"""Find Claude Code session history files."""
|
|
26
|
+
candidates = [
|
|
27
|
+
os.path.expanduser("~/.claude/history.jsonl"),
|
|
28
|
+
os.path.expanduser("~/.claude/sessions/*/messages.jsonl"),
|
|
29
|
+
]
|
|
30
|
+
found = []
|
|
31
|
+
for pattern in candidates:
|
|
32
|
+
found.extend(glob.glob(pattern))
|
|
33
|
+
return found
|
|
34
|
+
|
|
35
|
+
|
|
36
|
+
def extract_messages(file_path):
|
|
37
|
+
"""Extract user->assistant message pairs from a session file."""
|
|
38
|
+
pairs = []
|
|
39
|
+
try:
|
|
40
|
+
with open(file_path) as f:
|
|
41
|
+
messages = []
|
|
42
|
+
for line in f:
|
|
43
|
+
line = line.strip()
|
|
44
|
+
if not line:
|
|
45
|
+
continue
|
|
46
|
+
try:
|
|
47
|
+
msg = json.loads(line)
|
|
48
|
+
messages.append(msg)
|
|
49
|
+
except json.JSONDecodeError:
|
|
50
|
+
continue
|
|
51
|
+
|
|
52
|
+
for i in range(len(messages) - 1):
|
|
53
|
+
if (messages[i].get("role") == "user" and
|
|
54
|
+
messages[i + 1].get("role") == "assistant"):
|
|
55
|
+
user_text = messages[i].get("content", "")
|
|
56
|
+
if isinstance(user_text, list):
|
|
57
|
+
user_text = " ".join(
|
|
58
|
+
p.get("text", "") for p in user_text
|
|
59
|
+
if isinstance(p, dict) and p.get("type") == "text"
|
|
60
|
+
)
|
|
61
|
+
asst_text = messages[i + 1].get("content", "")
|
|
62
|
+
if isinstance(asst_text, list):
|
|
63
|
+
asst_text = " ".join(
|
|
64
|
+
p.get("text", "") for p in asst_text
|
|
65
|
+
if isinstance(p, dict) and p.get("type") == "text"
|
|
66
|
+
)
|
|
67
|
+
|
|
68
|
+
if user_text and len(user_text) > 10:
|
|
69
|
+
pairs.append({
|
|
70
|
+
"input": user_text[:500],
|
|
71
|
+
"output_preview": asst_text[:200] if asst_text else "",
|
|
72
|
+
"source_file": os.path.basename(file_path),
|
|
73
|
+
})
|
|
74
|
+
except (OSError, UnicodeDecodeError):
|
|
75
|
+
pass
|
|
76
|
+
return pairs
|
|
77
|
+
|
|
78
|
+
|
|
79
|
+
def filter_relevant(pairs, agent_description, max_examples=50):
|
|
80
|
+
"""Simple keyword-based relevance filter."""
|
|
81
|
+
stop_words = {"a", "an", "the", "is", "are", "was", "were", "that", "this",
|
|
82
|
+
"and", "or", "for", "to", "in", "on", "with", "using"}
|
|
83
|
+
keywords = set(
|
|
84
|
+
w.lower() for w in agent_description.split()
|
|
85
|
+
if len(w) > 3 and w.lower() not in stop_words
|
|
86
|
+
)
|
|
87
|
+
|
|
88
|
+
scored = []
|
|
89
|
+
for pair in pairs:
|
|
90
|
+
input_words = set(pair["input"].lower().split())
|
|
91
|
+
overlap = len(keywords & input_words)
|
|
92
|
+
if overlap >= 1:
|
|
93
|
+
scored.append((overlap, pair))
|
|
94
|
+
|
|
95
|
+
scored.sort(key=lambda x: -x[0])
|
|
96
|
+
return [pair for _, pair in scored[:max_examples]]
|
|
97
|
+
|
|
98
|
+
|
|
99
|
+
def main():
|
|
100
|
+
parser = argparse.ArgumentParser(description="Mine Claude Code sessions for eval data")
|
|
101
|
+
parser.add_argument("--agent-description", required=True, help="Description of the agent being optimized")
|
|
102
|
+
parser.add_argument("--output", default="session_examples.json")
|
|
103
|
+
parser.add_argument("--max-examples", type=int, default=50)
|
|
104
|
+
args = parser.parse_args()
|
|
105
|
+
|
|
106
|
+
sys.path.insert(0, os.path.dirname(__file__))
|
|
107
|
+
try:
|
|
108
|
+
from secret_filter import has_secrets
|
|
109
|
+
except ImportError:
|
|
110
|
+
has_secrets = lambda text: False # noqa: E731
|
|
111
|
+
|
|
112
|
+
session_files = find_session_files()
|
|
113
|
+
if not session_files:
|
|
114
|
+
print("No Claude Code session files found.", file=sys.stderr)
|
|
115
|
+
print(json.dumps({"mined": 0, "output": args.output}))
|
|
116
|
+
sys.exit(0)
|
|
117
|
+
|
|
118
|
+
print(f"Found {len(session_files)} session file(s)", file=sys.stderr)
|
|
119
|
+
|
|
120
|
+
all_pairs = []
|
|
121
|
+
secrets_skipped = 0
|
|
122
|
+
for sf in session_files:
|
|
123
|
+
pairs = extract_messages(sf)
|
|
124
|
+
for p in pairs:
|
|
125
|
+
if has_secrets(p["input"]) or has_secrets(p.get("output_preview", "")):
|
|
126
|
+
secrets_skipped += 1
|
|
127
|
+
continue
|
|
128
|
+
all_pairs.append(p)
|
|
129
|
+
|
|
130
|
+
print(f"Extracted {len(all_pairs)} message pairs ({secrets_skipped} skipped for secrets)", file=sys.stderr)
|
|
131
|
+
|
|
132
|
+
relevant = filter_relevant(all_pairs, args.agent_description, args.max_examples)
|
|
133
|
+
print(f"Filtered to {len(relevant)} relevant examples", file=sys.stderr)
|
|
134
|
+
|
|
135
|
+
examples = []
|
|
136
|
+
for p in relevant:
|
|
137
|
+
examples.append({
|
|
138
|
+
"input": p["input"],
|
|
139
|
+
"metadata": {"source": "session_mining", "source_file": p["source_file"]},
|
|
140
|
+
})
|
|
141
|
+
|
|
142
|
+
output = {"examples": examples, "count": len(examples), "source": "claude_code_sessions"}
|
|
143
|
+
with open(args.output, "w") as f:
|
|
144
|
+
json.dump(output, f, indent=2)
|
|
145
|
+
|
|
146
|
+
print(json.dumps({"mined": len(examples), "output": args.output}))
|
|
147
|
+
|
|
148
|
+
|
|
149
|
+
if __name__ == "__main__":
|
|
150
|
+
main()
|
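The keyword-overlap filter in `mine_sessions.py` can be exercised in isolation. A sketch using the function as written in the diff, fed hypothetical session pairs:

```python
def filter_relevant(pairs, agent_description, max_examples=50):
    """Keyword-overlap relevance filter, as in mine_sessions.py."""
    stop_words = {"a", "an", "the", "is", "are", "was", "were", "that", "this",
                  "and", "or", "for", "to", "in", "on", "with", "using"}
    keywords = set(
        w.lower() for w in agent_description.split()
        if len(w) > 3 and w.lower() not in stop_words
    )
    scored = []
    for pair in pairs:
        overlap = len(keywords & set(pair["input"].lower().split()))
        if overlap >= 1:
            scored.append((overlap, pair))
    scored.sort(key=lambda x: -x[0])
    return [pair for _, pair in scored[:max_examples]]

pairs = [
    {"input": "how do I add a tool to my react agent"},
    {"input": "what's for dinner tonight"},
]
desc = "A ReAct agent that answers questions using tools"
relevant = filter_relevant(pairs, desc)
print([p["input"] for p in relevant])  # ['how do I add a tool to my react agent']
```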
package/tools/read_results.py CHANGED
````diff
@@ -61,7 +61,28 @@ def ensure_langsmith_api_key():
     return False
 
 
-def read_experiment(client, experiment_name):
+def weighted_score(scores, weights=None):
+    """Calculate weighted average of evaluator scores.
+
+    If weights provided, use them. Otherwise flat average.
+    Weights are normalized (don't need to sum to 1).
+    """
+    if not scores:
+        return 0.0
+    if not weights:
+        return sum(scores.values()) / len(scores)
+
+    total_weight = 0
+    weighted_sum = 0
+    for key, val in scores.items():
+        w = weights.get(key, 1.0)
+        weighted_sum += val * w
+        total_weight += w
+
+    return weighted_sum / total_weight if total_weight > 0 else 0.0
+
+
+def read_experiment(client, experiment_name, weights=None):
     """Read results from a single LangSmith experiment."""
     try:
         # List runs for this experiment
````
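A quick sketch of how the new `weighted_score` helper behaves, using the function as written in the diff (the score and weight values are hypothetical):

```python
def weighted_score(scores, weights=None):
    """Weighted average of evaluator scores; flat average when no weights."""
    if not scores:
        return 0.0
    if not weights:
        return sum(scores.values()) / len(scores)
    total_weight = 0
    weighted_sum = 0
    for key, val in scores.items():
        w = weights.get(key, 1.0)   # evaluators without a weight default to 1.0
        weighted_sum += val * w
        total_weight += w
    return weighted_sum / total_weight if total_weight > 0 else 0.0

scores = {"correctness": 0.8, "conciseness": 0.4}
print(round(weighted_score(scores), 3))                        # 0.6
print(round(weighted_score(scores, {"correctness": 3.0}), 3))  # 0.7
```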
````diff
@@ -103,13 +124,17 @@ def read_experiment(client, experiment_name):
        # Read feedback/scores from pre-fetched batch
        feedbacks = fb_map.get(str(run.id), [])
        scores = {}
+        feedback_comments = {}
        for fb in feedbacks:
            if fb.score is not None:
                scores[fb.key] = fb.score
+            if fb.comment:
+                feedback_comments[fb.key] = fb.comment

        per_example[example_id] = {
-            "score":
+            "score": weighted_score(scores, weights),
            "scores": scores,
+            "feedback": feedback_comments,
            "tokens": tokens,
            "latency_ms": latency_ms,
            "error": run.error[:200] if run.error else None,
````
@@ -136,6 +161,50 @@ def read_experiment(client, experiment_name):
|
|
|
136
161
|
return {"experiment": experiment_name, "error": str(e), "combined_score": 0.0}
|
|
137
162
|
|
|
138
163
|
|
|
164
|
+
def pareto_front(candidates):
|
|
165
|
+
"""Find Pareto-optimal candidates (not dominated on any evaluator).
|
|
166
|
+
|
|
167
|
+
A candidate is dominated if another scores >= on ALL evaluators
|
|
168
|
+
and strictly > on at least one.
|
|
169
|
+
"""
|
|
170
|
+
if len(candidates) <= 1:
|
|
171
|
+
return candidates
|
|
172
|
+
|
|
173
|
+
front = []
|
|
174
|
+
for i, ci in enumerate(candidates):
|
|
175
|
+
dominated = False
|
|
176
|
+
ci_scores = ci.get("evaluator_scores", {})
|
|
177
|
+
if not ci_scores:
|
|
178
|
+
front.append(ci)
|
|
179
|
+
continue
|
|
180
|
+
|
|
181
|
+
for j, cj in enumerate(candidates):
|
|
182
|
+
if i == j:
|
|
183
|
+
continue
|
|
184
|
+
cj_scores = cj.get("evaluator_scores", {})
|
|
185
|
+
if not cj_scores:
|
|
186
|
+
continue
|
|
187
|
+
|
|
188
|
+
all_geq = True
|
|
189
|
+
any_gt = False
|
|
190
|
+
for key in ci_scores:
|
|
191
|
+
if key in cj_scores:
|
|
192
|
+
if cj_scores[key] < ci_scores[key]:
|
|
193
|
+
all_geq = False
|
|
194
|
+
break
|
|
195
|
+
if cj_scores[key] > ci_scores[key]:
|
|
196
|
+
any_gt = True
|
|
197
|
+
|
|
198
|
+
if all_geq and any_gt:
|
|
199
|
+
dominated = True
|
|
200
|
+
break
|
|
201
|
+
|
|
202
|
+
if not dominated:
|
|
203
|
+
front.append(ci)
|
|
204
|
+
|
|
205
|
+
return front if front else candidates[:1]
|
|
206
|
+
|
|
207
|
+
|
|
139
208
|
def compare_experiments(results_list):
|
|
140
209
|
"""Compare multiple experiment results and find winner + per-task champion."""
|
|
141
210
|
if not results_list:
|
|
@@ -173,12 +242,27 @@ def compare_experiments(results_list):
|
|
|
173
242
|
"task_wins": task_wins[champion_name],
|
|
174
243
|
}
|
|
175
244
|
|
|
245
|
+
# Compute per-evaluator averages for Pareto analysis
|
|
246
|
+
for result in valid:
|
|
247
|
+
eval_avgs = {}
|
|
248
|
+
for ex_data in result.get("per_example", {}).values():
|
|
249
|
+
for ev_key, ev_score in ex_data.get("scores", {}).items():
|
|
250
|
+
eval_avgs.setdefault(ev_key, []).append(ev_score)
|
|
251
|
+
result["evaluator_scores"] = {k: sum(v) / len(v) for k, v in eval_avgs.items()}
|
|
252
|
+
|
|
253
|
+
front = pareto_front(valid)
|
|
254
|
+
|
|
176
255
|
return {
|
|
177
256
|
"winner": {
|
|
178
257
|
"experiment": winner["experiment"],
|
|
179
258
|
"score": winner["combined_score"],
|
|
180
259
|
},
|
|
181
260
|
"champion": champion,
|
|
261
|
+
"pareto_front": [
|
|
262
|
+
{"experiment": r["experiment"], "score": r["combined_score"],
|
|
263
|
+
"evaluator_scores": r.get("evaluator_scores", {})}
|
|
264
|
+
for r in front
|
|
265
|
+
],
|
|
182
266
|
"all_candidates": [
|
|
183
267
|
{
|
|
184
268
|
"experiment": r["experiment"],
|
|
@@ -231,12 +315,19 @@ def main():
|
|
|
231
315
|
args = parser.parse_args()
|
|
232
316
|
ensure_langsmith_api_key()
|
|
233
317
|
|
|
318
|
+
# Load evaluator weights from config if available
|
|
319
|
+
weights = None
|
|
320
|
+
if os.path.exists(args.config):
|
|
321
|
+
with open(args.config) as f:
|
|
322
|
+
cfg = json.load(f)
|
|
323
|
+
weights = cfg.get("evaluator_weights")
|
|
324
|
+
|
|
234
325
|
from langsmith import Client
|
|
235
326
|
client = Client()
|
|
236
327
|
|
|
237
328
|
if args.experiment:
|
|
238
329
|
# Single experiment
|
|
239
|
-
result = read_experiment(client, args.experiment)
|
|
330
|
+
result = read_experiment(client, args.experiment, weights=weights)
|
|
240
331
|
if not result:
|
|
241
332
|
print(f"No results found for experiment: {args.experiment}", file=sys.stderr)
|
|
242
333
|
sys.exit(1)
|
|
@@ -267,10 +358,25 @@ def main():
|
|
|
267
358
|
experiment_names = [e.strip() for e in args.experiments.split(",")]
|
|
268
359
|
results_list = []
|
|
269
360
|
|
|
361
|
+
# Load split filter if requested
|
|
362
|
+
split_example_ids = None
|
|
363
|
+
if args.split:
|
|
364
|
+
with open(args.config) as f:
|
|
365
|
+
cfg_for_split = json.load(f)
|
|
366
|
+
split_example_ids = set()
|
|
367
|
+
for ex in client.list_examples(dataset_name=cfg_for_split["dataset"], splits=[args.split]):
|
|
368
|
+
split_example_ids.add(str(ex.id))
|
|
369
|
+
|
|
270
370
|
for name in experiment_names:
|
|
271
371
|
print(f"Reading experiment: {name}...", file=sys.stderr)
|
|
272
|
-
result = read_experiment(client, name)
|
|
372
|
+
result = read_experiment(client, name, weights=weights)
|
|
273
373
|
if result:
|
|
374
|
+
# Apply split filter to each experiment
|
|
375
|
+
if split_example_ids is not None and "per_example" in result:
|
|
376
|
+
result["per_example"] = {k: v for k, v in result["per_example"].items() if k in split_example_ids}
|
|
377
|
+
all_scores = [v["score"] for v in result["per_example"].values()]
|
|
378
|
+
result["combined_score"] = sum(all_scores) / len(all_scores) if all_scores else 0.0
|
|
379
|
+
result["num_examples"] = len(result["per_example"])
|
|
274
380
|
results_list.append(result)
|
|
275
381
|
|
|
276
382
|
if not results_list:
|
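The two helpers in the read_results.py diff are self-contained and easy to sanity-check outside LangSmith. A minimal sketch (the functions are copied from the diff; the example experiments and scores are made up):

```python
def weighted_score(scores, weights=None):
    """Weighted average of evaluator scores; flat average when no weights."""
    if not scores:
        return 0.0
    if not weights:
        return sum(scores.values()) / len(scores)
    total_weight = 0
    weighted_sum = 0
    for key, val in scores.items():
        w = weights.get(key, 1.0)  # evaluators absent from weights default to 1.0
        weighted_sum += val * w
        total_weight += w
    return weighted_sum / total_weight if total_weight > 0 else 0.0


def pareto_front(candidates):
    """Keep candidates no other candidate beats on every evaluator."""
    if len(candidates) <= 1:
        return candidates
    front = []
    for i, ci in enumerate(candidates):
        dominated = False
        ci_scores = ci.get("evaluator_scores", {})
        if not ci_scores:
            front.append(ci)
            continue
        for j, cj in enumerate(candidates):
            if i == j:
                continue
            cj_scores = cj.get("evaluator_scores", {})
            if not cj_scores:
                continue
            all_geq = True
            any_gt = False
            for key in ci_scores:
                if key in cj_scores:
                    if cj_scores[key] < ci_scores[key]:
                        all_geq = False
                        break
                    if cj_scores[key] > ci_scores[key]:
                        any_gt = True
            if all_geq and any_gt:  # cj dominates ci
                dominated = True
                break
        if not dominated:
            front.append(ci)
    return front if front else candidates[:1]


scores = {"correctness": 1.0, "conciseness": 0.5}
flat = weighted_score(scores)                          # (1.0 + 0.5) / 2 = 0.75
weighted = weighted_score(scores, {"correctness": 3})  # (3*1.0 + 1*0.5) / 4 = 0.875

candidates = [
    {"experiment": "a", "evaluator_scores": {"correctness": 1.0, "conciseness": 0.5}},
    {"experiment": "b", "evaluator_scores": {"correctness": 0.5, "conciseness": 1.0}},
    {"experiment": "c", "evaluator_scores": {"correctness": 0.4, "conciseness": 0.4}},
]
front = [c["experiment"] for c in pareto_front(candidates)]  # ["a", "b"]: "c" is dominated
```

Note that re-weighting shifts the single combined score, while the Pareto front keeps trade-off candidates that a weighted average would discard.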
package/tools/run_eval.py
CHANGED

@@ -72,7 +72,20 @@ def make_target(entry_point, cwd):

     try:
         cmd = entry_point
-        if "{input}" in cmd:
+
+        # {input_text}: extract plain text from inputs dict (for agents expecting --query "text")
+        if "{input_text}" in cmd:
+            import shlex
+            text = ""
+            for key in ("input", "question", "query", "prompt", "text", "user_input"):
+                if key in inputs and isinstance(inputs[key], str):
+                    text = inputs[key]
+                    break
+            if not text and inputs:
+                first_val = next(iter(inputs.values()), "")
+                text = str(first_val) if not isinstance(first_val, str) else first_val
+            cmd = cmd.replace("{input_text}", shlex.quote(text))
+        elif "{input}" in cmd:
             # Placeholder: replace with path to JSON file
             cmd = cmd.replace("{input}", input_path)
         elif "{input_json}" in cmd:

@@ -167,6 +180,7 @@ def main():
     parser.add_argument("--experiment-prefix", required=True, help="Experiment name prefix (e.g. v001a)")
     parser.add_argument("--timeout", type=int, default=120, help="Per-task timeout in seconds")
     parser.add_argument("--concurrency", type=int, default=None, help="Max concurrent evaluations (default: from config or 1)")
+    parser.add_argument("--no-canary", action="store_true", help="Skip canary preflight check")
     args = parser.parse_args()

     with open(args.config) as f:

@@ -187,6 +201,32 @@ def main():
     llm_evaluators = [k for k in config["evaluators"] if k in ("correctness", "conciseness")]
     code_evaluators = [k for k in config["evaluators"] if k not in ("correctness", "conciseness")]

+    # Canary run: verify agent works before burning through full dataset
+    if not args.no_canary:
+        print("  Canary: running 1 example preflight...", file=sys.stderr)
+        try:
+            canary_examples = list(client.list_examples(dataset_name=config["dataset"], limit=1))
+            if canary_examples:
+                canary_result = target(canary_examples[0].inputs)
+                canary_output = canary_result.get("output", "")
+                canary_error = canary_result.get("error", "")
+                if not canary_output and canary_error:
+                    print(f"  CANARY FAILED: Agent produced no output.", file=sys.stderr)
+                    print(f"  Error: {canary_error}", file=sys.stderr)
+                    print(f"  Fix the agent before running full evaluation.", file=sys.stderr)
+                    output = {
+                        "experiment": None,
+                        "prefix": args.experiment_prefix,
+                        "combined_score": 0.0,
+                        "error": f"Canary failed: {canary_error[:200]}",
+                    }
+                    print(json.dumps(output))
+                    sys.exit(2)
+                else:
+                    print(f"  Canary passed: got output ({len(str(canary_output))} chars)", file=sys.stderr)
+        except Exception as e:
+            print(f"  Canary check failed: {e} (proceeding anyway)", file=sys.stderr)
+
     print(f"Running evaluation: {args.experiment_prefix}")
     print(f"  Dataset: {config['dataset']}")
     print(f"  Worktree: {args.worktree_path}")
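The `{input_text}` handling in the run_eval.py diff is easy to exercise on its own. A sketch of the same extraction and quoting logic, pulled out of `make_target` into a hypothetical `fill_input_text` helper:

```python
import shlex

def fill_input_text(cmd, inputs):
    """Substitute {input_text} with a shell-quoted plain-text field from inputs."""
    text = ""
    # Prefer well-known text fields, in the same order run_eval.py checks them
    for key in ("input", "question", "query", "prompt", "text", "user_input"):
        if key in inputs and isinstance(inputs[key], str):
            text = inputs[key]
            break
    if not text and inputs:
        # Fall back to the first value, stringified if it is not already a str
        first_val = next(iter(inputs.values()), "")
        text = str(first_val) if not isinstance(first_val, str) else first_val
    return cmd.replace("{input_text}", shlex.quote(text))

cmd = fill_input_text("python agent.py --query {input_text}", {"question": "what's new?"})
args = shlex.split(cmd)  # quoting survives the round trip: last arg is "what's new?"
```

`shlex.quote` is what makes this safe to pass through a shell: apostrophes and spaces in the example question arrive at the agent as a single argument.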
package/tools/secret_filter.py
ADDED

@@ -0,0 +1,97 @@
+#!/usr/bin/env python3
+"""Secret detection and filtering for eval datasets.
+
+Detects API keys, tokens, passwords, and other sensitive data in text.
+Used by seed_from_traces.py and dataset_health.py.
+
+Usage:
+    echo "text with sk-ant-api..." | python3 secret_filter.py
+    python3 secret_filter.py < file.txt
+
+Stdlib-only — no external dependencies.
+"""
+
+import re
+import json
+import sys
+
+
+SECRET_PATTERNS = re.compile(
+    r'('
+    r'sk-ant-api\S{20,}'
+    r'|sk-or-v1-\S{20,}'
+    r'|sk-\S{20,}'
+    r'|ghp_\S{20,}'
+    r'|gho_\S{20,}'
+    r'|github_pat_\S{20,}'
+    r'|xoxb-\S{20,}'
+    r'|xapp-\S{20,}'
+    r'|ntn_\S{20,}'
+    r'|AKIA[A-Z0-9]{16}'
+    r'|Bearer\s+[A-Za-z0-9\-._~+/]{20,}'
+    r'|-----BEGIN\s+(RSA\s+)?PRIVATE\s+KEY-----'
+    r')',
+    re.IGNORECASE,
+)
+
+ENV_PATTERNS = re.compile(
+    r'(?:ANTHROPIC_API_KEY|OPENAI_API_KEY|LANGSMITH_API_KEY|LANGCHAIN_API_KEY'
+    r'|AWS_SECRET_ACCESS_KEY|DATABASE_URL|POSTGRES_PASSWORD'
+    r'|SLACK_TOKEN|GITHUB_TOKEN|API_KEY|SECRET_KEY'
+    r')\s*[=:]\s*["\']?\S{10,}',
+    re.IGNORECASE,
+)
+
+ASSIGN_PATTERNS = re.compile(
+    r'(?:password|secret|token|api_key|apikey)\s*[=:]\s*["\']?\S{10,}',
+    re.IGNORECASE,
+)
+
+
+def detect_secrets(text):
+    """Return list of secret matches found in text."""
+    if not text:
+        return []
+    findings = []
+    for pattern, name in [
+        (SECRET_PATTERNS, "secret_key"),
+        (ENV_PATTERNS, "env_variable"),
+        (ASSIGN_PATTERNS, "assignment"),
+    ]:
+        for m in pattern.finditer(text):
+            match_text = m.group()
+            redacted = match_text[:10] + "..." + match_text[-4:] if len(match_text) > 20 else match_text
+            findings.append({
+                "pattern": name,
+                "match": redacted,
+                "position": m.start(),
+            })
+    return findings
+
+
+def has_secrets(text):
+    """Quick boolean check — does text contain any secrets?"""
+    if not text:
+        return False
+    return bool(SECRET_PATTERNS.search(text) or ENV_PATTERNS.search(text) or ASSIGN_PATTERNS.search(text))
+
+
+def redact_secrets(text):
+    """Replace detected secrets with [REDACTED]."""
+    if not text:
+        return text
+    text = SECRET_PATTERNS.sub("[REDACTED]", text)
+    text = ENV_PATTERNS.sub("[REDACTED]", text)
+    text = ASSIGN_PATTERNS.sub("[REDACTED]", text)
+    return text
+
+
+if __name__ == "__main__":
+    text = sys.stdin.read()
+    findings = detect_secrets(text)
+    if findings:
+        print(json.dumps({"has_secrets": True, "count": len(findings), "findings": findings}, indent=2))
+        sys.exit(1)
+    else:
+        print(json.dumps({"has_secrets": False, "count": 0}))
+        sys.exit(0)
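A quick smoke test of the detection approach. This sketch uses only a subset of `SECRET_PATTERNS` (the key-prefix rules); the full module also matches env-variable and assignment forms:

```python
import re

# Subset of SECRET_PATTERNS from secret_filter.py (key-prefix rules only)
SECRET_PATTERNS = re.compile(
    r'(sk-\S{20,}|ghp_\S{20,}|AKIA[A-Z0-9]{16})',
    re.IGNORECASE,
)

def has_secrets(text):
    """True when text contains a recognizable key-shaped token."""
    return bool(text) and bool(SECRET_PATTERNS.search(text))

def redact_secrets(text):
    """Replace any match with [REDACTED], mirroring the full module."""
    return SECRET_PATTERNS.sub("[REDACTED]", text) if text else text

leaky = "call with " + "sk-" + "a" * 24 + " as the key"
clean = redact_secrets(leaky)  # "call with [REDACTED] as the key"
```

The `\S{20,}` length floor is what keeps ordinary prose like "sk-8" from tripping the filter while still catching real-length tokens.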
package/tools/seed_from_traces.py
CHANGED

@@ -22,6 +22,14 @@ import sys
 from collections import Counter
 from datetime import datetime, timezone

+# Secret detection (local import from same directory)
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+try:
+    from secret_filter import has_secrets
+except ImportError:
+    def has_secrets(text):
+        return False
+

 def extract_input(run):
     """Extract user input from a run's inputs field."""

@@ -118,9 +126,16 @@ def analyze_runs(runs):
     token_counts = []
     feedbacks = {"positive": 0, "negative": 0, "none": 0}

+    secrets_filtered = 0
     for run in runs:
         user_input = extract_input(run)
         output = extract_output(run)
+
+        # Skip runs containing secrets (API keys, tokens, passwords)
+        if has_secrets(str(user_input or '')) or has_secrets(str(output or '')):
+            secrets_filtered += 1
+            continue
+
         error = run.get("error")
         tokens = run.get("total_tokens") or 0
         latency_ms = None
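The skip-and-count behavior added to `analyze_runs` can be checked with stub data. A sketch with a stand-in `has_secrets` and hypothetical run dicts (the real code imports `has_secrets` from `secret_filter.py`):

```python
def has_secrets(text):
    # Stand-in for secret_filter.has_secrets: flag anything carrying an sk- token
    return "sk-" in text

def count_clean_runs(runs):
    """Mirror the analyze_runs filter: skip runs whose input or output leaks a secret."""
    secrets_filtered = 0
    kept = 0
    for run in runs:
        user_input = run.get("input")
        output = run.get("output")
        # Same guard as the diff: None becomes '' before scanning
        if has_secrets(str(user_input or '')) or has_secrets(str(output or '')):
            secrets_filtered += 1
            continue
        kept += 1
    return kept, secrets_filtered

runs = [
    {"input": "summarize this doc", "output": "done"},
    {"input": "use key sk-abc123", "output": "ok"},
    {"input": None, "output": "fine"},
]
kept, filtered = count_clean_runs(runs)  # kept=2, filtered=1
```

The `str(x or '')` wrapping matters: mined runs can have `None` inputs, and the filter must treat those as clean rather than crash.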
package/tools/setup.py
CHANGED

@@ -180,9 +180,24 @@ def create_dataset_from_file(client, dataset_name, file_path):
         elif "expected" in item:
             ex["outputs"] = {"expected": item["expected"]}

+        # Include rubric/expected behavior in metadata
+        if "expected_behavior" in item:
+            if "metadata" not in ex:
+                ex["metadata"] = {}
+            ex["metadata"]["expected_behavior"] = item["expected_behavior"]
+
+        # Include difficulty and category in metadata
+        for field in ("difficulty", "category"):
+            if field in item:
+                if "metadata" not in ex:
+                    ex["metadata"] = {}
+                ex["metadata"][field] = item[field]
+
         # Include metadata
-        if "metadata" in item:
+        if "metadata" in item and "metadata" not in ex:
             ex["metadata"] = item["metadata"]
+        elif "metadata" in item:
+            ex["metadata"].update(item["metadata"])

         if "metadata" not in ex:
             ex["metadata"] = {}

@@ -548,6 +563,7 @@ def main():
         "project_dir": project_dir,
         "entry_point": entry_point,
         "evaluators": evaluator_keys,
+        "evaluator_weights": None,
         "optimization_goals": goals,
         "production_project": args.production_project,
         "baseline_experiment": baseline_experiment,
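The metadata handling in the setup.py diff promotes `expected_behavior`, `difficulty`, and `category` into example metadata first, then folds in any explicit `metadata` dict without clobbering them. A sketch of the same merge order (the `build_example_metadata` helper and sample item are hypothetical):

```python
def build_example_metadata(item):
    """Mirror setup.py's merge: promoted fields first, then item['metadata'] merged in."""
    ex = {}
    if "expected_behavior" in item:
        ex.setdefault("metadata", {})["expected_behavior"] = item["expected_behavior"]
    for field in ("difficulty", "category"):
        if field in item:
            ex.setdefault("metadata", {})[field] = item[field]
    # If promoted fields already created the dict, merge instead of replace
    if "metadata" in item and "metadata" not in ex:
        ex["metadata"] = item["metadata"]
    elif "metadata" in item:
        ex["metadata"].update(item["metadata"])
    if "metadata" not in ex:
        ex["metadata"] = {}
    return ex["metadata"]

meta = build_example_metadata({
    "input": "refund a duplicate charge",
    "expected_behavior": "offers refund and apologizes",
    "difficulty": "hard",
    "metadata": {"source": "support-traces"},
})
# All three sources land in one dict:
# {"expected_behavior": "offers refund and apologizes", "difficulty": "hard", "source": "support-traces"}
```

This is the fix the `if ... and "metadata" not in ex` change implements: before it, a top-level `metadata` key on the item would overwrite the just-promoted rubric fields.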