agentv 4.26.1 → 4.27.0-next.1
- package/dist/{chunk-JA4WQNE6.js → chunk-47JX7NNZ.js} +10 -2
- package/dist/chunk-47JX7NNZ.js.map +1 -0
- package/dist/{chunk-XBUHMRX2.js → chunk-V3LWJB5X.js} +431 -49
- package/dist/chunk-V3LWJB5X.js.map +1 -0
- package/dist/cli.js +2 -2
- package/dist/index.js +2 -2
- package/dist/{interactive-YMKWKPD7.js → interactive-L6PIIFNQ.js} +2 -2
- package/dist/skills/agentv-bench/LICENSE.txt +202 -0
- package/dist/skills/agentv-bench/SKILL.md +459 -0
- package/dist/skills/agentv-bench/agents/analyzer.md +177 -0
- package/dist/skills/agentv-bench/agents/comparator.md +247 -0
- package/dist/skills/agentv-bench/agents/executor.md +30 -0
- package/dist/skills/agentv-bench/agents/grader.md +238 -0
- package/dist/skills/agentv-bench/agents/mutator.md +172 -0
- package/dist/skills/agentv-bench/references/autoresearch.md +309 -0
- package/dist/skills/agentv-bench/references/description-optimization.md +66 -0
- package/dist/skills/agentv-bench/references/environment-adaptation.md +82 -0
- package/dist/skills/agentv-bench/references/eval-yaml-spec.md +338 -0
- package/dist/skills/agentv-bench/references/migrating-from-skill-creator.md +103 -0
- package/dist/skills/agentv-bench/references/schemas.md +432 -0
- package/dist/skills/agentv-bench/references/subagent-pipeline.md +181 -0
- package/dist/skills/agentv-bench/scripts/trajectory.html +462 -0
- package/dist/skills/agentv-eval-review/SKILL.md +53 -0
- package/dist/skills/agentv-eval-review/scripts/lint_eval.py +239 -0
- package/dist/skills/agentv-eval-writer/SKILL.md +707 -0
- package/dist/skills/agentv-eval-writer/references/config-schema.json +63 -0
- package/dist/skills/agentv-eval-writer/references/custom-evaluators.md +119 -0
- package/dist/skills/agentv-eval-writer/references/eval-schema.json +19077 -0
- package/dist/skills/agentv-eval-writer/references/rubric-evaluator.md +114 -0
- package/dist/skills/agentv-governance/SKILL.md +79 -0
- package/dist/skills/agentv-governance/references/eu-ai-act-risk-tiers.md +37 -0
- package/dist/skills/agentv-governance/references/governance-yaml-shape.md +125 -0
- package/dist/skills/agentv-governance/references/iso-42001-controls.md +46 -0
- package/dist/skills/agentv-governance/references/lint-rules.md +169 -0
- package/dist/skills/agentv-governance/references/mitre-atlas.md +38 -0
- package/dist/skills/agentv-governance/references/owasp-agentic-top-10-2025.md +28 -0
- package/dist/skills/agentv-governance/references/owasp-llm-top-10-2025.md +25 -0
- package/dist/skills/agentv-trace-analyst/SKILL.md +161 -0
- package/package.json +1 -1
- package/dist/chunk-JA4WQNE6.js.map +0 -1
- package/dist/chunk-XBUHMRX2.js.map +0 -1
- package/dist/{interactive-YMKWKPD7.js.map → interactive-L6PIIFNQ.js.map} +0 -0
@@ -0,0 +1,462 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="UTF-8">
+<meta name="viewport" content="width=device-width, initial-scale=1.0">
+<meta http-equiv="refresh" content="2"> <!-- __AUTO_REFRESH__ -->
+<title>AgentV Autoresearch Trajectory</title>
+<script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
+<style>
+/*
+ * Visual language aligned with Studio DESIGN.md:
+ * - bg-gray-950 (#030712) canvas, bg-gray-900 (#111827) containers
+ * - border-gray-800 (#1f2937) borders, divide-gray-800/50 row dividers
+ * - Cyan-only accent (#22d3ee), emerald/red for data status
+ * - System sans-serif, tabular-nums on every number
+ * - font-medium (500) max for headings, no bold (700)
+ * - rounded-lg (8px) containers, rounded-md (6px) badges/buttons
+ * - No drop shadows — borders carry elevation
+ */
+*, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
+
+body {
+  font-family: ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont,
+    "Segoe UI", Roboto, "Helvetica Neue", Arial, sans-serif;
+  background: #030712; /* bg-gray-950 */
+  color: #d1d5db; /* text-gray-300 */
+  padding: 24px;
+  line-height: 1.5;
+  font-size: 0.875rem; /* text-sm default */
+}
+
+h1 {
+  text-align: center;
+  font-size: 1.5rem; /* text-2xl */
+  font-weight: 600; /* font-semibold — ceiling */
+  color: #fff; /* text-white */
+  margin-bottom: 8px;
+}
+
+.subtitle {
+  text-align: center;
+  font-size: 0.875rem;
+  color: #6b7280; /* text-gray-500 */
+  margin-bottom: 24px;
+}
+
+/* Summary cards */
+.summary {
+  display: grid;
+  grid-template-columns: repeat(auto-fit, minmax(150px, 1fr));
+  gap: 12px;
+  margin-bottom: 24px; /* space-y-6 */
+  max-width: 900px;
+  margin-left: auto;
+  margin-right: auto;
+}
+
+.card {
+  background: #111827; /* bg-gray-900 */
+  border: 1px solid #1f2937; /* border-gray-800 */
+  border-radius: 8px; /* rounded-lg */
+  padding: 14px 16px;
+  text-align: center;
+}
+
+.card .label {
+  font-size: 0.75rem; /* text-xs */
+  text-transform: uppercase;
+  letter-spacing: 0.05em; /* tracking-wider */
+  color: #6b7280; /* text-gray-500 */
+  margin-bottom: 4px;
+}
+
+.card .value {
+  font-size: 1.5rem;
+  font-weight: 500; /* font-medium — NOT 700 */
+  color: #fff;
+  font-variant-numeric: tabular-nums;
+}
+
+.card .value.positive { color: #34d399; } /* text-emerald-400 */
+.card .value.negative { color: #f87171; } /* text-red-400 */
+.card .value.neutral { color: #22d3ee; } /* text-cyan-400 */
+
+/* Charts grid */
+.charts {
+  display: grid;
+  grid-template-columns: repeat(auto-fit, minmax(380px, 1fr));
+  gap: 16px; /* space-y-4 */
+  margin-bottom: 24px;
+  max-width: 1400px;
+  margin-left: auto;
+  margin-right: auto;
+}
+
+.chart-container {
+  background: #111827;
+  border: 1px solid #1f2937;
+  border-radius: 8px;
+  padding: 16px;
+}
+
+.chart-container h2 {
+  font-size: 0.875rem;
+  font-weight: 500; /* font-medium */
+  color: #9ca3af; /* text-gray-400 */
+  margin-bottom: 12px;
+  text-align: center;
+}
+
+canvas { width: 100% !important; }
+
+/* Iterations table */
+.table-section {
+  max-width: 1400px;
+  margin: 0 auto;
+}
+
+.table-section h2 {
+  font-size: 1.25rem; /* text-xl */
+  font-weight: 600; /* font-semibold */
+  color: #fff; /* text-white */
+  margin-bottom: 12px;
+}
+
+.table-wrapper {
+  overflow-x: auto;
+  border-radius: 8px; /* rounded-lg */
+  border: 1px solid #1f2937;
+}
+
+table {
+  width: 100%;
+  border-collapse: collapse;
+  font-size: 0.875rem; /* text-sm */
+  text-align: left;
+}
+
+thead {
+  background: rgba(17, 24, 39, 0.5); /* bg-gray-900/50 */
+  border-bottom: 1px solid #1f2937;
+}
+
+th {
+  padding: 12px 16px; /* px-4 py-3 */
+  text-align: left;
+  font-weight: 500; /* font-medium */
+  color: #9ca3af; /* text-gray-400 */
+  white-space: nowrap;
+}
+
+td {
+  padding: 12px 16px; /* px-4 py-3 */
+  color: #d1d5db; /* text-gray-300 */
+}
+
+tbody tr { border-top: 1px solid rgba(31, 41, 55, 0.5); } /* divide-gray-800/50 */
+tbody tr:hover { background: rgba(17, 24, 39, 0.3); } /* hover:bg-gray-900/30 */
+
+td.num {
+  text-align: right;
+  font-variant-numeric: tabular-nums;
+  color: #9ca3af; /* text-gray-400 for numbers */
+}
+
+.badge {
+  display: inline-block;
+  padding: 2px 8px;
+  border-radius: 6px; /* rounded-md */
+  font-size: 0.75rem;
+  font-weight: 500; /* font-medium */
+  text-transform: uppercase;
+}
+
+.badge.keep {
+  border: 1px solid rgba(6, 78, 59, 0.6); /* border-emerald-900/60 */
+  background: rgba(6, 78, 59, 0.3); /* bg-emerald-950/30 */
+  color: #34d399; /* text-emerald-400 */
+}
+.badge.discard {
+  border: 1px solid rgba(127, 29, 29, 0.6); /* border-red-900/60 */
+  background: rgba(127, 29, 29, 0.3); /* bg-red-950/30 */
+  color: #f87171; /* text-red-400 */
+}
+
+.empty-state {
+  text-align: center;
+  padding: 32px 20px; /* p-8 for empty states */
+  color: #4b5563; /* text-gray-600 */
+  font-size: 0.875rem;
+}
+
+.mutation-cell {
+  max-width: 320px;
+  overflow: hidden;
+  text-overflow: ellipsis;
+  white-space: nowrap;
+}
+</style>
+</head>
+<body>
+
+<h1>AgentV Autoresearch Trajectory</h1>
+<div class="subtitle" id="subtitle"></div>
+
+<!-- Summary cards -->
+<div class="summary" id="summary"></div>
+
+<!-- Charts -->
+<div class="charts" id="charts-area">
+  <div class="chart-container">
+    <h2>Score over Iterations</h2>
+    <canvas id="scoreChart"></canvas>
+  </div>
+  <div class="chart-container">
+    <h2>Per-Assertion Pass Rates</h2>
+    <canvas id="assertionChart"></canvas>
+  </div>
+  <div class="chart-container">
+    <h2>Cumulative Cost (USD)</h2>
+    <canvas id="costChart"></canvas>
+  </div>
+</div>
+
+<!-- Iterations table -->
+<div class="table-section">
+  <h2>Iteration Log</h2>
+  <div class="table-wrapper" id="table-wrapper"></div>
+</div>
+
+<script>
+// ── Data loading ─────────────────────────────────────────────────────
+// Fetch iterations.jsonl from the same directory. Each line is one JSON object.
+// Auto-refreshes every 2s (via the meta tag) so the chart stays live.
+async function loadData() {
+  try {
+    const resp = await fetch('iterations.jsonl');
+    if (!resp.ok) return [];
+    const text = await resp.text();
+    return text.trim().split('\n').filter(Boolean).map(line => JSON.parse(line));
+  } catch (_) {
+    return [];
+  }
+}
+
+// ── Helpers ────────────────────────────────────────────────────────────
+const fmtPct = (n) => (n == null ? '—' : Math.round(Number(n) * 100) + '%');
+const fmtPctDelta = (n) => (n == null ? '—' : (n >= 0 ? '+' : '') + Math.round(Number(n) * 100) + '%');
+const fmtUsd = (n) => (n == null ? '—' : '$' + Number(n).toFixed(4));
+
+// ── Chart.js defaults (aligned with Studio DESIGN.md) ─────────────────
+Chart.defaults.color = '#9ca3af'; // text-gray-400
+Chart.defaults.borderColor = 'rgba(31, 41, 55, 0.5)'; // border-gray-800/50
+Chart.defaults.font.family = 'ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif';
+
+// Cyan accent + Studio data-tone palette
+const CHART_COLORS = [
+  '#22d3ee', // cyan-400 (primary accent)
+  '#34d399', // emerald-400
+  '#f472b6', // pink-400
+  '#facc15', // yellow-400
+  '#c084fc', // purple-400
+  '#fb923c', // orange-400
+  '#a3e635', // lime-400
+  '#e879f9', // fuchsia-400
+  '#60a5fa', // blue-400
+  '#fbbf24', // amber-400
+];
+
+// ── Render ─────────────────────────────────────────────────────────────
+function render(data) {
+  if (!data || data.length === 0) {
+    document.getElementById('subtitle').textContent = 'Waiting for data…';
+    document.getElementById('charts-area').style.display = 'none';
+    document.getElementById('summary').innerHTML = '';
+    document.getElementById('table-wrapper').innerHTML =
+      '<div class="empty-state">No iterations recorded yet. The chart will update automatically.</div>';
+    return;
+  }
+
+  // Validate each iteration has at minimum a cycle and score
+  const valid = data.filter(it => it && typeof it.cycle === 'number' && typeof it.score === 'number');
+  if (valid.length === 0) {
+    document.getElementById('subtitle').textContent = 'Data present but missing required fields (cycle, score).';
+    document.getElementById('charts-area').style.display = 'none';
+    document.getElementById('summary').innerHTML = '';
+    document.getElementById('table-wrapper').innerHTML =
+      '<div class="empty-state">Iterations data is malformed. Each entry needs at minimum: cycle (number), score (number).</div>';
+    return;
+  }
+  const sorted = [...valid].sort((a, b) => a.cycle - b.cycle);
+
+  // ── Summary ──────────────────────────────────────────────────────────
+  const originalScore = sorted[0].score;
+  const bestIteration = sorted.reduce((best, it) => (it.score > best.score ? it : best), sorted[0]);
+  const bestScore = bestIteration.score;
+  const totalCycles = sorted.length;
+  const totalCost = sorted.reduce((sum, it) => sum + (parseFloat(it.cost_usd) || 0), 0);
+  const delta = bestScore - originalScore;
+
+  document.getElementById('subtitle').textContent =
+    `${totalCycles} cycle${totalCycles !== 1 ? 's' : ''} completed`;
+
+  const deltaClass = delta > 0 ? 'positive' : delta < 0 ? 'negative' : 'neutral';
+
+  document.getElementById('summary').innerHTML = `
+    <div class="card"><div class="label">Original Score</div><div class="value neutral">${fmtPct(originalScore)}</div></div>
+    <div class="card"><div class="label">Best Score</div><div class="value positive">${fmtPct(bestScore)}</div></div>
+    <div class="card"><div class="label">Improvement</div><div class="value ${deltaClass}">${fmtPctDelta(delta)}</div></div>
+    <div class="card"><div class="label">Total Cycles</div><div class="value">${totalCycles}</div></div>
+    <div class="card"><div class="label">Total Cost</div><div class="value">${fmtUsd(totalCost)}</div></div>
+  `;
+
+  // ── Chart data prep ──────────────────────────────────────────────────
+  const labels = sorted.map(it => it.cycle);
+  const scores = sorted.map(it => it.score);
+  const decisions = sorted.map(it => (it.decision || '').toLowerCase());
+
+  // Cumulative cost (parseFloat guards against string or null values)
+  let cumCost = 0;
+  const cumCosts = sorted.map(it => { cumCost += parseFloat(it.cost_usd) || 0; return cumCost; });
+
+  // Assertion keys (union of all)
+  const assertionKeys = [...new Set(sorted.flatMap(it => Object.keys(it.assertions || {})))].sort();
+
+  document.getElementById('charts-area').style.display = '';
+
+  // ── 1. Score chart (cyan line, emerald/red decision dots) ──────────
+  const scoresPct = scores.map(s => s * 100);
+  const scoreCtx = document.getElementById('scoreChart').getContext('2d');
+  new Chart(scoreCtx, {
+    type: 'line',
+    data: {
+      labels,
+      datasets: [{
+        label: 'Score',
+        data: scoresPct,
+        borderColor: '#22d3ee', // cyan-400
+        backgroundColor: 'rgba(34, 211, 238, 0.1)', // cyan-400/10
+        borderWidth: 2,
+        tension: 0.2,
+        fill: true,
+        pointRadius: 6,
+        pointBorderWidth: 2,
+        pointBackgroundColor: decisions.map(d => d === 'keep' ? '#34d399' : '#f87171'),
+        pointBorderColor: decisions.map(d => d === 'keep' ? '#065f46' : '#7f1d1d'),
+      }],
+    },
+    options: {
+      responsive: true,
+      scales: {
+        y: { min: 0, max: 100, ticks: { stepSize: 10, callback: v => v + '%' } },
+        x: { title: { display: true, text: 'Cycle' } },
+      },
+      plugins: {
+        legend: { display: false },
+        tooltip: {
+          callbacks: {
+            label: (ctx) => `Score: ${Math.round(ctx.parsed.y)}%`,
+            afterLabel: (ctx) => `Decision: ${decisions[ctx.dataIndex] || '—'}`,
+          },
+        },
+      },
+    },
+  });
+
+  // ── 2. Assertion chart ───────────────────────────────────────────────
+  const assertCtx = document.getElementById('assertionChart').getContext('2d');
+  new Chart(assertCtx, {
+    type: 'line',
+    data: {
+      labels,
+      datasets: assertionKeys.map((key, i) => ({
+        label: key,
+        data: sorted.map(it => { const v = (it.assertions || {})[key]; return v == null ? null : v * 100; }),
+        borderColor: CHART_COLORS[i % CHART_COLORS.length],
+        backgroundColor: 'transparent',
+        borderWidth: 2,
+        tension: 0.2,
+        pointRadius: 3,
+        spanGaps: true,
+      })),
+    },
+    options: {
+      responsive: true,
+      scales: {
+        y: { min: 0, max: 100, ticks: { stepSize: 20, callback: v => v + '%' } },
+        x: { title: { display: true, text: 'Cycle' } },
+      },
+      plugins: {
+        legend: { position: 'bottom', labels: { boxWidth: 12, padding: 10, font: { size: 11 } } },
+        tooltip: { callbacks: { label: (ctx) => `${ctx.dataset.label}: ${Math.round(ctx.parsed.y)}%` } },
+      },
+    },
+  });
+
+  // ── 3. Cost chart (cyan line) ─────────────────────────────────────
+  const costCtx = document.getElementById('costChart').getContext('2d');
+  new Chart(costCtx, {
+    type: 'line',
+    data: {
+      labels,
+      datasets: [{
+        label: 'Cumulative Cost (USD)',
+        data: cumCosts,
+        borderColor: '#22d3ee', // cyan-400
+        backgroundColor: 'rgba(34, 211, 238, 0.08)', // cyan-400/8
+        borderWidth: 2,
+        tension: 0.2,
+        fill: true,
+        pointRadius: 3,
+      }],
+    },
+    options: {
+      responsive: true,
+      scales: {
+        y: { ticks: { callback: v => '$' + v.toFixed(2) } },
+        x: { title: { display: true, text: 'Cycle' } },
+      },
+      plugins: { legend: { display: false } },
+    },
+  });
+
+  // ── Table ────────────────────────────────────────────────────────────
+  const rows = sorted.map(it => {
+    const dec = (it.decision || '').toLowerCase();
+    const badge = dec === 'keep'
+      ? '<span class="badge keep">keep</span>'
+      : '<span class="badge discard">drop</span>';
+    const ts = it.timestamp
+      ? new Date(it.timestamp).toLocaleString(undefined, { dateStyle: 'short', timeStyle: 'medium' })
+      : '—';
+    return `<tr>
+      <td class="num">${it.cycle}</td>
+      <td class="num">${fmtPct(it.score)}</td>
+      <td>${badge}</td>
+      <td class="mutation-cell" title="${(it.mutation || '').replace(/"/g, '&quot;')}">${it.mutation || '—'}</td>
+      <td class="num">${fmtUsd(it.cost_usd)}</td>
+      <td style="color:#6b7280">${ts}</td>
+    </tr>`;
+  }).join('');
+
+  document.getElementById('table-wrapper').innerHTML = `
+    <table>
+      <thead><tr>
+        <th style="text-align:right">Cycle</th>
+        <th style="text-align:right">Score</th>
+        <th>Decision</th>
+        <th>Mutation</th>
+        <th style="text-align:right">Cost</th>
+        <th>Timestamp</th>
+      </tr></thead>
+      <tbody>${rows}</tbody>
+    </table>
+  `;
+}
+
+loadData().then(render);
+</script>
+
+</body>
+</html>
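The data-loading and summary arithmetic in the trajectory page above can be sketched outside the browser as follows. This is a minimal sketch, not the packaged code: `parseJsonl` and `summarize` are hypothetical names, and the sample records are made up. Skipping malformed lines is a swapped-in behaviour; the page's own `loadData` returns an empty array as soon as any line fails to parse. The field names `cycle`, `score`, `decision`, and `cost_usd` are taken from the page's script.

```javascript
// Hypothetical helper: parse JSONL text, skipping blank or malformed lines
// instead of discarding the whole file on the first parse error.
function parseJsonl(text) {
  const records = [];
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    try {
      records.push(JSON.parse(trimmed));
    } catch (_) {
      // A line that is still being written may be truncated; ignore it.
    }
  }
  return records;
}

// Mirrors the summary arithmetic in render(): keep records with a numeric
// `cycle` and `score`, sort by cycle, then derive the headline numbers.
function summarize(records) {
  const valid = records.filter(
    (it) => it && typeof it.cycle === 'number' && typeof it.score === 'number'
  );
  if (valid.length === 0) return null;
  const sorted = [...valid].sort((a, b) => a.cycle - b.cycle);
  const originalScore = sorted[0].score;
  const bestScore = sorted.reduce((best, it) => Math.max(best, it.score), originalScore);
  const totalCost = sorted.reduce((sum, it) => sum + (parseFloat(it.cost_usd) || 0), 0);
  return {
    totalCycles: sorted.length,
    originalScore,
    bestScore,
    delta: bestScore - originalScore,
    totalCost,
  };
}

// Made-up sample data; the last line is deliberately truncated.
const sample = [
  '{"cycle": 2, "score": 0.75, "decision": "keep", "cost_usd": 0.5}',
  '{"cycle": 1, "score": 0.5, "decision": "keep", "cost_usd": 0.25}',
  '{"cycle": 3, "score": 0.6',
].join('\n');

const summary = summarize(parseJsonl(sample));
console.log(summary);
```

Because the page reloads every two seconds while the file may still be growing, tolerating a truncated final line keeps the earlier cycles visible; whether that trade-off suits a given run is a judgment call.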
@@ -0,0 +1,53 @@
+---
+name: agentv-eval-review
+description: >-
+  Use when reviewing eval YAML files for quality issues, linting eval files before
+  committing, checking eval schema compliance, or when asked to "review these evals",
+  "check eval quality", "lint eval files", or "validate eval structure".
+  Do NOT use for writing evals (use agentv-eval-writer) or running evals (use agentv-bench).
+---
+
+# Eval Review
+
+## Overview
+
+Lint and review AgentV eval YAML files for structural issues, schema compliance, and quality problems. Apply this checklist deterministically first, then layer LLM judgment for semantic issues a checklist cannot catch.
+
+## Process
+
+### Step 1: Structural checklist
+
+Walk every target eval file and report violations grouped by severity (error > warning > info). For each finding, include the file path and a concrete fix.
+
+- File extension is `.eval.yaml` (error if not).
+- `description` field is present at the top level (error if missing).
+- Each entry under `tests` has `id`, `input`, and at least one of `criteria` / `expected_output` / `assertions` (error if missing).
+- File-typed inputs (`type: file`) use a leading `/` in their `path` (error if relative).
+- Tests have an `assertions` block — flag tests that rely solely on `expected_output` (warning).
+- Detect `expected_output` prose patterns like "The agent should…" or "The output is…" (warning — prose belongs in `criteria`, structured matches in `assertions`).
+- Identical file inputs repeated across multiple tests in the same eval should be hoisted to a top-level `input` (info).
+- Eval files in the same directory should share a common `id` prefix (info — flag drift).
+
+### Step 2: Semantic review (LLM judgment)
+
+The structural checklist catches mechanical issues but cannot assess:
+- **Factual accuracy** — Do tool/command names in expected_output match what the skill documents?
+- **Coverage gaps** — Are important edge cases missing?
+- **Assertion discriminability** — Would assertions pass for both good and bad output?
+- **Cross-file consistency** — Do output filenames match across evals and skills?
+
+Read the relevant SKILL.md files and cross-check against the eval content for these issues.
+
+## Accessing reference files
+
+To load a specific reference without pulling the entire skill into context:
+
+```bash
+agentv skills get agentv-eval-review --ref <filename>
+```
+
+Or resolve the skill directory and read files directly:
+
+```bash
+cat $(agentv skills path agentv-eval-review)/references/<filename>.md
+```
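The mechanical part of the Step 1 checklist above lends itself to a small deterministic function. The sketch below is illustrative only and is not the packaged `scripts/lint_eval.py`: the function name `lintEval`, the shape of a finding, and the regular expressions are all assumptions, and it presumes the eval file has already been parsed from YAML into a plain object. It covers the extension, `description`, per-test field, and prose-pattern checks; the file-path, hoisting, and `id`-prefix checks are left out because the shape of `input` is not shown in this diff.

```javascript
// Illustrative only: `lintEval`, the finding shape, and PROSE_PATTERNS are assumptions.
const SEVERITY_RANK = { error: 0, warning: 1, info: 2 };
const PROSE_PATTERNS = [/\bThe agent should\b/i, /\bThe output is\b/i];

// `evalDoc` is assumed to be the eval file already parsed from YAML.
function lintEval(filePath, evalDoc) {
  const findings = [];
  const add = (severity, message) => findings.push({ severity, file: filePath, message });
  const doc = evalDoc || {};

  if (!filePath.endsWith('.eval.yaml')) {
    add('error', 'Rename the file so it ends in .eval.yaml.');
  }
  if (doc.description == null || doc.description === '') {
    add('error', 'Add a top-level `description` field.');
  }

  const tests = Array.isArray(doc.tests) ? doc.tests : [];
  tests.forEach((rawTest, index) => {
    const test = rawTest || {};
    const name = test.id || `tests[${index}]`;
    const has = (key) => test[key] != null;

    if (!has('id')) add('error', `${name}: add an \`id\`.`);
    if (!has('input')) add('error', `${name}: add an \`input\`.`);
    if (!has('criteria') && !has('expected_output') && !has('assertions')) {
      add('error', `${name}: add \`criteria\`, \`expected_output\`, or \`assertions\`.`);
    }
    // One reading of "rely solely on expected_output": it is the only check present.
    if (has('expected_output') && !has('criteria') && !has('assertions')) {
      add('warning', `${name}: add an \`assertions\` block alongside \`expected_output\`.`);
    }
    if (typeof test.expected_output === 'string' &&
        PROSE_PATTERNS.some((pattern) => pattern.test(test.expected_output))) {
      add('warning', `${name}: move prose from \`expected_output\` into \`criteria\`.`);
    }
  });

  // Group by severity: error > warning > info.
  return findings.sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity]);
}

// Made-up input: wrong extension, no description, one prose-only test, one empty test.
const findings = lintEval('checks.yaml', {
  tests: [
    { id: 'greets-user', input: 'Say hello', expected_output: 'The agent should greet the user.' },
    { id: 'no-checks', input: 'Say hello' },
  ],
});
for (const f of findings) console.log(`${f.severity}: ${f.message}`);
```

Each finding carries the file path and a fix phrased as an instruction, matching the reporting format Step 1 asks for. The Step 2 questions still need a reader, human or model, because they depend on what the skill under test actually documents.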