agentv 4.26.1 → 4.27.0

Files changed (42)
  1. package/dist/{chunk-XBUHMRX2.js → chunk-PH5MHKPL.js} +431 -49
  2. package/dist/chunk-PH5MHKPL.js.map +1 -0
  3. package/dist/{chunk-JA4WQNE6.js → chunk-VO3THAOI.js} +10 -2
  4. package/dist/chunk-VO3THAOI.js.map +1 -0
  5. package/dist/cli.js +2 -2
  6. package/dist/index.js +2 -2
  7. package/dist/{interactive-YMKWKPD7.js → interactive-UG4YNLYK.js} +2 -2
  8. package/dist/skills/agentv-bench/LICENSE.txt +202 -0
  9. package/dist/skills/agentv-bench/SKILL.md +459 -0
  10. package/dist/skills/agentv-bench/agents/analyzer.md +177 -0
  11. package/dist/skills/agentv-bench/agents/comparator.md +247 -0
  12. package/dist/skills/agentv-bench/agents/executor.md +30 -0
  13. package/dist/skills/agentv-bench/agents/grader.md +238 -0
  14. package/dist/skills/agentv-bench/agents/mutator.md +172 -0
  15. package/dist/skills/agentv-bench/references/autoresearch.md +309 -0
  16. package/dist/skills/agentv-bench/references/description-optimization.md +66 -0
  17. package/dist/skills/agentv-bench/references/environment-adaptation.md +82 -0
  18. package/dist/skills/agentv-bench/references/eval-yaml-spec.md +338 -0
  19. package/dist/skills/agentv-bench/references/migrating-from-skill-creator.md +103 -0
  20. package/dist/skills/agentv-bench/references/schemas.md +432 -0
  21. package/dist/skills/agentv-bench/references/subagent-pipeline.md +181 -0
  22. package/dist/skills/agentv-bench/scripts/trajectory.html +462 -0
  23. package/dist/skills/agentv-eval-review/SKILL.md +53 -0
  24. package/dist/skills/agentv-eval-review/scripts/lint_eval.py +239 -0
  25. package/dist/skills/agentv-eval-writer/SKILL.md +707 -0
  26. package/dist/skills/agentv-eval-writer/references/config-schema.json +63 -0
  27. package/dist/skills/agentv-eval-writer/references/custom-evaluators.md +119 -0
  28. package/dist/skills/agentv-eval-writer/references/eval-schema.json +19077 -0
  29. package/dist/skills/agentv-eval-writer/references/rubric-evaluator.md +114 -0
  30. package/dist/skills/agentv-governance/SKILL.md +79 -0
  31. package/dist/skills/agentv-governance/references/eu-ai-act-risk-tiers.md +37 -0
  32. package/dist/skills/agentv-governance/references/governance-yaml-shape.md +125 -0
  33. package/dist/skills/agentv-governance/references/iso-42001-controls.md +46 -0
  34. package/dist/skills/agentv-governance/references/lint-rules.md +169 -0
  35. package/dist/skills/agentv-governance/references/mitre-atlas.md +38 -0
  36. package/dist/skills/agentv-governance/references/owasp-agentic-top-10-2025.md +28 -0
  37. package/dist/skills/agentv-governance/references/owasp-llm-top-10-2025.md +25 -0
  38. package/dist/skills/agentv-trace-analyst/SKILL.md +161 -0
  39. package/package.json +1 -1
  40. package/dist/chunk-JA4WQNE6.js.map +0 -1
  41. package/dist/chunk-XBUHMRX2.js.map +0 -1
  42. package/dist/{interactive-YMKWKPD7.js.map → interactive-UG4YNLYK.js.map} +0 -0
package/dist/skills/agentv-bench/scripts/trajectory.html
@@ -0,0 +1,462 @@
+ <!DOCTYPE html>
+ <html lang="en">
+ <head>
+ <meta charset="UTF-8">
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
+ <meta http-equiv="refresh" content="2"> <!-- __AUTO_REFRESH__ -->
+ <title>AgentV Autoresearch Trajectory</title>
+ <script src="https://cdn.jsdelivr.net/npm/chart.js"></script>
+ <style>
+ /*
+ * Visual language aligned with Studio DESIGN.md:
+ * - bg-gray-950 (#030712) canvas, bg-gray-900 (#111827) containers
+ * - border-gray-800 (#1f2937) borders, divide-gray-800/50 row dividers
+ * - Cyan-only accent (#22d3ee), emerald/red for data status
+ * - System sans-serif, tabular-nums on every number
+ * - font-medium (500) max for headings, no bold (700)
+ * - rounded-lg (8px) containers, rounded-md (6px) badges/buttons
+ * - No drop shadows — borders carry elevation
+ */
+ *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
+
+ body {
+ font-family: ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont,
+ "Segoe UI", Roboto, "Helvetica Neue", Arial, sans-serif;
+ background: #030712; /* bg-gray-950 */
+ color: #d1d5db; /* text-gray-300 */
+ padding: 24px;
+ line-height: 1.5;
+ font-size: 0.875rem; /* text-sm default */
+ }
+
+ h1 {
+ text-align: center;
+ font-size: 1.5rem; /* text-2xl */
+ font-weight: 600; /* font-semibold — ceiling */
+ color: #fff; /* text-white */
+ margin-bottom: 8px;
+ }
+
+ .subtitle {
+ text-align: center;
+ font-size: 0.875rem;
+ color: #6b7280; /* text-gray-500 */
+ margin-bottom: 24px;
+ }
+
+ /* Summary cards */
+ .summary {
+ display: grid;
+ grid-template-columns: repeat(auto-fit, minmax(150px, 1fr));
+ gap: 12px;
+ margin-bottom: 24px; /* space-y-6 */
+ max-width: 900px;
+ margin-left: auto;
+ margin-right: auto;
+ }
+
+ .card {
+ background: #111827; /* bg-gray-900 */
+ border: 1px solid #1f2937; /* border-gray-800 */
+ border-radius: 8px; /* rounded-lg */
+ padding: 14px 16px;
+ text-align: center;
+ }
+
+ .card .label {
+ font-size: 0.75rem; /* text-xs */
+ text-transform: uppercase;
+ letter-spacing: 0.05em; /* tracking-wider */
+ color: #6b7280; /* text-gray-500 */
+ margin-bottom: 4px;
+ }
+
+ .card .value {
+ font-size: 1.5rem;
+ font-weight: 500; /* font-medium — NOT 700 */
+ color: #fff;
+ font-variant-numeric: tabular-nums;
+ }
+
+ .card .value.positive { color: #34d399; } /* text-emerald-400 */
+ .card .value.negative { color: #f87171; } /* text-red-400 */
+ .card .value.neutral { color: #22d3ee; } /* text-cyan-400 */
+
+ /* Charts grid */
+ .charts {
+ display: grid;
+ grid-template-columns: repeat(auto-fit, minmax(380px, 1fr));
+ gap: 16px; /* space-y-4 */
+ margin-bottom: 24px;
+ max-width: 1400px;
+ margin-left: auto;
+ margin-right: auto;
+ }
+
+ .chart-container {
+ background: #111827;
+ border: 1px solid #1f2937;
+ border-radius: 8px;
+ padding: 16px;
+ }
+
+ .chart-container h2 {
+ font-size: 0.875rem;
+ font-weight: 500; /* font-medium */
+ color: #9ca3af; /* text-gray-400 */
+ margin-bottom: 12px;
+ text-align: center;
+ }
+
+ canvas { width: 100% !important; }
+
+ /* Iterations table */
+ .table-section {
+ max-width: 1400px;
+ margin: 0 auto;
+ }
+
+ .table-section h2 {
+ font-size: 1.25rem; /* text-xl */
+ font-weight: 600; /* font-semibold */
+ color: #fff; /* text-white */
+ margin-bottom: 12px;
+ }
+
+ .table-wrapper {
+ overflow-x: auto;
+ border-radius: 8px; /* rounded-lg */
+ border: 1px solid #1f2937;
+ }
+
+ table {
+ width: 100%;
+ border-collapse: collapse;
+ font-size: 0.875rem; /* text-sm */
+ text-align: left;
+ }
+
+ thead {
+ background: rgba(17, 24, 39, 0.5); /* bg-gray-900/50 */
+ border-bottom: 1px solid #1f2937;
+ }
+
+ th {
+ padding: 12px 16px; /* px-4 py-3 */
+ text-align: left;
+ font-weight: 500; /* font-medium */
+ color: #9ca3af; /* text-gray-400 */
+ white-space: nowrap;
+ }
+
+ td {
+ padding: 12px 16px; /* px-4 py-3 */
+ color: #d1d5db; /* text-gray-300 */
+ }
+
+ tbody tr { border-top: 1px solid rgba(31, 41, 55, 0.5); } /* divide-gray-800/50 */
+ tbody tr:hover { background: rgba(17, 24, 39, 0.3); } /* hover:bg-gray-900/30 */
+
+ td.num {
+ text-align: right;
+ font-variant-numeric: tabular-nums;
+ color: #9ca3af; /* text-gray-400 for numbers */
+ }
+
+ .badge {
+ display: inline-block;
+ padding: 2px 8px;
+ border-radius: 6px; /* rounded-md */
+ font-size: 0.75rem;
+ font-weight: 500; /* font-medium */
+ text-transform: uppercase;
+ }
+
+ .badge.keep {
+ border: 1px solid rgba(6, 78, 59, 0.6); /* border-emerald-900/60 */
+ background: rgba(6, 78, 59, 0.3); /* bg-emerald-950/30 */
+ color: #34d399; /* text-emerald-400 */
+ }
+ .badge.discard {
+ border: 1px solid rgba(127, 29, 29, 0.6); /* border-red-900/60 */
+ background: rgba(127, 29, 29, 0.3); /* bg-red-950/30 */
+ color: #f87171; /* text-red-400 */
+ }
+
+ .empty-state {
+ text-align: center;
+ padding: 32px 20px; /* p-8 for empty states */
+ color: #4b5563; /* text-gray-600 */
+ font-size: 0.875rem;
+ }
+
+ .mutation-cell {
+ max-width: 320px;
+ overflow: hidden;
+ text-overflow: ellipsis;
+ white-space: nowrap;
+ }
+ </style>
+ </head>
+ <body>
+
+ <h1>AgentV Autoresearch Trajectory</h1>
+ <div class="subtitle" id="subtitle"></div>
+
+ <!-- Summary cards -->
+ <div class="summary" id="summary"></div>
+
+ <!-- Charts -->
+ <div class="charts" id="charts-area">
+ <div class="chart-container">
+ <h2>Score over Iterations</h2>
+ <canvas id="scoreChart"></canvas>
+ </div>
+ <div class="chart-container">
+ <h2>Per-Assertion Pass Rates</h2>
+ <canvas id="assertionChart"></canvas>
+ </div>
+ <div class="chart-container">
+ <h2>Cumulative Cost (USD)</h2>
+ <canvas id="costChart"></canvas>
+ </div>
+ </div>
+
+ <!-- Iterations table -->
+ <div class="table-section">
+ <h2>Iteration Log</h2>
+ <div class="table-wrapper" id="table-wrapper"></div>
+ </div>
+
+ <script>
+ // ── Data loading ─────────────────────────────────────────────────────
+ // Fetch iterations.jsonl from the same directory. Each line is one JSON object.
+ // Auto-refreshes every 2s (via the meta tag) so the chart stays live.
+ async function loadData() {
+ try {
+ const resp = await fetch('iterations.jsonl');
+ if (!resp.ok) return [];
+ const text = await resp.text();
+ return text.trim().split('\n').filter(Boolean).map(line => JSON.parse(line));
+ } catch (_) {
+ return [];
+ }
+ }
+
+ // ── Helpers ────────────────────────────────────────────────────────────
+ const fmtPct = (n) => (n == null ? '—' : Math.round(Number(n) * 100) + '%');
+ const fmtPctDelta = (n) => (n == null ? '—' : (n >= 0 ? '+' : '') + Math.round(Number(n) * 100) + '%');
+ const fmtUsd = (n) => (n == null ? '—' : '$' + Number(n).toFixed(4));
+
+ // ── Chart.js defaults (aligned with Studio DESIGN.md) ─────────────────
+ Chart.defaults.color = '#9ca3af'; // text-gray-400
+ Chart.defaults.borderColor = 'rgba(31, 41, 55, 0.5)'; // border-gray-800/50
+ Chart.defaults.font.family = 'ui-sans-serif, system-ui, -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif';
+
+ // Cyan accent + Studio data-tone palette
+ const CHART_COLORS = [
+ '#22d3ee', // cyan-400 (primary accent)
+ '#34d399', // emerald-400
+ '#f472b6', // pink-400
+ '#facc15', // yellow-400
+ '#c084fc', // purple-400
+ '#fb923c', // orange-400
+ '#a3e635', // lime-400
+ '#e879f9', // fuchsia-400
+ '#60a5fa', // blue-400
+ '#fbbf24', // amber-400
+ ];
+
+ // ── Render ─────────────────────────────────────────────────────────────
+ function render(data) {
+ if (!data || data.length === 0) {
+ document.getElementById('subtitle').textContent = 'Waiting for data…';
+ document.getElementById('charts-area').style.display = 'none';
+ document.getElementById('summary').innerHTML = '';
+ document.getElementById('table-wrapper').innerHTML =
+ '<div class="empty-state">No iterations recorded yet. The chart will update automatically.</div>';
+ return;
+ }
+
+ // Validate each iteration has at minimum a cycle and score
+ const valid = data.filter(it => it && typeof it.cycle === 'number' && typeof it.score === 'number');
+ if (valid.length === 0) {
+ document.getElementById('subtitle').textContent = 'Data present but missing required fields (cycle, score).';
+ document.getElementById('charts-area').style.display = 'none';
+ document.getElementById('summary').innerHTML = '';
+ document.getElementById('table-wrapper').innerHTML =
+ '<div class="empty-state">Iterations data is malformed. Each entry needs at minimum: cycle (number), score (number).</div>';
+ return;
+ }
+ const sorted = [...valid].sort((a, b) => a.cycle - b.cycle);
+
+ // ── Summary ──────────────────────────────────────────────────────────
+ const originalScore = sorted[0].score;
+ const bestIteration = sorted.reduce((best, it) => (it.score > best.score ? it : best), sorted[0]);
+ const bestScore = bestIteration.score;
+ const totalCycles = sorted.length;
+ const totalCost = sorted.reduce((sum, it) => sum + (parseFloat(it.cost_usd) || 0), 0);
+ const delta = bestScore - originalScore;
+
+ document.getElementById('subtitle').textContent =
+ `${totalCycles} cycle${totalCycles !== 1 ? 's' : ''} completed`;
+
+ const deltaClass = delta > 0 ? 'positive' : delta < 0 ? 'negative' : 'neutral';
+
+ document.getElementById('summary').innerHTML = `
+ <div class="card"><div class="label">Original Score</div><div class="value neutral">${fmtPct(originalScore)}</div></div>
+ <div class="card"><div class="label">Best Score</div><div class="value positive">${fmtPct(bestScore)}</div></div>
+ <div class="card"><div class="label">Improvement</div><div class="value ${deltaClass}">${fmtPctDelta(delta)}</div></div>
+ <div class="card"><div class="label">Total Cycles</div><div class="value">${totalCycles}</div></div>
+ <div class="card"><div class="label">Total Cost</div><div class="value">${fmtUsd(totalCost)}</div></div>
+ `;
+
+ // ── Chart data prep ──────────────────────────────────────────────────
+ const labels = sorted.map(it => it.cycle);
+ const scores = sorted.map(it => it.score);
+ const decisions = sorted.map(it => (it.decision || '').toLowerCase());
+
+ // Cumulative cost (parseFloat guards against string or null values)
+ let cumCost = 0;
+ const cumCosts = sorted.map(it => { cumCost += parseFloat(it.cost_usd) || 0; return cumCost; });
+
+ // Assertion keys (union of all)
+ const assertionKeys = [...new Set(sorted.flatMap(it => Object.keys(it.assertions || {})))].sort();
+
+ document.getElementById('charts-area').style.display = '';
+
+ // ── 1. Score chart (cyan line, emerald/red decision dots) ──────────
+ const scoresPct = scores.map(s => s * 100);
+ const scoreCtx = document.getElementById('scoreChart').getContext('2d');
+ new Chart(scoreCtx, {
+ type: 'line',
+ data: {
+ labels,
+ datasets: [{
+ label: 'Score',
+ data: scoresPct,
+ borderColor: '#22d3ee', // cyan-400
+ backgroundColor: 'rgba(34, 211, 238, 0.1)', // cyan-400/10
+ borderWidth: 2,
+ tension: 0.2,
+ fill: true,
+ pointRadius: 6,
+ pointBorderWidth: 2,
+ pointBackgroundColor: decisions.map(d => d === 'keep' ? '#34d399' : '#f87171'),
+ pointBorderColor: decisions.map(d => d === 'keep' ? '#065f46' : '#7f1d1d'),
+ }],
+ },
+ options: {
+ responsive: true,
+ scales: {
+ y: { min: 0, max: 100, ticks: { stepSize: 10, callback: v => v + '%' } },
+ x: { title: { display: true, text: 'Cycle' } },
+ },
+ plugins: {
+ legend: { display: false },
+ tooltip: {
+ callbacks: {
+ label: (ctx) => `Score: ${Math.round(ctx.parsed.y)}%`,
+ afterLabel: (ctx) => `Decision: ${decisions[ctx.dataIndex] || '—'}`,
+ },
+ },
+ },
+ },
+ });
+
+ // ── 2. Assertion chart ───────────────────────────────────────────────
+ const assertCtx = document.getElementById('assertionChart').getContext('2d');
+ new Chart(assertCtx, {
+ type: 'line',
+ data: {
+ labels,
+ datasets: assertionKeys.map((key, i) => ({
+ label: key,
+ data: sorted.map(it => { const v = (it.assertions || {})[key]; return v == null ? null : v * 100; }),
+ borderColor: CHART_COLORS[i % CHART_COLORS.length],
+ backgroundColor: 'transparent',
+ borderWidth: 2,
+ tension: 0.2,
+ pointRadius: 3,
+ spanGaps: true,
+ })),
+ },
+ options: {
+ responsive: true,
+ scales: {
+ y: { min: 0, max: 100, ticks: { stepSize: 20, callback: v => v + '%' } },
+ x: { title: { display: true, text: 'Cycle' } },
+ },
+ plugins: {
+ legend: { position: 'bottom', labels: { boxWidth: 12, padding: 10, font: { size: 11 } } },
+ tooltip: { callbacks: { label: (ctx) => `${ctx.dataset.label}: ${Math.round(ctx.parsed.y)}%` } },
+ },
+ },
+ });
+
+ // ── 3. Cost chart (cyan line) ─────────────────────────────────────
+ const costCtx = document.getElementById('costChart').getContext('2d');
+ new Chart(costCtx, {
+ type: 'line',
+ data: {
+ labels,
+ datasets: [{
+ label: 'Cumulative Cost (USD)',
+ data: cumCosts,
+ borderColor: '#22d3ee', // cyan-400
+ backgroundColor: 'rgba(34, 211, 238, 0.08)', // cyan-400/8
+ borderWidth: 2,
+ tension: 0.2,
+ fill: true,
+ pointRadius: 3,
+ }],
+ },
+ options: {
+ responsive: true,
+ scales: {
+ y: { ticks: { callback: v => '$' + v.toFixed(2) } },
+ x: { title: { display: true, text: 'Cycle' } },
+ },
+ plugins: { legend: { display: false } },
+ },
+ });
+
+ // ── Table ────────────────────────────────────────────────────────────
+ const rows = sorted.map(it => {
+ const dec = (it.decision || '').toLowerCase();
+ const badge = dec === 'keep'
+ ? '<span class="badge keep">keep</span>'
+ : '<span class="badge discard">drop</span>';
+ const ts = it.timestamp
+ ? new Date(it.timestamp).toLocaleString(undefined, { dateStyle: 'short', timeStyle: 'medium' })
+ : '—';
+ return `<tr>
+ <td class="num">${it.cycle}</td>
+ <td class="num">${fmtPct(it.score)}</td>
+ <td>${badge}</td>
+ <td class="mutation-cell" title="${(it.mutation || '').replace(/"/g, '&quot;')}">${it.mutation || '—'}</td>
+ <td class="num">${fmtUsd(it.cost_usd)}</td>
+ <td style="color:#6b7280">${ts}</td>
+ </tr>`;
+ }).join('');
+
+ document.getElementById('table-wrapper').innerHTML = `
+ <table>
+ <thead><tr>
+ <th style="text-align:right">Cycle</th>
+ <th style="text-align:right">Score</th>
+ <th>Decision</th>
+ <th>Mutation</th>
+ <th style="text-align:right">Cost</th>
+ <th>Timestamp</th>
+ </tr></thead>
+ <tbody>${rows}</tbody>
+ </table>
+ `;
+ }
+
+ loadData().then(render);
+ </script>
+
+ </body>
+ </html>
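
Reviewer note on the trajectory.html hunk: `loadData` wraps every `JSON.parse` call in a single `try`/`catch`, so one malformed line in `iterations.jsonl` (for example a final line that is still being written during a live run) makes it return an empty array and the page falls back to "Waiting for data…". The table code also interpolates `it.mutation` into `innerHTML` without escaping `<`, `>`, or `&`. The sketch below shows one possible more defensive variant, assuming the same field names the page reads (`cycle`, `score`, `cost_usd`, `mutation`). The helper names `parseJsonl`, `escapeHtml`, and `summarize` are hypothetical and not part of the package, and the sample rows are invented for illustration.

```javascript
// Hypothetical helpers; the names are illustrative and not part of agentv.

// Parse JSON Lines text, skipping lines that fail to parse instead of
// discarding the whole file.
function parseJsonl(text) {
  const rows = [];
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed) continue;
    try {
      rows.push(JSON.parse(trimmed));
    } catch (_) {
      // Probably a partially written line; skip it until the next refresh.
    }
  }
  return rows;
}

// Escape text before interpolating it into an HTML string.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Compute the same summary figures the page shows: first score, best score,
// improvement, and total cost across valid iterations.
function summarize(rows) {
  const valid = rows
    .filter(it => it && typeof it.cycle === 'number' && typeof it.score === 'number')
    .sort((a, b) => a.cycle - b.cycle);
  if (valid.length === 0) return null;
  const originalScore = valid[0].score;
  const bestScore = valid.reduce((best, it) => Math.max(best, it.score), originalScore);
  const totalCost = valid.reduce((sum, it) => sum + (parseFloat(it.cost_usd) || 0), 0);
  return {
    cycles: valid.length,
    originalScore,
    bestScore,
    delta: bestScore - originalScore,
    totalCost,
  };
}

// Invented sample data: two complete lines and one truncated line.
const sample = [
  '{"cycle": 1, "score": 0.5, "cost_usd": 0.01, "decision": "keep", "mutation": "baseline"}',
  '{"cycle": 2, "score": 0.75, "cost_usd": "0.02", "decision": "keep", "mutation": "<b>tighten</b> prompt"}',
  '{"cycle": 3, "score": 0.7',
].join('\n');

const rows = parseJsonl(sample);
const summary = summarize(rows);
console.log(summary);
console.log(escapeHtml(rows[1].mutation));
```

Skipping unparseable lines suits a file that is appended to while the page refreshes every two seconds; a stricter tool might instead report the line number of each failure.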
package/dist/skills/agentv-eval-review/SKILL.md
@@ -0,0 +1,53 @@
+ ---
+ name: agentv-eval-review
+ description: >-
+ Use when reviewing eval YAML files for quality issues, linting eval files before
+ committing, checking eval schema compliance, or when asked to "review these evals",
+ "check eval quality", "lint eval files", or "validate eval structure".
+ Do NOT use for writing evals (use agentv-eval-writer) or running evals (use agentv-bench).
+ ---
+
+ # Eval Review
+
+ ## Overview
+
+ Lint and review AgentV eval YAML files for structural issues, schema compliance, and quality problems. Apply this checklist deterministically first, then layer LLM judgment for semantic issues a checklist cannot catch.
+
+ ## Process
+
+ ### Step 1: Structural checklist
+
+ Walk every target eval file and report violations grouped by severity (error > warning > info). For each finding, include the file path and a concrete fix.
+
+ - File extension is `.eval.yaml` (error if not).
+ - `description` field is present at the top level (error if missing).
+ - Each entry under `tests` has `id`, `input`, and at least one of `criteria` / `expected_output` / `assertions` (error if missing).
+ - File-typed inputs (`type: file`) use a leading `/` in their `path` (error if relative).
+ - Tests have an `assertions` block — flag tests that rely solely on `expected_output` (warning).
+ - Detect `expected_output` prose patterns like "The agent should…" or "The output is…" (warning — prose belongs in `criteria`, structured matches in `assertions`).
+ - Identical file inputs repeated across multiple tests in the same eval should be hoisted to a top-level `input` (info).
+ - Eval files in the same directory should share a common `id` prefix (info — flag drift).
+
+ ### Step 2: Semantic review (LLM judgment)
+
+ The structural checklist catches mechanical issues but cannot assess:
+ - **Factual accuracy** — Do tool/command names in expected_output match what the skill documents?
+ - **Coverage gaps** — Are important edge cases missing?
+ - **Assertion discriminability** — Would assertions pass for both good and bad output?
+ - **Cross-file consistency** — Do output filenames match across evals and skills?
+
+ Read the relevant SKILL.md files and cross-check against the eval content for these issues.
+
+ ## Accessing reference files
+
+ To load a specific reference without pulling the entire skill into context:
+
+ ```bash
+ agentv skills get agentv-eval-review --ref <filename>
+ ```
+
+ Or resolve the skill directory and read files directly:
+
+ ```bash
+ cat $(agentv skills path agentv-eval-review)/references/<filename>.md
+ ```
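
Reviewer note on the SKILL.md hunk: the error and warning items of the Step 1 checklist can be sketched as a pure function over an eval document that has already been parsed from YAML. The package ships its own linter (`scripts/lint_eval.py` in the file list), which this diff does not show; the JavaScript below is not that script, only an illustration of the rules as worded in the checklist. The names `lintEval` and `collectFileInputs`, the shape of the findings, and the sample document are hypothetical. Because the eval schema is not shown here, the sketch searches `input` recursively for objects with `type: file` rather than assuming where they live, and it leaves out the two info-level checks that need cross-test or cross-file context.

```javascript
// Hypothetical sketch of the Step 1 checklist; not the package's lint_eval.py.

// Recursively collect objects shaped like { type: 'file', ... } from any value.
function collectFileInputs(node, found = []) {
  if (Array.isArray(node)) {
    node.forEach(child => collectFileInputs(child, found));
  } else if (node && typeof node === 'object') {
    if (node.type === 'file') found.push(node);
    Object.values(node).forEach(child => collectFileInputs(child, found));
  }
  return found;
}

// Return findings sorted by severity (error, then warning, then info).
function lintEval(filePath, doc) {
  const findings = [];
  const add = (severity, message) => findings.push({ severity, file: filePath, message });

  if (!filePath.endsWith('.eval.yaml')) add('error', 'File extension should be .eval.yaml');
  if (!doc.description) add('error', 'Missing top-level description');

  for (const test of doc.tests || []) {
    const label = test.id || '(no id)';
    if (!test.id) add('error', 'Test is missing id');
    if (test.input == null) add('error', `Test ${label} is missing input`);

    const hasCheck = ['criteria', 'expected_output', 'assertions'].some(key => test[key] != null);
    if (!hasCheck) add('error', `Test ${label} needs criteria, expected_output, or assertions`);

    for (const file of collectFileInputs(test.input)) {
      if (typeof file.path !== 'string' || !file.path.startsWith('/')) {
        add('error', `Test ${label} has a file input without a leading "/": ${file.path}`);
      }
    }

    // Interpretation: "relies solely on expected_output" means no assertions and no criteria.
    if (test.expected_output != null && test.assertions == null && test.criteria == null) {
      add('warning', `Test ${label} relies solely on expected_output`);
    }
    if (typeof test.expected_output === 'string' &&
        /^The (agent should|output is)\b/.test(test.expected_output)) {
      add('warning', `Test ${label} has prose in expected_output; move it to criteria`);
    }
  }

  const order = { error: 0, warning: 1, info: 2 };
  return findings.sort((a, b) => order[a.severity] - order[b.severity]);
}

// Invented sample document; the assertion entry is a placeholder, not a real schema.
const findings = lintEval('checks.yaml', {
  tests: [
    {
      id: 'greets',
      input: [{ type: 'file', path: 'fixtures/a.txt' }],
      expected_output: 'The agent should greet the user.',
    },
    {
      id: 'ok',
      input: 'hello',
      criteria: 'Replies with a greeting',
      assertions: [{ placeholder: true }],
    },
  ],
});

for (const f of findings) console.log(`${f.severity}: ${f.file}: ${f.message}`);
```

Keeping the function free of file and YAML handling makes each rule easy to test in isolation; a real linter would also attach the concrete fix the checklist asks for.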