agentv 4.26.1 → 4.27.0-next.1

Files changed (42)
  1. package/dist/{chunk-JA4WQNE6.js → chunk-47JX7NNZ.js} +10 -2
  2. package/dist/chunk-47JX7NNZ.js.map +1 -0
  3. package/dist/{chunk-XBUHMRX2.js → chunk-V3LWJB5X.js} +431 -49
  4. package/dist/chunk-V3LWJB5X.js.map +1 -0
  5. package/dist/cli.js +2 -2
  6. package/dist/index.js +2 -2
  7. package/dist/{interactive-YMKWKPD7.js → interactive-L6PIIFNQ.js} +2 -2
  8. package/dist/skills/agentv-bench/LICENSE.txt +202 -0
  9. package/dist/skills/agentv-bench/SKILL.md +459 -0
  10. package/dist/skills/agentv-bench/agents/analyzer.md +177 -0
  11. package/dist/skills/agentv-bench/agents/comparator.md +247 -0
  12. package/dist/skills/agentv-bench/agents/executor.md +30 -0
  13. package/dist/skills/agentv-bench/agents/grader.md +238 -0
  14. package/dist/skills/agentv-bench/agents/mutator.md +172 -0
  15. package/dist/skills/agentv-bench/references/autoresearch.md +309 -0
  16. package/dist/skills/agentv-bench/references/description-optimization.md +66 -0
  17. package/dist/skills/agentv-bench/references/environment-adaptation.md +82 -0
  18. package/dist/skills/agentv-bench/references/eval-yaml-spec.md +338 -0
  19. package/dist/skills/agentv-bench/references/migrating-from-skill-creator.md +103 -0
  20. package/dist/skills/agentv-bench/references/schemas.md +432 -0
  21. package/dist/skills/agentv-bench/references/subagent-pipeline.md +181 -0
  22. package/dist/skills/agentv-bench/scripts/trajectory.html +462 -0
  23. package/dist/skills/agentv-eval-review/SKILL.md +53 -0
  24. package/dist/skills/agentv-eval-review/scripts/lint_eval.py +239 -0
  25. package/dist/skills/agentv-eval-writer/SKILL.md +707 -0
  26. package/dist/skills/agentv-eval-writer/references/config-schema.json +63 -0
  27. package/dist/skills/agentv-eval-writer/references/custom-evaluators.md +119 -0
  28. package/dist/skills/agentv-eval-writer/references/eval-schema.json +19077 -0
  29. package/dist/skills/agentv-eval-writer/references/rubric-evaluator.md +114 -0
  30. package/dist/skills/agentv-governance/SKILL.md +79 -0
  31. package/dist/skills/agentv-governance/references/eu-ai-act-risk-tiers.md +37 -0
  32. package/dist/skills/agentv-governance/references/governance-yaml-shape.md +125 -0
  33. package/dist/skills/agentv-governance/references/iso-42001-controls.md +46 -0
  34. package/dist/skills/agentv-governance/references/lint-rules.md +169 -0
  35. package/dist/skills/agentv-governance/references/mitre-atlas.md +38 -0
  36. package/dist/skills/agentv-governance/references/owasp-agentic-top-10-2025.md +28 -0
  37. package/dist/skills/agentv-governance/references/owasp-llm-top-10-2025.md +25 -0
  38. package/dist/skills/agentv-trace-analyst/SKILL.md +161 -0
  39. package/package.json +1 -1
  40. package/dist/chunk-JA4WQNE6.js.map +0 -1
  41. package/dist/chunk-XBUHMRX2.js.map +0 -1
  42. package/dist/{interactive-YMKWKPD7.js.map → interactive-L6PIIFNQ.js.map} +0 -0
@@ -0,0 +1,161 @@
1
+ ---
2
+ name: agentv-trace-analyst
3
+ description: >-
4
+ Analyze AgentV evaluation traces and result JSONL files using `agentv inspect` and `agentv compare` CLI commands.
5
+ Use when asked to inspect AgentV eval results, find regressions between AgentV evaluation runs,
6
+ identify failure patterns in AgentV trace data, analyze tool trajectories, or compute cost/latency/score statistics
7
+ from AgentV result files.
8
+ Do NOT use for benchmarking skill trigger accuracy, analyzing skill-creator eval performance,
9
+ or measuring skill description quality — those tasks belong to the skill-creator skill.
10
+ ---
11
+
12
+ # AgentV Trace Analyst
13
+
14
+ Analyze evaluation traces headlessly using `agentv inspect` primitives and `jq`.
15
+
16
+ ## Primitives
17
+
18
+ ```bash
19
+ # List result files (most recent first)
20
+ agentv inspect list [--limit N] [--format json|table]
21
+
22
+ # Show results with trace details
23
+ agentv inspect show <result-file> [--test-id <id>] [--tree] [--format json|table]
24
+
25
+ # Percentile statistics
26
+ agentv inspect stats <result-file> [--group-by target|suite|test-id] [--format json|table]
27
+
28
+ # A/B comparison between runs
29
+ agentv compare <baseline.jsonl> <candidate.jsonl> [--threshold 0.1] [--format json|table]
30
+ ```
31
+
32
+ ## Analysis Workflow
33
+
34
+ ### 1. Discover results
35
+
36
+ ```bash
37
+ agentv inspect list
38
+ ```
39
+
40
+ Pick the result file to analyze. Most recent is first.
41
+
42
+ ### 2. Get overview
43
+
44
+ ```bash
45
+ agentv inspect stats <result-file>
46
+ ```
47
+
48
+ Read the percentile table. Key signals:
49
+ - **score p50 < 0.8**: Significant quality issues
50
+ - **latency p90 > 30s**: Performance bottleneck
51
+ - **cost p99 spike**: Outlier cost tests to investigate
52
+ - **tool_calls p90 >> p50**: Some tests are much chattier
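The first two signals above can be checked mechanically once results are loaded as a list of records. The following is a minimal sketch, not AgentV's implementation: it assumes each record carries `score` and `duration_ms` (field names taken from the jq filter in step 3), and it uses a nearest-rank percentile, which may differ from whatever method `agentv inspect stats` uses internally.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def key_signals(records):
    """Return the names of the threshold signals that fire for these records."""
    scores = [r["score"] for r in records]
    latencies_s = [r["duration_ms"] / 1000 for r in records]
    signals = []
    if percentile(scores, 50) < 0.8:
        signals.append("score p50 < 0.8")
    if percentile(latencies_s, 90) > 30:
        signals.append("latency p90 > 30s")
    return signals


# Hypothetical sample data, for illustration only.
records = [
    {"test_id": "a", "score": 0.9, "duration_ms": 4_000},
    {"test_id": "b", "score": 0.5, "duration_ms": 45_000},
    {"test_id": "c", "score": 0.7, "duration_ms": 12_000},
]
print(key_signals(records))
```

With so few records the percentiles are coarse; on real runs the CLI's own table is the better source, and a script like this is mainly useful for gating in automation.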
53
+
54
+ ### 3. Investigate failures
55
+
56
+ ```bash
57
+ agentv inspect show <result-file> --format json | jq '[.[] | select(.score < 0.8) | {test_id, score, assertions: [.assertions[] | select(.passed | not)], trace: {tools: (.trace.tool_calls // {} | keys)}, duration_ms, cost_usd}]'
58
+ ```
59
+
60
+ For each failing test, examine:
61
+ - **assertions (failed)**: What criteria were not met? (filter for `passed: false`)
62
+ - **trace.tool_calls**: Did the agent use expected tools?
63
+ - **duration_ms**: Did it time out or run too long?
64
+ - **reasoning**: Why did the grader score it low?
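The same triage can be done in Python when jq is not available. This is a rough sketch that mirrors the jq filter above, assuming the `--format json` output has already been parsed into a list of dicts with the fields that filter references (`test_id`, `score`, `assertions[].passed`, `trace.tool_calls`, `duration_ms`, `cost_usd`). The `text` key on assertions in the sample data is hypothetical.

```python
def failing_tests(records, threshold=0.8):
    """Summarize records scoring below threshold, keeping only failed assertions."""
    summary = []
    for r in records:
        # Treat a missing or null score as 0 so the record is surfaced, not hidden.
        if (r.get("score") or 0.0) >= threshold:
            continue
        trace = r.get("trace") or {}
        summary.append({
            "test_id": r.get("test_id"),
            "score": r.get("score"),
            "failed_assertions": [
                a for a in (r.get("assertions") or []) if not a.get("passed")
            ],
            # Tolerate records that have no trace or no tool calls.
            "tools": sorted((trace.get("tool_calls") or {}).keys()),
            "duration_ms": r.get("duration_ms"),
            "cost_usd": r.get("cost_usd"),
        })
    return summary


# Hypothetical sample data, for illustration only.
sample = [
    {"test_id": "t1", "score": 1.0,
     "assertions": [{"text": "answers question", "passed": True}],
     "trace": {"tool_calls": {"search": 2}}, "duration_ms": 1200, "cost_usd": 0.01},
    {"test_id": "t2", "score": 0.4,
     "assertions": [{"text": "cites source", "passed": False},
                    {"text": "answers question", "passed": True}],
     "trace": {"tool_calls": {"search": 5, "read_file": 1}},
     "duration_ms": 9000, "cost_usd": 0.05},
    {"test_id": "t3", "score": 0.0, "assertions": [], "trace": None,
     "duration_ms": 30000, "cost_usd": 0.0},
]
for row in failing_tests(sample):
    print(row["test_id"], row["tools"], len(row["failed_assertions"]))
```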
65
+
66
+ ### 4. Inspect specific tests
67
+
68
+ ```bash
69
+ # Flat view with trace summary
70
+ agentv inspect show <result-file> --test-id <id>
71
+
72
+ # Tree view (if output messages available)
73
+ agentv inspect show <result-file> --test-id <id> --tree
74
+ ```
75
+
76
+ The tree view shows the agent's execution path — LLM calls interspersed with tool invocations. Look for:
77
+ - **Excessive tool calls**: Agent looping or exploring unnecessarily
78
+ - **Missing tools**: Expected tool not called
79
+ - **Long durations**: Specific tool calls that are slow
80
+
81
+ ### 5. Compare runs
82
+
83
+ ```bash
84
+ agentv compare <baseline.jsonl> <candidate.jsonl>
85
+ ```
86
+
87
+ Look for:
88
+ - **Wins vs losses**: Net improvement or regression?
89
+ - **Mean delta**: Overall direction of change
90
+ - **Per-test deltas**: Which tests regressed?
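To make the three quantities concrete, here is a small sketch of how wins, losses, and mean delta could be derived from two runs. It is an illustration under stated assumptions, not the algorithm `agentv compare` uses: scores are given as plain `test_id -> score` mappings, only tests present in both runs are compared, and the 0.1 default is read as the minimum change that counts as a win or loss.

```python
def compare_runs(baseline, candidate, threshold=0.1):
    """Compare two {test_id: score} mappings over the tests they share."""
    matched = sorted(set(baseline) & set(candidate))
    deltas = {t: candidate[t] - baseline[t] for t in matched}
    return {
        "wins": [t for t in matched if deltas[t] > threshold],
        "losses": [t for t in matched if deltas[t] < -threshold],
        "mean_delta": sum(deltas.values()) / len(deltas) if deltas else 0.0,
        "deltas": deltas,
    }


# Hypothetical scores, for illustration only.
baseline = {"a": 0.5, "b": 0.9, "c": 0.8, "d": 0.7}
candidate = {"a": 0.75, "b": 0.5, "c": 0.8, "e": 1.0}

report = compare_runs(baseline, candidate)
print(report["wins"], report["losses"], round(report["mean_delta"], 3))
```

Tests that appear in only one run (`d` and `e` here) are ignored by this sketch; in practice they are worth listing separately, since a dropped test can hide a regression.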
91
+
92
+ ### 6. Group analysis
93
+
94
+ ```bash
95
+ # By target provider
96
+ agentv inspect stats <result-file> --group-by target
97
+
98
+ # By suite
99
+ agentv inspect stats <result-file> --group-by suite
100
+ ```
101
+
102
+ Compare providers side-by-side: which is cheaper, faster, more accurate?
103
+
104
+ ## Advanced Queries with jq
105
+
106
+ All commands support `--format json` for piping to `jq`:
107
+
108
+ ```bash
109
+ # Top 3 most expensive tests
110
+ agentv inspect show <result-file> --format json \
111
+ | jq 'sort_by(-.cost_usd) | .[0:3] | .[] | {test_id, cost: .cost_usd, score}'
112
+
113
+ # Tests where token usage exceeds 10k
114
+ agentv inspect show <result-file> --format json \
115
+ | jq '[.[] | select(.token_usage.input + .token_usage.output > 10000) | {test_id, tokens: (.token_usage.input + .token_usage.output)}]'
116
+
117
+ # Score distribution by suite
118
+ agentv inspect show <result-file> --format json \
119
+ | jq 'group_by(.suite) | .[] | {suite: .[0].suite, count: length, avg_score: ([.[].score] | add / length)}'
120
+
121
+ # Tool usage frequency across all tests
122
+ agentv inspect show <result-file> --format json \
123
+ | jq '[.[].trace.tool_calls // {} | to_entries[]] | group_by(.key) | .[] | {tool: .[0].key, total_calls: ([.[].value] | add)}'
124
+
125
+ # Find regressions > 0.1 between two runs
126
+ agentv compare baseline.jsonl candidate.jsonl --format json \
127
+ | jq '.matched[] | select(.delta < -0.1) | {test_id: .testId, delta, from: .score1, to: .score2}'
128
+ ```
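For longer analyses, the same aggregations may be easier to maintain in a script. The sketch below reproduces two of the queries above (tool usage frequency and most expensive tests) in Python, assuming the parsed `--format json` output is a list of dicts and that `trace.tool_calls` maps tool names to call counts, as the jq filters imply.

```python
from collections import Counter


def tool_usage(records):
    """Total call count per tool across all records."""
    totals = Counter()
    for r in records:
        calls = (r.get("trace") or {}).get("tool_calls") or {}
        totals.update(calls)  # adds the per-record counts to the running totals
    return totals


def most_expensive(records, n=3):
    """The n records with the highest cost_usd; missing costs count as 0."""
    return sorted(records, key=lambda r: r.get("cost_usd") or 0.0, reverse=True)[:n]


# Hypothetical sample data, for illustration only.
records = [
    {"test_id": "t1", "cost_usd": 0.02, "trace": {"tool_calls": {"search": 2}}},
    {"test_id": "t2", "cost_usd": 0.10,
     "trace": {"tool_calls": {"search": 1, "read_file": 4}}},
    {"test_id": "t3", "cost_usd": None, "trace": None},
]
print(tool_usage(records).most_common())
print([r["test_id"] for r in most_expensive(records, n=2)])
```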
129
+
130
+ ## Reasoning Patterns
131
+
132
+ When analyzing traces, think about:
133
+
134
+ 1. **Efficiency**: Are tool calls/tokens proportional to task complexity? High tokens-per-tool may indicate verbose prompts or unnecessary context.
135
+
136
+ 2. **Error patterns**: Do failures cluster by target, suite, or tool usage? Common patterns:
137
+ - Tool errors → agent can't access required resources
138
+ - High LLM calls with low tool calls → agent stuck in reasoning loop
139
+ - Missing tool calls → wrong tool routing
140
+
141
+ 3. **Cost optimization**: Identify tests with high cost but acceptable scores — can they use a cheaper model? Compare `--group-by target` stats.
142
+
143
+ 4. **Latency distribution**: P50 vs P99 spread indicates consistency. Large spread means unpredictable performance — investigate P99 outliers.
144
+
145
+ 5. **Regression detection**: After a prompt/config change, compare before/after. Mean delta > 0 is good, but check individual test regressions — a few large losses can hide behind many small wins.
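The error-pattern question in item 2 (do failures cluster by target or suite?) can be answered with a simple grouped failure rate. This is a minimal sketch under assumptions: records are dicts with a `score`, the grouping key exists as a top-level field (`suite` appears in the jq examples; `target` is assumed to be named the same way as the `--group-by` option), and the 0.8 pass threshold is borrowed from step 3.

```python
from collections import defaultdict


def failure_rate_by(records, key, threshold=0.8):
    """Fraction of records scoring below threshold, per value of `key`."""
    counts = defaultdict(lambda: [0, 0])  # group -> [failures, total]
    for r in records:
        group = r.get(key, "unknown")
        counts[group][1] += 1
        if (r.get("score") or 0.0) < threshold:
            counts[group][0] += 1
    return {group: failed / total for group, (failed, total) in counts.items()}


# Hypothetical sample data, for illustration only.
records = [
    {"test_id": "t1", "target": "model-x", "suite": "retrieval", "score": 0.9},
    {"test_id": "t2", "target": "model-x", "suite": "coding", "score": 0.5},
    {"test_id": "t3", "target": "model-y", "suite": "coding", "score": 1.0},
    {"test_id": "t4", "target": "model-y", "suite": "coding", "score": 0.3},
]
print(failure_rate_by(records, "target"))
print(failure_rate_by(records, "suite"))
```

A group whose failure rate is far above the others is a reasonable place to start reading individual traces.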
146
+
147
+ ## Accessing reference files
148
+
149
+ To load a specific reference without pulling the entire skill into context:
150
+
151
+ ```bash
152
+ agentv skills get agentv-trace-analyst --ref <filename>
153
+ ```
154
+
155
+ Or resolve the skill directory and read files directly:
156
+
157
+ ```bash
158
+ cat "$(agentv skills path agentv-trace-analyst)/references/<filename>.md"
159
+ ```
160
+
161
+ Use `--full` to retrieve every file in the skill at once.
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "agentv",
3
- "version": "4.26.1",
3
+ "version": "4.27.0-next.1",
4
4
  "description": "CLI entry point for AgentV",
5
5
  "type": "module",
6
6
  "repository": {