agentv 0.26.0 → 1.0.0
- package/dist/{chunk-6ZM7WVSC.js → chunk-RIJO5WBF.js} +13 -13
- package/dist/chunk-RIJO5WBF.js.map +1 -0
- package/dist/cli.js +1 -1
- package/dist/cli.js.map +1 -1
- package/dist/index.js +1 -1
- package/dist/templates/.claude/skills/agentv-eval-builder/SKILL.md +36 -19
- package/dist/templates/.claude/skills/agentv-eval-builder/references/eval-schema.json +217 -217
- package/dist/templates/.claude/skills/agentv-eval-builder/references/example-evals.md +94 -2
- package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md +8 -8
- package/package.json +1 -1
- package/dist/chunk-6ZM7WVSC.js.map +0 -1
- package/dist/templates/agentv/.env.template +0 -23
package/dist/templates/.claude/skills/agentv-eval-builder/references/tool-trajectory-evaluator.md
CHANGED
@@ -76,7 +76,7 @@ execution:
 - Strict protocol validation
 - Regression testing specific behavior
 
-## Expected
+## Expected Tool Calls Evaluator
 
 For simpler cases, specify tool_calls inline in `expected_messages`:
 
@@ -84,11 +84,11 @@ For simpler cases, specify tool_calls inline in `expected_messages`:
 evalcases:
 - id: research-task
 expected_outcome: Agent searches and retrieves documents
-
+
 input_messages:
 - role: user
 content: Research REST vs GraphQL differences
-
+
 expected_messages:
 - role: assistant
 content: I'll research this topic.
@@ -96,11 +96,11 @@ evalcases:
 - tool: knowledgeSearch
 - tool: knowledgeSearch
 - tool: documentRetrieve
-
+
 execution:
 evaluators:
 - name: tool-validator
-type:
+type: expected_tool_calls
 ```
 
 ### With Input Matching
@@ -130,7 +130,7 @@ expected_messages:
 | `in_order` | (matched tools in sequence) / (expected tools count) |
 | `exact` | (correctly positioned tools) / (expected tools count) |
 
-###
+### expected_tool_calls Scoring
 
 Sequential matching: `(matched tool_calls) / (expected tool_calls)`
 
@@ -215,7 +215,7 @@ evalcases:
 execution:
 evaluators:
 - name: pipeline-check
-type:
+type: expected_tool_calls
 ```
 
 ## CLI Options for Traces
@@ -234,4 +234,4 @@ agentv eval evals/test.yaml --include-trace
 2. **Start with any_order** - Then tighten to `in_order` or `exact` as needed
 3. **Combine with other evaluators** - Use tool trajectory for execution, LLM judge for output quality
 4. **Test with --dump-traces** - Inspect actual traces to understand agent behavior
-5. **Use
+5. **Use expected_tool_calls for simple cases** - It's more readable for basic tool validation
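Note on the scoring rows in the `@@ -130,7 +130,7 @@` hunk: the diff only shows the ratios, not how agentv computes them. The sketch below is one possible reading, not the package's implementation. It assumes `in_order` means a greedy left-to-right subsequence match and `exact` means a position-by-position comparison; the function names `in_order_score` and `exact_score` are hypothetical and do not come from agentv.

```python
from typing import Sequence


def in_order_score(expected: Sequence[str], actual: Sequence[str]) -> float:
    """(matched tools in sequence) / (expected tools count).

    Assumption: an expected tool is matched by the first occurrence in
    `actual` that comes after the previous match (greedy subsequence).
    """
    if not expected:
        return 1.0  # assumption: nothing expected is treated as a full score
    calls = list(actual)
    matched = 0
    cursor = 0  # index in `calls` where the next search starts
    for tool in expected:
        try:
            found_at = calls.index(tool, cursor)
        except ValueError:
            continue  # this expected tool is missing; try the next one
        matched += 1
        cursor = found_at + 1
    return matched / len(expected)


def exact_score(expected: Sequence[str], actual: Sequence[str]) -> float:
    """(correctly positioned tools) / (expected tools count).

    Assumption: a tool counts only if it sits at the same index in `actual`.
    """
    if not expected:
        return 1.0  # assumption: nothing expected is treated as a full score
    positioned = sum(
        1
        for index, tool in enumerate(expected)
        if index < len(actual) and actual[index] == tool
    )
    return positioned / len(expected)


# Tool names taken from the research-task example in this diff.
EXPECTED = ["knowledgeSearch", "knowledgeSearch", "documentRetrieve"]
```

Under these assumptions, an agent that calls only `knowledgeSearch` and then `documentRetrieve` scores 2/3 with `in_order_score(EXPECTED, ...)` and 1/3 with `exact_score(EXPECTED, ...)`, which illustrates why the best-practices list suggests starting loose and tightening later.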