agentv 1.0.0 → 1.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/dist/cli.js CHANGED
@@ -1,7 +1,7 @@
 #!/usr/bin/env node
 import {
   runCli
-} from "./chunk-RIJO5WBF.js";
+} from "./chunk-IVIT4U6S.js";
 import "./chunk-UE4GLFVL.js";
 
 // src/cli.ts
package/dist/index.js CHANGED
@@ -1,7 +1,7 @@
 import {
   app,
   runCli
-} from "./chunk-RIJO5WBF.js";
+} from "./chunk-IVIT4U6S.js";
 import "./chunk-UE4GLFVL.js";
 export {
   app,
@@ -79,22 +79,6 @@ execution:
 
 See `references/tool-trajectory-evaluator.md` for modes and configuration.
 
-### Expected Tool Calls Evaluators
-Validate tool calls and inputs inline with conversation flow:
-
-```yaml
-expected_messages:
-  - role: assistant
-    tool_calls:
-      - tool: getMetrics
-        input: { server: "prod-1" }
-
-execution:
-  evaluators:
-    - name: input_check
-      type: expected_tool_calls
-```
-
 ### Multiple Evaluators
 Define multiple evaluators to run sequentially. The final score is a weighted average of all results.
 
@@ -142,33 +142,6 @@ evalcases:
       - tool: generateToken
 ```
 
-## Expected Messages with Tool Calls
-
-Validate precise tool inputs inline with expected messages.
-
-```yaml
-$schema: agentv-eval-v2
-description: Tool input validation
-target: mock_agent
-
-evalcases:
-  - id: precise-inputs
-    expected_outcome: Agent calls tools with correct parameters
-    input_messages:
-      - role: user
-        content: Check CPU metrics for prod-1
-    expected_messages:
-      - role: assistant
-        content: Checking metrics...
-        tool_calls:
-          - tool: getCpuMetrics
-            input: { server: "prod-1" }
-    execution:
-      evaluators:
-        - name: input-validator
-          type: expected_tool_calls
-```
-
 ## Static Trace Evaluation
 
 Evaluate pre-existing trace files without running an agent.
@@ -2,13 +2,6 @@
 
 Tool trajectory evaluators validate that an agent used the expected tools during execution. They work with trace data returned by agent providers (codex, vscode, cli with trace support).
 
-## Evaluator Types
-
-AgentV provides two ways to validate tool usage:
-
-1. **`tool_trajectory`** - Dedicated evaluator with configurable matching modes
-2. **`expected_messages`** - Inline tool_calls in expected_messages for simpler cases
-
 ## Tool Trajectory Evaluator
 
 ### Modes
@@ -76,50 +69,6 @@ execution:
 - Strict protocol validation
 - Regression testing specific behavior
 
-## Expected Tool Calls Evaluator
-
-For simpler cases, specify tool_calls inline in `expected_messages`:
-
-```yaml
-evalcases:
-  - id: research-task
-    expected_outcome: Agent searches and retrieves documents
-
-    input_messages:
-      - role: user
-        content: Research REST vs GraphQL differences
-
-    expected_messages:
-      - role: assistant
-        content: I'll research this topic.
-        tool_calls:
-          - tool: knowledgeSearch
-          - tool: knowledgeSearch
-          - tool: documentRetrieve
-
-    execution:
-      evaluators:
-        - name: tool-validator
-          type: expected_tool_calls
-```
-
-### With Input Matching
-
-Validate specific inputs were passed to tools:
-
-```yaml
-expected_messages:
-  - role: assistant
-    content: Checking metrics...
-    tool_calls:
-      - tool: getCpuMetrics
-        input:
-          server: prod-1
-      - tool: getMemoryMetrics
-        input:
-          server: prod-1
-```
-
 ## Scoring
 
 ### tool_trajectory Scoring
@@ -130,10 +79,6 @@ expected_messages:
 | `in_order` | (matched tools in sequence) / (expected tools count) |
 | `exact` | (correctly positioned tools) / (expected tools count) |
 
-### expected_tool_calls Scoring
-
-Sequential matching: `(matched tool_calls) / (expected tool_calls)`
-
 ## Trace Data Requirements
 
 Tool trajectory evaluators require trace data from the agent provider. Supported providers:
@@ -198,24 +143,21 @@ evalcases:
 evalcases:
   - id: data-pipeline
     expected_outcome: Process data through complete pipeline
-
+
     input_messages:
       - role: user
         content: Process the customer dataset
-
-    expected_messages:
-      - role: assistant
-        content: Processing data...
-        tool_calls:
-          - tool: loadData
-          - tool: validate
-          - tool: transform
-          - tool: export
-
+
     execution:
       evaluators:
        - name: pipeline-check
-          type: expected_tool_calls
+          type: tool_trajectory
+          mode: exact
+          expected:
+            - tool: loadData
+            - tool: validate
+            - tool: transform
+            - tool: export
 ```
 
 ## CLI Options for Traces
@@ -234,4 +176,4 @@ agentv eval evals/test.yaml --include-trace
 2. **Start with any_order** - Then tighten to `in_order` or `exact` as needed
 3. **Combine with other evaluators** - Use tool trajectory for execution, LLM judge for output quality
 4. **Test with --dump-traces** - Inspect actual traces to understand agent behavior
-5. **Use expected_tool_calls for simple cases** - It's more readable for basic tool validation
+5. **Use code evaluators for custom validation** - Write custom tool validation scripts with access to trace data
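For readers migrating across this release: the removed `expected_tool_calls` sections above map onto the `tool_trajectory` evaluator that remains in 1.2.0. A minimal sketch of an equivalent evaluator config, reusing the tool names from the removed `research-task` example; the field set (`mode`, `expected`) is inferred from the hunks in this diff, not from official documentation:

```yaml
# Sketch: replacing a removed expected_tool_calls check with tool_trajectory
# (agentv >= 1.2.0). Field names assumed from the hunks above.
execution:
  evaluators:
    - name: tool-validator
      type: tool_trajectory
      mode: in_order          # any_order | in_order | exact
      expected:
        - tool: knowledgeSearch
        - tool: documentRetrieve
```

Per the updated best-practices list, `any_order` is a reasonable starting mode, tightened to `in_order` or `exact` once the expected sequence is stable.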
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "agentv",
-  "version": "1.0.0",
+  "version": "1.2.0",
   "description": "CLI entry point for AgentV",
   "type": "module",
   "repository": {