agentv 0.25.0 → 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/dist/cli.js CHANGED
@@ -1,7 +1,7 @@
 #!/usr/bin/env node
 import {
   runCli
-} from "./chunk-ZVSFP6NK.js";
+} from "./chunk-RIJO5WBF.js";
 import "./chunk-UE4GLFVL.js";
 
 // src/cli.ts
package/dist/index.js CHANGED
@@ -1,7 +1,7 @@
 import {
   app,
   runCli
-} from "./chunk-ZVSFP6NK.js";
+} from "./chunk-RIJO5WBF.js";
 import "./chunk-UE4GLFVL.js";
 export {
   app,
@@ -44,7 +44,7 @@ execution:
 ```
 
 **Contract:**
-- Input (stdin): JSON with `question`, `expected_outcome`, `reference_answer`, `candidate_answer`, `guideline_paths`, `input_files`, `input_segments`
+- Input (stdin): JSON with `question`, `expected_outcome`, `reference_answer`, `candidate_answer`, `guideline_files` (file paths), `input_files` (file paths, excludes guidelines), `input_messages`
 - Output (stdout): JSON with `score` (0.0-1.0), `hits`, `misses`, `reasoning`
 
 **Template:** See `references/custom-evaluators.md` for Python code evaluator template
@@ -61,8 +61,42 @@ execution:
       model: gpt-5-chat
 ```
 
-### Evaluator Chaining
-Evaluators run sequentially:
+### Tool Trajectory Evaluators
+Validate agent tool usage patterns (requires trace data from provider):
+
+```yaml
+execution:
+  evaluators:
+    - name: research_check
+      type: tool_trajectory
+      mode: any_order # Options: any_order, in_order, exact
+      minimums: # For any_order mode
+        knowledgeSearch: 2
+      expected: # For in_order/exact modes
+        - tool: knowledgeSearch
+        - tool: documentRetrieve
+```
+
+See `references/tool-trajectory-evaluator.md` for modes and configuration.
+
+### Expected Tool Calls Evaluators
+Validate tool calls and inputs inline with conversation flow:
+
+```yaml
+expected_messages:
+  - role: assistant
+    tool_calls:
+      - tool: getMetrics
+        input: { server: "prod-1" }
+
+execution:
+  evaluators:
+    - name: input_check
+      type: expected_tool_calls
+```
+
+### Multiple Evaluators
+Define multiple evaluators to run sequentially. The final score is a weighted average of all results.
 
 ```yaml
 execution:
@@ -119,23 +153,6 @@ execution:
 
 See `references/composite-evaluator.md` for aggregation types and patterns.
 
-### Tool Trajectory Evaluator
-Validate agent tool usage from trace data:
-
-```yaml
-execution:
-  evaluators:
-    - name: workflow-check
-      type: tool_trajectory
-      mode: in_order # or: any_order, exact
-      expected:
-        - tool: fetchData
-        - tool: processData
-        - tool: saveResults
-```
-
-See `references/tool-trajectory-evaluator.md` for modes and configuration.
-
 ## Example
 ```yaml
 $schema: agentv-eval-v2
@@ -78,13 +78,12 @@ evalcases:
     execution:
       evaluators:
         - name: json_format_validator
-          type: code
+          type: code_judge
           script: uv run validate_json.py
           cwd: ./evaluators
         - name: content_evaluator
           type: llm_judge
           prompt: ./judges/semantic_correctness.md
-          model: gpt-5-chat
 
     input_messages:
       - role: user
@@ -102,6 +101,99 @@ evalcases:
         }
 ```
 
+## Tool Trajectory Evaluation
+
+Validate that an agent uses specific tools during execution.
+
+```yaml
+$schema: agentv-eval-v2
+description: Tool usage validation
+target: mock_agent
+
+evalcases:
+  # Validate minimum tool usage (order doesn't matter)
+  - id: research-depth
+    expected_outcome: Agent researches thoroughly
+    input_messages:
+      - role: user
+        content: Research REST vs GraphQL
+    execution:
+      evaluators:
+        - name: research-check
+          type: tool_trajectory
+          mode: any_order
+          minimums:
+            knowledgeSearch: 2
+            documentRetrieve: 1
+
+  # Validate exact tool sequence
+  - id: auth-flow
+    expected_outcome: Agent follows auth sequence
+    input_messages:
+      - role: user
+        content: Authenticate user
+    execution:
+      evaluators:
+        - name: auth-sequence
+          type: tool_trajectory
+          mode: exact
+          expected:
+            - tool: checkCredentials
+            - tool: generateToken
+```
+
+## Expected Messages with Tool Calls
+
+Validate precise tool inputs inline with expected messages.
+
+```yaml
+$schema: agentv-eval-v2
+description: Tool input validation
+target: mock_agent
+
+evalcases:
+  - id: precise-inputs
+    expected_outcome: Agent calls tools with correct parameters
+    input_messages:
+      - role: user
+        content: Check CPU metrics for prod-1
+    expected_messages:
+      - role: assistant
+        content: Checking metrics...
+        tool_calls:
+          - tool: getCpuMetrics
+            input: { server: "prod-1" }
+    execution:
+      evaluators:
+        - name: input-validator
+          type: expected_tool_calls
+```
+
+## Static Trace Evaluation
+
+Evaluate pre-existing trace files without running an agent.
+
+```yaml
+$schema: agentv-eval-v2
+description: Static trace evaluation
+target: static_trace
+
+evalcases:
+  - id: validate-trace-file
+    expected_outcome: Trace contains required steps
+    input_messages:
+      - role: user
+        content: Analyze trace
+    execution:
+      evaluators:
+        - name: trace-check
+          type: tool_trajectory
+          mode: in_order
+          expected:
+            - tool: webSearch
+            - tool: readFile
+```
+
 ## Multi-Turn Conversation (Single Eval Case)
 
 ```yaml
@@ -76,7 +76,7 @@ execution:
 - Strict protocol validation
 - Regression testing specific behavior
 
-## Expected Messages Evaluator
+## Expected Tool Calls Evaluator
 
 For simpler cases, specify tool_calls inline in `expected_messages`:
 
@@ -84,11 +84,11 @@ For simpler cases, specify tool_calls inline in `expected_messages`:
 evalcases:
   - id: research-task
     expected_outcome: Agent searches and retrieves documents
-
+
     input_messages:
       - role: user
         content: Research REST vs GraphQL differences
-
+
     expected_messages:
       - role: assistant
         content: I'll research this topic.
@@ -96,11 +96,11 @@ evalcases:
           - tool: knowledgeSearch
           - tool: knowledgeSearch
          - tool: documentRetrieve
-
+
     execution:
       evaluators:
         - name: tool-validator
-          type: expected_messages
+          type: expected_tool_calls
 ```
 
 ### With Input Matching
@@ -130,7 +130,7 @@ expected_messages:
 | `in_order` | (matched tools in sequence) / (expected tools count) |
 | `exact` | (correctly positioned tools) / (expected tools count) |
 
-### expected_messages Scoring
+### expected_tool_calls Scoring
 
 Sequential matching: `(matched tool_calls) / (expected tool_calls)`
 
@@ -215,7 +215,7 @@ evalcases:
     execution:
       evaluators:
         - name: pipeline-check
-          type: expected_messages
+          type: expected_tool_calls
 ```
 
 ## CLI Options for Traces
@@ -234,4 +234,4 @@ agentv eval evals/test.yaml --include-trace
 2. **Start with any_order** - Then tighten to `in_order` or `exact` as needed
 3. **Combine with other evaluators** - Use tool trajectory for execution, LLM judge for output quality
 4. **Test with --dump-traces** - Inspect actual traces to understand agent behavior
-5. **Use expected_messages for simple cases** - It's more readable for basic tool validation
+5. **Use expected_tool_calls for simple cases** - It's more readable for basic tool validation
package/package.json CHANGED
@@ -1,6 +1,6 @@
 {
   "name": "agentv",
-  "version": "0.25.0",
+  "version": "1.0.0",
   "description": "CLI entry point for AgentV",
   "type": "module",
   "repository": {
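The documentation changes above describe a stdin/stdout contract for `code_judge` evaluator scripts (JSON in with `question`, `expected_outcome`, `candidate_answer`, etc.; JSON out with `score`, `hits`, `misses`, `reasoning`). A minimal illustrative sketch of a script satisfying that contract is shown below; the substring-match scoring logic is invented for illustration and is not AgentV's actual template.

```python
# Illustrative code_judge evaluator: field names follow the contract in the
# diff above; the scoring rule (substring match) is a placeholder.
import json
import sys


def evaluate(payload: dict) -> dict:
    """Score 1.0 if the candidate answer mentions the expected outcome."""
    candidate = payload.get("candidate_answer", "")
    expected = payload.get("expected_outcome", "")
    hit = expected.lower() in candidate.lower()
    return {
        "score": 1.0 if hit else 0.0,
        "hits": [expected] if hit else [],
        "misses": [] if hit else [expected],
        "reasoning": "substring match against expected_outcome (illustrative)",
    }


if __name__ == "__main__":
    # Read the eval-case JSON from stdin and emit the result JSON on stdout.
    raw = sys.stdin.read()
    if raw.strip():
        print(json.dumps(evaluate(json.loads(raw))))
```

Such a script would be wired up via `type: code_judge` with `script: uv run validate_json.py` (or plain `python`), as in the `json_format_validator` example in the diff.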