agentv 1.3.1 → 1.6.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -1,179 +1,224 @@
1
- # Tool Trajectory Evaluator Guide
2
-
3
- Tool trajectory evaluators validate that an agent used the expected tools during execution. They work with trace data returned by agent providers (codex, vscode, cli with trace support).
4
-
5
- ## Tool Trajectory Evaluator
6
-
7
- ### Modes
8
-
9
- #### 1. `any_order` - Minimum Tool Counts
10
-
11
- Validates that each tool was called at least N times, regardless of order:
12
-
13
- ```yaml
14
- execution:
15
- evaluators:
16
- - name: tool-usage
17
- type: tool_trajectory
18
- mode: any_order
19
- minimums:
20
- knowledgeSearch: 2 # Must be called at least twice
21
- documentRetrieve: 1 # Must be called at least once
22
- ```
23
-
24
- **Use cases:**
25
- - Ensure required tools are used
26
- - Don't care about execution order
27
- - Allow flexibility in agent implementation
28
-
29
- #### 2. `in_order` - Sequential Matching
30
-
31
- Validates tools appear in the expected sequence, but allows gaps (other tools can appear between):
32
-
33
- ```yaml
34
- execution:
35
- evaluators:
36
- - name: workflow-sequence
37
- type: tool_trajectory
38
- mode: in_order
39
- expected:
40
- - tool: fetchData
41
- - tool: validateSchema
42
- - tool: transformData
43
- - tool: saveResults
44
- ```
45
-
46
- **Use cases:**
47
- - Validate logical workflow order
48
- - Allow agent to use additional helper tools
49
- - Check that key steps happen in sequence
50
-
51
- #### 3. `exact` - Strict Sequence Match
52
-
53
- Validates the exact tool sequence with no gaps or extra tools:
54
-
55
- ```yaml
56
- execution:
57
- evaluators:
58
- - name: auth-sequence
59
- type: tool_trajectory
60
- mode: exact
61
- expected:
62
- - tool: checkCredentials
63
- - tool: generateToken
64
- - tool: auditLog
65
- ```
66
-
67
- **Use cases:**
68
- - Security-critical workflows
69
- - Strict protocol validation
70
- - Regression testing specific behavior
71
-
72
- ## Scoring
73
-
74
- ### tool_trajectory Scoring
75
-
76
- | Mode | Score Calculation |
77
- |------|------------------|
78
- | `any_order` | (tools meeting minimum) / (total tools with minimums) |
79
- | `in_order` | (matched tools in sequence) / (expected tools count) |
80
- | `exact` | (correctly positioned tools) / (expected tools count) |
81
-
82
- ## Trace Data Requirements
83
-
84
- Tool trajectory evaluators require trace data from the agent provider. Supported providers:
85
-
86
- - **codex** - Returns trace via JSONL log events
87
- - **vscode / vscode-insiders** - Returns trace from Copilot execution
88
- - **cli** - Can return trace if agent outputs trace format
89
-
90
- ### Trace Event Structure
91
-
92
- ```json
93
- {
94
- "type": "tool_call",
95
- "name": "knowledgeSearch",
96
- "input": { "query": "REST vs GraphQL" },
97
- "timestamp": "2024-01-15T10:30:00Z"
98
- }
99
- ```
100
-
101
- ## Complete Examples
102
-
103
- ### Research Agent Validation
104
-
105
- ```yaml
106
- $schema: agentv-eval-v2
107
- description: Validate research agent tool usage
108
-
109
- target: codex_agent # Provider that returns traces
110
-
111
- evalcases:
112
- - id: comprehensive-research
113
- expected_outcome: Agent thoroughly researches the topic
114
-
115
- input_messages:
116
- - role: user
117
- content: Research machine learning frameworks
118
-
119
- execution:
120
- evaluators:
121
- # Check minimum tool usage
122
- - name: coverage
123
- type: tool_trajectory
124
- mode: any_order
125
- minimums:
126
- webSearch: 1
127
- documentRead: 2
128
- noteTaking: 1
129
-
130
- # Check workflow order
131
- - name: workflow
132
- type: tool_trajectory
133
- mode: in_order
134
- expected:
135
- - tool: webSearch
136
- - tool: documentRead
137
- - tool: summarize
138
- ```
139
-
140
- ### Multi-Step Pipeline
141
-
142
- ```yaml
143
- evalcases:
144
- - id: data-pipeline
145
- expected_outcome: Process data through complete pipeline
146
-
147
- input_messages:
148
- - role: user
149
- content: Process the customer dataset
150
-
151
- execution:
152
- evaluators:
153
- - name: pipeline-check
154
- type: tool_trajectory
155
- mode: exact
156
- expected:
157
- - tool: loadData
158
- - tool: validate
159
- - tool: transform
160
- - tool: export
161
- ```
162
-
163
- ## CLI Options for Traces
164
-
165
- ```bash
166
- # Write trace files to disk
167
- agentv eval evals/test.yaml --dump-traces
168
-
169
- # Include full trace in result output
170
- agentv eval evals/test.yaml --include-trace
171
- ```
172
-
173
- ## Best Practices
174
-
175
- 1. **Choose the right mode** - Use `any_order` for flexibility, `exact` for strict validation
176
- 2. **Start with any_order** - Then tighten to `in_order` or `exact` as needed
177
- 3. **Combine with other evaluators** - Use tool trajectory for execution, LLM judge for output quality
178
- 4. **Test with --dump-traces** - Inspect actual traces to understand agent behavior
179
- 5. **Use code evaluators for custom validation** - Write custom tool validation scripts with access to trace data
1
+ # Tool Trajectory Evaluator Guide
2
+
3
+ Tool trajectory evaluators validate that an agent used the expected tools during execution. They work with trace data returned by agent providers (codex, vscode, cli with trace support).
4
+
5
+ ## Tool Trajectory Evaluator
6
+
7
+ ### Modes
8
+
9
+ #### 1. `any_order` - Minimum Tool Counts
10
+
11
+ Validates that each tool was called at least N times, regardless of order:
12
+
13
+ ```yaml
14
+ execution:
15
+ evaluators:
16
+ - name: tool-usage
17
+ type: tool_trajectory
18
+ mode: any_order
19
+ minimums:
20
+ knowledgeSearch: 2 # Must be called at least twice
21
+ documentRetrieve: 1 # Must be called at least once
22
+ ```
23
+
24
+ **Use cases:**
25
+ - Ensure required tools are used
26
+ - Don't care about execution order
27
+ - Allow flexibility in agent implementation
28
+
29
+ #### 2. `in_order` - Sequential Matching
30
+
31
+ Validates that tools appear in the expected sequence, while allowing gaps (other tools may appear in between):
32
+
33
+ ```yaml
34
+ execution:
35
+ evaluators:
36
+ - name: workflow-sequence
37
+ type: tool_trajectory
38
+ mode: in_order
39
+ expected:
40
+ - tool: fetchData
41
+ - tool: validateSchema
42
+ - tool: transformData
43
+ - tool: saveResults
44
+ ```
45
+
46
+ **Use cases:**
47
+ - Validate logical workflow order
48
+ - Allow agent to use additional helper tools
49
+ - Check that key steps happen in sequence
50
+
51
+ #### Argument Matching
52
+
53
+ For `in_order` and `exact` modes (the latter is described below), you can optionally validate tool arguments:
54
+
55
+ ```yaml
56
+ execution:
57
+ evaluators:
58
+ - name: search-validation
59
+ type: tool_trajectory
60
+ mode: in_order
61
+ expected:
62
+ # Partial match - only specified keys are checked
63
+ - tool: search
64
+ args: { query: "machine learning" }
65
+
66
+ # Skip argument validation for this tool
67
+ - tool: process
68
+ args: any
69
+
70
+ # No args field = no argument validation (same as args: any)
71
+ - tool: saveResults
72
+ ```
73
+
74
+ **Argument matching modes:**
75
+ - `args: { key: value }` - Partial deep equality (only specified keys are checked)
76
+ - `args: any` - Skip argument validation
77
+ - No `args` field - Same as `args: any`
78
+
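+ As a rough illustration of what these three modes imply, here is a minimal TypeScript sketch; `matchesArgs` is a hypothetical helper, not agentv's actual implementation:
+
+ ```typescript
+ // Hypothetical helper illustrating the three argument-matching modes above.
+ type ExpectedArgs = Record<string, unknown> | "any" | undefined;
+
+ function matchesArgs(expected: ExpectedArgs, actual: Record<string, unknown>): boolean {
+   // No `args` field or `args: any` -> skip argument validation.
+   if (expected === undefined || expected === "any") return true;
+   // Partial match: each specified key must deep-equal the actual value;
+   // extra keys in `actual` are ignored. JSON comparison is good enough
+   // for the JSON-like values tool calls carry.
+   return Object.entries(expected).every(
+     ([key, value]) => JSON.stringify(actual[key]) === JSON.stringify(value),
+   );
+ }
+ ```
+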
79
+ #### 3. `exact` - Strict Sequence Match
80
+
81
+ Validates the exact tool sequence with no gaps or extra tools:
82
+
83
+ ```yaml
84
+ execution:
85
+ evaluators:
86
+ - name: auth-sequence
87
+ type: tool_trajectory
88
+ mode: exact
89
+ expected:
90
+ - tool: checkCredentials
91
+ - tool: generateToken
92
+ - tool: auditLog
93
+ ```
94
+
95
+ **Use cases:**
96
+ - Security-critical workflows
97
+ - Strict protocol validation
98
+ - Regression testing specific behavior
99
+
100
+ ## Scoring
101
+
102
+ ### tool_trajectory Scoring
103
+
104
+ | Mode | Score Calculation |
105
+ |------|------------------|
106
+ | `any_order` | (tools meeting minimum) / (total tools with minimums) |
107
+ | `in_order` | (matched tools in sequence) / (expected tools count) |
108
+ | `exact` | (correctly positioned tools) / (expected tools count) |
109
+
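+ As a rough illustration of these formulas, the TypeScript sketch below computes the `any_order` and `in_order` scores from a flat list of tool-call names (assumed semantics, not agentv source):
+
+ ```typescript
+ // Illustrative scoring sketch -- assumed semantics, not agentv source.
+
+ // any_order: fraction of tools whose observed call count meets its minimum.
+ function anyOrderScore(calls: string[], minimums: Record<string, number>): number {
+   const counts = new Map<string, number>();
+   for (const name of calls) counts.set(name, (counts.get(name) ?? 0) + 1);
+   const tools = Object.entries(minimums);
+   if (tools.length === 0) return 1;
+   const met = tools.filter(([name, min]) => (counts.get(name) ?? 0) >= min).length;
+   return met / tools.length;
+ }
+
+ // in_order: greedy subsequence match -- gaps allowed, order enforced.
+ function inOrderScore(calls: string[], expected: string[]): number {
+   let matched = 0;
+   for (const name of calls) {
+     if (matched < expected.length && name === expected[matched]) matched += 1;
+   }
+   return matched / expected.length;
+ }
+ ```
+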
110
+ ## Trace Data Requirements
111
+
112
+ Tool trajectory evaluators require trace data from the agent provider. Providers return `output_messages` containing `tool_calls` that capture agent tool usage.
113
+
114
+ ### Output Messages Format
115
+
116
+ Each result record in the JSONL output includes `output_messages`, and each message's `tool_calls` array records the tools the agent invoked:
117
+
118
+ ```json
119
+ {
120
+ "id": "eval-001",
121
+ "output_messages": [
122
+ {
123
+ "role": "assistant",
124
+ "content": "I'll search for information about this topic.",
125
+ "tool_calls": [
126
+ {
127
+ "tool": "knowledgeSearch",
128
+ "input": { "query": "REST vs GraphQL" },
129
+ "output": { "results": [...] },
130
+ "id": "call_123",
131
+ "timestamp": "2024-01-15T10:30:00Z"
132
+ }
133
+ ]
134
+ }
135
+ ]
136
+ }
137
+ ```
138
+
139
+ The evaluator extracts tool calls from `output_messages[].tool_calls[]`. Optional fields `id` and `timestamp` can be included for debugging.
140
+
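+ A tiny TypeScript sketch of that extraction (the field shapes follow the JSON example above; the helper itself is hypothetical):
+
+ ```typescript
+ // Flatten tool calls out of output_messages -- shapes follow the example above.
+ interface ToolCall { tool: string; input?: unknown; output?: unknown; id?: string; timestamp?: string }
+ interface OutputMessage { role: string; content?: string; tool_calls?: ToolCall[] }
+
+ function extractTrajectory(messages: OutputMessage[]): ToolCall[] {
+   // Messages without tool_calls contribute nothing to the trajectory.
+   return messages.flatMap((message) => message.tool_calls ?? []);
+ }
+ ```
+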
141
+ ### Supported Providers
142
+
143
+ - **codex** - Returns `output_messages` via JSONL log events
144
+ - **vscode / vscode-insiders** - Returns `output_messages` from Copilot execution
145
+ - **cli** - Returns `output_messages` with `tool_calls`
146
+
147
+ ## Complete Examples
148
+
149
+ ### Research Agent Validation
150
+
151
+ ```yaml
152
+ description: Validate research agent tool usage
153
+ execution:
154
+ target: codex_agent # Provider that returns traces
155
+
156
+ evalcases:
157
+ - id: comprehensive-research
158
+ expected_outcome: Agent thoroughly researches the topic
159
+
160
+ input_messages:
161
+ - role: user
162
+ content: Research machine learning frameworks
163
+
164
+ execution:
165
+ evaluators:
166
+ # Check minimum tool usage
167
+ - name: coverage
168
+ type: tool_trajectory
169
+ mode: any_order
170
+ minimums:
171
+ webSearch: 1
172
+ documentRead: 2
173
+ noteTaking: 1
174
+
175
+ # Check workflow order
176
+ - name: workflow
177
+ type: tool_trajectory
178
+ mode: in_order
179
+ expected:
180
+ - tool: webSearch
181
+ - tool: documentRead
182
+ - tool: summarize
183
+ ```
184
+
185
+ ### Multi-Step Pipeline
186
+
187
+ ```yaml
188
+ evalcases:
189
+ - id: data-pipeline
190
+ expected_outcome: Process data through complete pipeline
191
+
192
+ input_messages:
193
+ - role: user
194
+ content: Process the customer dataset
195
+
196
+ execution:
197
+ evaluators:
198
+ - name: pipeline-check
199
+ type: tool_trajectory
200
+ mode: exact
201
+ expected:
202
+ - tool: loadData
203
+ - tool: validate
204
+ - tool: transform
205
+ - tool: export
206
+ ```
207
+
208
+ ## CLI Options for Traces
209
+
210
+ ```bash
211
+ # Write trace files to disk
212
+ agentv eval evals/test.yaml --dump-traces
213
+
214
+ # Include full trace in result output
215
+ agentv eval evals/test.yaml --include-trace
216
+ ```
217
+
218
+ ## Best Practices
219
+
220
+ 1. **Choose the right mode** - Use `any_order` for flexibility, `exact` for strict validation
221
+ 2. **Start with any_order** - Then tighten to `in_order` or `exact` as needed
222
+ 3. **Combine with other evaluators** - Use tool trajectory for execution, LLM judge for output quality
223
+ 4. **Test with --dump-traces** - Inspect actual traces to understand agent behavior
224
+ 5. **Use code evaluators for custom validation** - Write custom tool validation scripts with access to trace data
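+
+ For example, a custom script might assert ordering constraints the built-in modes cannot express. The sketch below is hypothetical: agentv's code-evaluator contract is not covered in this guide, so only the trace shape and tool names follow the examples above.
+
+ ```typescript
+ // Hypothetical custom check: `validate` must run, and must precede any `export`.
+ // The function signature is an assumption; only the trace shape follows this guide.
+ interface Call { tool: string }
+
+ function validateRanBeforeExport(calls: Call[]): boolean {
+   const validateAt = calls.findIndex((c) => c.tool === "validate");
+   const exportAt = calls.findIndex((c) => c.tool === "export");
+   return validateAt !== -1 && (exportAt === -1 || validateAt < exportAt);
+ }
+ ```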
@@ -1,77 +1,77 @@
1
- ---
2
- description: Iteratively optimize prompt files against AgentV evaluation datasets by analyzing failures and refining instructions.
3
- ---
4
-
5
- # AgentV Prompt Optimizer
6
-
7
- ## Input Variables
8
- - `eval-path`: Path or glob pattern to the AgentV evaluation file(s) to optimize against
9
- - `optimization-log-path` (optional): Path where optimization progress should be logged
10
-
11
- ## Workflow
12
-
13
- 1. **Initialize**
14
- - Verify `<eval-path>` (file or glob) targets the correct system.
15
- - **Identify Prompt Files**:
16
- - Infer prompt files from the eval file content (look for `file:` references in `input_messages` that match these patterns).
17
- - Recursively check referenced prompt files for *other* prompt references (dependencies).
18
- - If multiple prompts are found, consider ALL of them as candidates for optimization.
19
- - **Identify Optimization Log**:
20
- - If `<optimization-log-path>` is provided, use it.
21
- - If not, create a new one in the parent directory of the eval files: `optimization-[timestamp].md`.
22
- - Read content of the identified prompt file.
23
-
24
- 2. **Optimization Loop** (Max 10 iterations)
25
- - **Execute (The Generator)**: Run `agentv eval <eval-path>`.
26
- - *Targeted Run*: If iterating on specific stubborn failures, use `--eval-id <case_id>` to run only the relevant eval cases.
27
- - **Analyze (The Reflector)**:
28
- - Locate the results file path from the console output (e.g., `.agentv/results/eval_...jsonl`).
29
- - **Orchestrate Subagent**: Use `runSubagent` to analyze the results.
30
- - **Task**: Read the results file, calculate pass rate, and perform root cause analysis.
31
- - **Output**: Return a structured analysis including:
32
- - **Score**: Current pass rate.
33
- - **Root Cause**: Why failures occurred (e.g., "Ambiguous definition", "Hallucination").
34
- - **Insight**: Key learning or pattern identified from the failures.
35
- - **Strategy**: High-level plan to fix the prompt (e.g., "Clarify section X", "Add negative constraint").
36
- - **Decide**:
37
- - If **100% pass**: STOP and report success.
38
- - If **Score decreased**: Revert last change, try different approach.
39
- - If **No improvement** (2x): STOP and report stagnation.
40
- - **Refine (The Curator)**:
41
- - **Orchestrate Subagent**: Use `runSubagent` to apply the fix.
42
- - **Task**: Read the relevant prompt file(s), apply the **Strategy** from the Reflector, and generate the log entry.
43
- - **Output**: The **Log Entry** describing the specific operation performed.
44
- ```markdown
45
- ### Iteration [N]
46
- - **Operation**: [ADD / UPDATE / DELETE]
47
- - **Target**: [Section Name]
48
- - **Change**: [Specific text added/modified]
49
- - **Trigger**: [Specific failing test case or error pattern]
50
- - **Rationale**: [From Reflector: Root Cause]
51
- - **Score**: [From Reflector: Current Pass Rate]
52
- - **Insight**: [From Reflector: Key Learning]
53
- ```
54
- - **Strategy**: Treat the prompt as a structured set of rules. Execute atomic operations:
55
- - **ADD**: Insert a new rule if a constraint was missed.
56
- - **UPDATE**: Refine an existing rule to be clearer or more general.
57
- - *Clarify*: Make ambiguous instructions specific.
58
- - *Generalize*: Refactor specific fixes into high-level principles (First Principles).
59
- - **DELETE**: Remove obsolete, redundant, or harmful rules.
60
- - *Prune*: If a general rule covers specific cases, delete the specific ones.
61
- - **Negative Constraint**: If hallucinating, explicitly state what NOT to do. Prefer generalized prohibitions over specific forbidden tokens where possible.
62
- - **Safety Check**: Ensure new rules don't contradict existing ones (unless intended).
63
- - **Constraint**: Avoid rewriting large sections. Make surgical, additive changes to preserve existing behavior.
64
- - **Log Result**:
65
- - Append the **Log Entry** returned by the Curator to the optimization log file.
66
-
67
- 3. **Completion**
68
- - Report final score.
69
- - Summarize key changes made to the prompt.
70
- - **Finalize Optimization Log**: Add a summary header to the optimization log file indicating the session completion and final score.
71
-
72
- ## Guidelines
73
- - **Generalization First**: Prefer broad, principle-based guidelines over specific examples or "hotfixes". Only use specific rules if generalized instructions fail to achieve the desired score.
74
- - **Simplicity ("Less is More")**: Avoid overfitting to the test set. If a specific rule doesn't significantly improve the score compared to a general one, choose the general one.
75
- - **Structure**: Maintain existing Markdown headers/sections.
76
- - **Progressive Disclosure**: If the prompt grows too large (>200 lines), consider moving specialized logic into a separate file or skill.
77
- - **Quality Criteria**: Ensure the prompt defines a clear persona, specific task, and measurable success criteria.
1
+ ---
2
+ description: Iteratively optimize prompt files against AgentV evaluation datasets by analyzing failures and refining instructions.
3
+ ---
4
+
5
+ # AgentV Prompt Optimizer
6
+
7
+ ## Input Variables
8
+ - `eval-path`: Path or glob pattern to the AgentV evaluation file(s) to optimize against
9
+ - `optimization-log-path` (optional): Path where optimization progress should be logged
10
+
11
+ ## Workflow
12
+
13
+ 1. **Initialize**
14
+ - Verify `<eval-path>` (file or glob) targets the correct system.
15
+ - **Identify Prompt Files**:
16
+ - Infer prompt files from the eval file content (look for `file:` references in `input_messages` that point to prompt files).
17
+ - Recursively check referenced prompt files for *other* prompt references (dependencies).
18
+ - If multiple prompts are found, consider ALL of them as candidates for optimization.
19
+ - **Identify Optimization Log**:
20
+ - If `<optimization-log-path>` is provided, use it.
21
+ - If not, create a new one in the parent directory of the eval files: `optimization-[timestamp].md`.
22
+ - Read content of the identified prompt file.
23
+
24
+ 2. **Optimization Loop** (Max 10 iterations)
25
+ - **Execute (The Generator)**: Run `agentv eval <eval-path>`.
26
+ - *Targeted Run*: If iterating on specific stubborn failures, use `--eval-id <case_id>` to run only the relevant eval cases.
27
+ - **Analyze (The Reflector)**:
28
+ - Locate the results file path from the console output (e.g., `.agentv/results/eval_...jsonl`).
29
+ - **Orchestrate Subagent**: Use `runSubagent` to analyze the results.
30
+ - **Task**: Read the results file, calculate pass rate, and perform root cause analysis.
31
+ - **Output**: Return a structured analysis including:
32
+ - **Score**: Current pass rate.
33
+ - **Root Cause**: Why failures occurred (e.g., "Ambiguous definition", "Hallucination").
34
+ - **Insight**: Key learning or pattern identified from the failures.
35
+ - **Strategy**: High-level plan to fix the prompt (e.g., "Clarify section X", "Add negative constraint").
36
+ - **Decide**:
37
+ - If **100% pass**: STOP and report success.
38
+ - If **Score decreased**: Revert last change, try different approach.
39
+ - If **No improvement** for 2 consecutive iterations: STOP and report stagnation.
40
+ - **Refine (The Curator)**:
41
+ - **Orchestrate Subagent**: Use `runSubagent` to apply the fix.
42
+ - **Task**: Read the relevant prompt file(s), apply the **Strategy** from the Reflector, and generate the log entry.
43
+ - **Output**: The **Log Entry** describing the specific operation performed.
44
+ ```markdown
45
+ ### Iteration [N]
46
+ - **Operation**: [ADD / UPDATE / DELETE]
47
+ - **Target**: [Section Name]
48
+ - **Change**: [Specific text added/modified]
49
+ - **Trigger**: [Specific failing test case or error pattern]
50
+ - **Rationale**: [From Reflector: Root Cause]
51
+ - **Score**: [From Reflector: Current Pass Rate]
52
+ - **Insight**: [From Reflector: Key Learning]
53
+ ```
54
+ - **Strategy**: Treat the prompt as a structured set of rules. Execute atomic operations:
55
+ - **ADD**: Insert a new rule if a constraint was missed.
56
+ - **UPDATE**: Refine an existing rule to be clearer or more general.
57
+ - *Clarify*: Make ambiguous instructions specific.
58
+ - *Generalize*: Refactor specific fixes into high-level principles (First Principles).
59
+ - **DELETE**: Remove obsolete, redundant, or harmful rules.
60
+ - *Prune*: If a general rule covers specific cases, delete the specific ones.
61
+ - **Negative Constraint**: If hallucinating, explicitly state what NOT to do. Prefer generalized prohibitions over specific forbidden tokens where possible.
62
+ - **Safety Check**: Ensure new rules don't contradict existing ones (unless intended).
63
+ - **Constraint**: Avoid rewriting large sections. Make surgical, additive changes to preserve existing behavior.
64
+ - **Log Result**:
65
+ - Append the **Log Entry** returned by the Curator to the optimization log file.
66
+
67
+ 3. **Completion**
68
+ - Report final score.
69
+ - Summarize key changes made to the prompt.
70
+ - **Finalize Optimization Log**: Add a summary header to the optimization log file indicating the session completion and final score.
71
+
72
+ ## Guidelines
73
+ - **Generalization First**: Prefer broad, principle-based guidelines over specific examples or "hotfixes". Only use specific rules if generalized instructions fail to achieve the desired score.
74
+ - **Simplicity ("Less is More")**: Avoid overfitting to the test set. If a specific rule doesn't significantly improve the score compared to a general one, choose the general one.
75
+ - **Structure**: Maintain existing Markdown headers/sections.
76
+ - **Progressive Disclosure**: If the prompt grows too large (>200 lines), consider moving specialized logic into a separate file or skill.
77
+ - **Quality Criteria**: Ensure the prompt defines a clear persona, specific task, and measurable success criteria.
@@ -1,5 +1,5 @@
1
- ---
2
- description: 'Create and maintain AgentV YAML evaluation files'
3
- ---
4
-
1
+ ---
2
+ description: 'Create and maintain AgentV YAML evaluation files'
3
+ ---
4
+
5
5
  #file:../../.claude/skills/agentv-eval-builder/SKILL.md
@@ -1,4 +1,4 @@
1
- ---
2
- description: Iteratively optimize prompt files against an AgentV evaluation suite
3
- ---
1
+ ---
2
+ description: Iteratively optimize prompt files against an AgentV evaluation suite
3
+ ---
4
4
  #file:../../.claude/skills/agentv-prompt-optimizer/SKILL.md
package/package.json CHANGED
@@ -1,6 +1,6 @@
1
1
  {
2
2
  "name": "agentv",
3
- "version": "1.3.1",
3
+ "version": "1.6.1",
4
4
  "description": "CLI entry point for AgentV",
5
5
  "type": "module",
6
6
  "repository": {