agentv 1.6.1 → 2.0.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
package/README.md CHANGED
@@ -129,9 +129,6 @@ agentv eval --target vscode_projectx --targets "path/to/targets.yaml" --eval-id
  - `--agent-timeout SECONDS`: Timeout in seconds for agent response polling (default: 120)
  - `--max-retries COUNT`: Maximum number of retries for timeout cases (default: 2)
  - `--cache`: Enable caching of LLM responses (default: disabled)
- - `--dump-prompts`: Save all prompts to `.agentv/prompts/` directory
- - `--dump-traces`: Write trace files to `.agentv/traces/` directory
- - `--include-trace`: Include full trace in result output (verbose)
  - `--workers COUNT`: Parallel workers for eval cases (default: 3; target `workers` setting used when provided)
  - `--verbose`: Verbose output

@@ -297,45 +294,13 @@ Code evaluators receive input via stdin and write output to stdout as JSON.
  - Score range: `0.0` to `1.0` (float)
  - `hits` and `misses` are optional but recommended for debugging

- ### Code Evaluator Script Template
-
- ```python
- #!/usr/bin/env python3
- import json
- import sys
-
- def evaluate(input_data):
-     # Extract only the fields you need
-     candidate_answer = input_data.get("candidate_answer", "")
-
-     # Your validation logic here
-     score = 0.0  # to 1.0
-     hits = ["successful check 1", "successful check 2"]
-     misses = ["failed check 1"]
-     reasoning = "Explanation of score"
-
-     return {
-         "score": score,
-         "hits": hits,
-         "misses": misses,
-         "reasoning": reasoning
-     }
-
- if __name__ == "__main__":
-     try:
-         input_data = json.loads(sys.stdin.read())
-         result = evaluate(input_data)
-         print(json.dumps(result, indent=2))
-     except Exception as e:
-         error_result = {
-             "score": 0.0,
-             "hits": [],
-             "misses": [f"Evaluator error: {str(e)}"],
-             "reasoning": f"Evaluator error: {str(e)}"
-         }
-         print(json.dumps(error_result, indent=2))
-         sys.exit(1)
- ```
+ ### Code Evaluator Templates
+
+ Custom evaluators can be written in any language. For complete templates and examples:
+
+ - **Python template**: See `apps/cli/src/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md`
+ - **TypeScript template (with SDK)**: See `apps/cli/src/templates/.claude/skills/agentv-eval-builder/references/custom-evaluators.md`
+ - **Working examples**: See [examples/features/code-judge-sdk](examples/features/code-judge-sdk)

  ### LLM Judge Template Structure
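
For readers of this diff: the unchanged context above still states that code evaluators read JSON on stdin and write a JSON result to stdout; only the README's inline template was replaced with pointers to external template files. A condensed sketch of that contract, derived from the removed template in the hunk above (the `candidate_answer` field and the placeholder non-empty check are illustrative, not prescribed by the package):

```python
#!/usr/bin/env python3
# Minimal sketch of the code-evaluator contract shown in the removed README
# template: read a JSON object from stdin, print a JSON result with `score`
# (float 0.0-1.0) and optional `hits`, `misses`, `reasoning` to stdout.
import json
import sys

def evaluate(input_data: dict) -> dict:
    candidate_answer = input_data.get("candidate_answer", "")
    passed = bool(candidate_answer.strip())  # placeholder check; real validation logic goes here
    return {
        "score": 1.0 if passed else 0.0,
        "hits": ["non-empty answer"] if passed else [],
        "misses": [] if passed else ["empty answer"],
        "reasoning": "Checked that the candidate answer is non-empty.",
    }

if __name__ == "__main__":
    print(json.dumps(evaluate(json.loads(sys.stdin.read())), indent=2))
```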