agentv 0.2.8 → 0.5.0

package/README.md CHANGED
@@ -1,6 +1,6 @@
  # AgentV

- A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, Azure OpenAI, Anthropic, and Google Gemini.
+ A TypeScript-based AI agent evaluation and optimization framework using YAML specifications to score task completion. Built for modern development workflows with first-class support for VS Code Copilot, OpenAI Codex CLI, and Azure OpenAI.

  ## Installation and Setup

@@ -76,7 +76,7 @@ You are now ready to start development. The monorepo contains:

  ### Configuring Guideline Patterns

- AgentV automatically detects guideline files (instructions, prompts) and treats them differently from regular file content. You can customize which files are considered guidelines using an optional `.agentv/config.yaml` configuration file.
+ AgentV automatically detects guideline files and treats them differently from regular file content. You can customize which files are considered guidelines using an optional `.agentv/config.yaml` configuration file.

  **Config file discovery:**
  - AgentV searches for `.agentv/config.yaml` starting from the eval file's directory
@@ -84,16 +84,6 @@ AgentV automatically detects guideline files (instructions, prompts) and treats
  - Uses the first config file found (similar to how `targets.yaml` is discovered)
  - This allows you to place one config file at the project root for all evals
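
For instance, a single root-level config can cover every eval in a layout like this (paths are illustrative):

```
repo/
├── .agentv/
│   └── config.yaml      # first config found when searching up from the eval file
└── evals/
    └── projectx/
        └── example.yaml
```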

- **Default patterns** (used when `.agentv/config.yaml` is absent):
-
- ```yaml
- guideline_patterns:
- - "**/*.instructions.md"
- - "**/instructions/**"
- - "**/*.prompt.md"
- - "**/prompts/**"
- ```
-
  **Custom patterns** (create `.agentv/config.yaml` in the same directory as your eval file):

  ```yaml
@@ -105,13 +95,6 @@ guideline_patterns:
  - "**/*.rules.md" # Match by naming convention
  ```

- **How it works:**
-
- - Files matching guideline patterns are loaded as separate guideline context
- - Files NOT matching are treated as regular file content in user messages
- - Patterns use standard glob syntax (via [micromatch](https://github.com/micromatch/micromatch))
- - Paths are normalized to forward slashes for cross-platform compatibility
-
  See [config.yaml example](docs/examples/simple/.agentv/config.yaml) for more pattern examples.

  ### Validating Eval Files
@@ -200,18 +183,6 @@ Output goes to `.agentv/results/{evalname}_{timestamp}.jsonl` (or `.yaml`) unles

  **Recommended Models:** Use Claude Sonnet 4.5 or Grok Code Fast 1 for best results, as these models are more consistent in following instruction chains.

- ## Requirements
-
- - Node.js 20.0.0 or higher
- - Environment variables for your chosen providers (configured via targets.yaml)
-
- Environment keys (configured via targets.yaml):
-
- - **Azure OpenAI:** Set environment variables specified in your target's `settings.endpoint`, `settings.api_key`, and `settings.model`
- - **Anthropic Claude:** Set environment variables specified in your target's `settings.api_key` and `settings.model`
- - **Google Gemini:** Set environment variables specified in your target's `settings.api_key` and optional `settings.model`
- - **VS Code:** Set environment variable specified in your target's `settings.workspace_env` → `.code-workspace` path
-
  ## Targets and Environment Variables

  Execution targets in `.agentv/targets.yaml` decouple evals from providers/settings and provide flexible environment variable mapping.
@@ -221,7 +192,7 @@ Execution targets in `.agentv/targets.yaml` decouple evals from providers/settin
  Each target specifies:

  - `name`: Unique identifier for the target
- - `provider`: The model provider (`azure`, `anthropic`, `gemini`, `vscode`, `vscode-insiders`, or `mock`)
+ - `provider`: The model provider (`azure`, `anthropic`, `gemini`, `codex`, `vscode`, `vscode-insiders`, `cli`, or `mock`)
  - `settings`: Environment variable names to use for this target

  ### Examples
@@ -237,40 +208,54 @@ Each target specifies:
      model: "AZURE_DEPLOYMENT_NAME"
  ```
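
Note that the values under `settings` are environment variable names, not the secrets themselves; AgentV resolves the actual values from the environment at run time. A minimal sketch of a run (the target name `azure_base` and the second variable name are illustrative):

```bash
# targets.yaml stores variable NAMES; export the real values before running
export AZURE_DEPLOYMENT_NAME="gpt-4o"        # referenced by `model` above
export AZURE_OPENAI_API_KEY="..."            # illustrative variable name
agentv eval evals/projectx/example.yaml --target azure_base
```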

- **Anthropic targets:**
+ **VS Code targets:**

  ```yaml
- - name: anthropic_base
-   provider: anthropic
+ - name: vscode_projectx
+   provider: vscode
    settings:
-     api_key: "ANTHROPIC_API_KEY"
-     model: "ANTHROPIC_MODEL"
+     workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
+
+ - name: vscode_insiders_projectx
+   provider: vscode-insiders
+   settings:
+     workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
  ```

- **Google Gemini targets:**
+ **CLI targets (template-based):**

  ```yaml
- - name: gemini_base
-   provider: gemini
+ - name: local_cli
+   provider: cli
    settings:
-     api_key: "GOOGLE_API_KEY"
-     model: "GOOGLE_GEMINI_MODEL" # Optional, defaults to gemini-2.0-flash-exp
+     command_template: 'somecommand {PROMPT} {FILES}'
+     files_format: '--file {path}'
+     cwd: PROJECT_ROOT # optional working directory
+     env: # merged into process.env
+       API_TOKEN: LOCAL_AGENT_TOKEN
+     timeout_seconds: 30 # optional per-command timeout
+     healthcheck:
+       type: command # or http
+       command_template: code --version
  ```
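
As a rough illustration of the template mechanics, assuming `{PROMPT}` is replaced with the rendered prompt and `{FILES}` expands by applying `files_format` to each attachment path, the target above might run a command like:

```bash
# Hypothetical expansion for a prompt with two attached files
somecommand "Summarize the attached files" --file docs/a.md --file docs/b.md
```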

- **VS Code targets:**
+ **Codex CLI targets:**

  ```yaml
- - name: vscode_projectx
-   provider: vscode
-   settings:
-     workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
-
- - name: vscode_insiders_projectx
-   provider: vscode-insiders
+ - name: codex_cli
+   provider: codex
    settings:
-     workspace_env: "EVAL_PROJECTX_WORKSPACE_PATH"
+     executable: "CODEX_CLI_PATH" # defaults to `codex` if omitted
+     profile: "CODEX_PROFILE" # matches the profile in ~/.codex/config
+     model: "CODEX_MODEL" # optional, falls back to profile default
+     approval_preset: "CODEX_APPROVAL_PRESET"
+     timeout_seconds: 180
+     cwd: CODEX_WORKSPACE_DIR
  ```

+ Codex targets require the standalone `codex` CLI and a configured profile (via `codex configure`) so credentials are stored in `~/.codex/config` (or whatever path the CLI already uses). AgentV mirrors all guideline and attachment files into a fresh scratch workspace, so the `file://` preread links remain valid even when the CLI runs outside your repo tree.
+ Confirm the CLI works by running `codex exec --json --profile <name> "ping"` (or any supported dry run) before starting an eval. This prints JSONL events; seeing `item.completed` messages indicates the CLI is healthy.
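
A quick smoke test, assuming a profile named `eval` already exists in `~/.codex/config` (the profile name is illustrative):

```bash
# Streams JSONL events; an item.completed event indicates a healthy CLI
codex exec --json --profile eval "ping"
```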
+

  ## Timeout Handling and Retries

  When using VS Code or other AI agents that may experience timeouts, the evaluator includes automatic retry functionality:
@@ -286,22 +271,116 @@ Example with custom timeout settings:
  agentv eval evals/projectx/example.yaml --target vscode_projectx --agent-timeout 180 --max-retries 3
  ```

- ## How the Evals Work
+ ## Writing Custom Evaluators
+
+ ### Code Evaluator I/O Contract
+
+ Code evaluators receive input via stdin and write output to stdout as JSON.
+
+ **Input Format (via stdin):**
+ ```json
+ {
+   "task": "string describing the task",
+   "outcome": "expected outcome description",
+   "expected": "expected output string",
+   "output": "generated code/text from the agent",
+   "system_message": "system message if any",
+   "guideline_paths": ["path1", "path2"],
+   "attachments": ["file1", "file2"],
+   "user_segments": [{"type": "text", "value": "..."}]
+ }
+ ```
+
+ **Output Format (to stdout):**
+ ```json
+ {
+   "score": 0.85,
+   "hits": ["list of successful checks"],
+   "misses": ["list of failed checks"],
+   "reasoning": "explanation of the score"
+ }
+ ```
+
+ **Key Points:**
+ - Evaluators receive **full context** but should select only the relevant fields
+ - Most evaluators only need the `output` field; ignore the rest to avoid false positives
+ - Complex evaluators can use `task`, `expected`, or `guideline_paths` for context-aware validation
+ - Score range: `0.0` to `1.0` (float)
+ - `hits` and `misses` are optional but recommended for debugging
+
+ ### Code Evaluator Script Template
+
+ ```python
+ #!/usr/bin/env python3
+ import json
+ import sys
+
+ def evaluate(input_data):
+     # Extract only the fields you need
+     output = input_data.get("output", "")
+
+     # Your validation logic here
+     score = 0.0  # 0.0 to 1.0
+     hits = ["successful check 1", "successful check 2"]
+     misses = ["failed check 1"]
+     reasoning = "Explanation of score"
+
+     return {
+         "score": score,
+         "hits": hits,
+         "misses": misses,
+         "reasoning": reasoning
+     }
+
+ if __name__ == "__main__":
+     try:
+         input_data = json.loads(sys.stdin.read())
+         result = evaluate(input_data)
+         print(json.dumps(result, indent=2))
+     except Exception as e:
+         error_result = {
+             "score": 0.0,
+             "hits": [],
+             "misses": [f"Evaluator error: {str(e)}"],
+             "reasoning": f"Evaluator error: {str(e)}"
+         }
+         print(json.dumps(error_result, indent=2))
+         sys.exit(1)
+ ```
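
To smoke-test an evaluator locally (the file name is hypothetical), pipe a minimal input document through it and check the JSON result:

```bash
# Only the fields your evaluator reads need to be present
echo '{"output": "some generated text"}' | python3 my_evaluator.py
```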
+
+ ### LLM Judge Template Structure
+
+ ```markdown
+ # Judge Name
+
+ Evaluation criteria and guidelines...
+
+ ## Scoring Guidelines
+ 0.9-1.0: Excellent
+ 0.7-0.8: Good
+ ...
+
+ ## Output Format
+ {
+   "score": 0.85,
+   "passed": true,
+   "reasoning": "..."
+ }
+ ```

- For each eval case in a `.yaml` file:
+ ## Next Steps

- 1. Parse YAML and collect user messages (inline text and referenced files)
- 2. Extract code blocks from text for structured prompting
- 3. Generate a candidate answer via the configured provider/model
- 4. Score against the expected answer using AI-powered quality grading
- 5. Output results in JSONL or YAML format with detailed metrics
+ - Review `docs/examples/simple/evals/example-eval.yaml` to understand the schema
+ - Create your own eval cases following the schema
+ - Write custom evaluator scripts for domain-specific validation
+ - Create LLM judge templates for semantic evaluation
+ - Set up optimizer configs when ready to improve prompts

- ### VS Code Copilot Target
+ ## Resources

- - Opens your configured workspace and uses the `subagent` library to programmatically invoke VS Code Copilot
- - The prompt is built from the `.yaml` user content (task, files, code blocks)
- - Copilot is instructed to complete the task within the workspace context
- - Results are captured and scored automatically
+ - [Simple Example README](docs/examples/simple/README.md)
+ - [Schema Specification](docs/openspec/changes/update-eval-schema-v2/)
+ - [Ax ACE Documentation](https://github.com/ax-llm/ax/blob/main/docs/ACE.md)

  ## Scoring and Outputs