npm - @wix/evalforge-evaluator - Versions diffs - 0.74.0 → 0.76.0 - Mend

@wix/evalforge-evaluator 0.74.0 → 0.76.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (6)

package/README.md ADDED

@@ -0,0 +1,53 @@
+# @wix/evalforge-evaluator
+CLI tool that executes AI agent evaluations. It fetches an eval run configuration from the backend, runs each scenario against a Claude Code agent, streams trace events, runs assertions, and reports results.
+## How It Works
+```
+evaluator <project-id> <eval-run-id>
+```
+1. **Load configuration** from environment variables (server URL, AI Gateway credentials, etc.)
+2. **Fetch evaluation data** from the backend API — eval run, scenarios, agent config, skills, MCPs, sub-agents, and templates
+3. **For each scenario:**
+   - Prepare a working directory (download and extract template)
+   - Write skills to `.claude/skills/<name>/SKILL.md`
+   - Write MCPs to `.mcp.json`
+   - Write sub-agents to `.claude/agents/<name>.md`
+   - Launch the Claude Code agent with the scenario's trigger prompt via `@anthropic-ai/claude-agent-sdk`
+   - Stream trace events back to the backend
+   - Run assertions on the agent's output
+   - Report the scenario result
+4. **Finalize** — set eval run status to `COMPLETED` or `FAILED`
+## Environment Variables
+| Variable | Required | Description |
+|----------|----------|-------------|
+| `EVAL_SERVER_URL` | Yes | Backend server URL for fetching data and reporting results |
+| `AI_GATEWAY_URL` | Yes | AI Gateway base URL for LLM calls |
+| `AI_GATEWAY_HEADERS` | No | Custom headers for AI Gateway (newline-separated `key:value` pairs) |
+| `EVAL_API_PREFIX` | No | API path prefix (e.g., `/api/v1`) |
+| `EVALUATIONS_DIR` | No | Directory for evaluation working directories |
+| `TRACE_PUSH_URL` | No | URL for pushing trace events (remote job execution) |
+| `EVAL_ROUTE_HEADER` | No | `x-wix-route` header for deploy preview routing |
+| `EVAL_AUTH_TOKEN` | No | Bearer token for public endpoint authentication |
+The evaluator is typically launched by the backend (locally or on a remote Dev Machine) with these variables pre-configured.
+## Scripts
+```bash
+yarn build       # Build CJS + ESM + type declarations
+yarn test        # Run tests
+yarn lint        # Run ESLint
+yarn clean       # Remove build artifacts
+```
+## Dependencies
+- `@wix/evalforge-types` — shared type definitions
+- `@wix/eval-assertions` — assertion evaluation framework
+- `@wix/evalforge-github-client` — GitHub API client for fetching skill files
+- `@anthropic-ai/claude-agent-sdk` — Claude Code agent SDK
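
The environment variable table in the added README can be illustrated with a small loader. This is only a sketch under stated assumptions, not the package's actual code: the function names `parseGatewayHeaders` and `loadConfig`, the shape of the returned object, and the choice to split each header line on its first colon are all hypothetical. The variable names and the required/optional split come from the README.

```javascript
// Hypothetical helper: parse AI_GATEWAY_HEADERS, which the README describes
// as newline-separated `key:value` pairs.
function parseGatewayHeaders(raw) {
  const headers = {};
  if (!raw) return headers;
  for (const line of raw.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // ignore blank lines
    // Assumption: split on the FIRST colon so values such as URLs stay intact.
    const idx = trimmed.indexOf(":");
    if (idx <= 0) continue; // skip lines with no key or no colon
    headers[trimmed.slice(0, idx).trim()] = trimmed.slice(idx + 1).trim();
  }
  return headers;
}

// Hypothetical loader: validates the two variables the README marks as
// required and passes the optional ones through unchanged.
function loadConfig(env) {
  const required = ["EVAL_SERVER_URL", "AI_GATEWAY_URL"];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(", ")}`
    );
  }
  return {
    serverUrl: env.EVAL_SERVER_URL,
    aiGatewayUrl: env.AI_GATEWAY_URL,
    aiGatewayHeaders: parseGatewayHeaders(env.AI_GATEWAY_HEADERS),
    apiPrefix: env.EVAL_API_PREFIX ?? "",
    evaluationsDir: env.EVALUATIONS_DIR,
    tracePushUrl: env.TRACE_PUSH_URL,
    routeHeader: env.EVAL_ROUTE_HEADER,
    authToken: env.EVAL_AUTH_TOKEN,
  };
}
```

In a Node.js process this would be called as `loadConfig(process.env)`. Failing fast on missing required variables is a design choice of the sketch; the diff does not show how the package handles that case.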

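The per-scenario file layout listed in step 3 of the README can be sketched as a pure function. The helper names `planAgentFiles` and `isInsideSkillsDir` are hypothetical and do not appear in the package. The sketch also assumes `.mcp.json` is always written, which the README does not confirm. The second helper mirrors the new prompt guideline 7 in `package/build/index.js`, which tells the agent not to create source files inside `.claude/skills/`.

```javascript
// Hypothetical helper: relative paths the README says are written into the
// scenario's working directory before the agent is launched.
function planAgentFiles(skillNames, subAgentNames) {
  return [
    ...skillNames.map((name) => `.claude/skills/${name}/SKILL.md`),
    ".mcp.json", // assumption: written even when no MCPs are configured
    ...subAgentNames.map((name) => `.claude/agents/${name}.md`),
  ];
}

// Hypothetical check: does a project-relative path fall under the skills
// directory that guideline 7 tells the agent to avoid for source files?
function isInsideSkillsDir(relativePath) {
  const normalized = relativePath.replace(/\\/g, "/").replace(/^\.\//, "");
  return normalized.startsWith(".claude/skills/");
}
```

A check like `isInsideSkillsDir` could be used in an assertion to flag runs where the agent wrote source files in the wrong place. Whether the package does anything of the kind is not visible in this diff.
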
package/build/index.js CHANGED

@@ -1140,7 +1140,8 @@ IMPORTANT: This is an automated evaluation run. Follow these guidelines:
 3. Do NOT use the Task tool to delegate simple operations - do them directly yourself.
 4. Keep your approach simple and direct - avoid excessive planning.
 5. Make targeted edits using Read and Edit tools rather than exploring the entire codebase.
-6. If you encounter an error, fix it directly rather than starting over.`;
+6. If you encounter an error, fix it directly rather than starting over.
+7. Your project root is the current working directory. Always create and modify source code files relative to the project root, NOT inside .claude/skills/ directories.`;
       const fullPrompt = scenario.triggerPrompt + evaluatorPromptSuffix;
       for await (const message of query({
         prompt: fullPrompt,