PyPI - agentevals-cli - Versions diffs - 0.5.2__tar.gz → 0.6.0__tar.gz - Mend

agentevals-cli 0.5.2tar.gz → 0.6.0tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (212) hide show

{agentevals_cli-0.5.2 → agentevals_cli-0.6.0}/.github/workflows/release.yml RENAMED Viewed

@@ -61,19 +61,31 @@ jobs:
     needs: build
     runs-on: ubuntu-latest
     steps:
-      - name: 'Checkout GitHub Action'
-        uses: actions/checkout@main
+      - uses: actions/checkout@v6
+      - uses: astral-sh/setup-uv@v7
+        with:
+          enable-cache: true
-      - name: Install uv
-        uses: astral-sh/setup-uv@v6
+      - uses: actions/setup-node@v6
+        with:
+          node-version: '22'
+          cache: npm
+          cache-dependency-path: ui/package-lock.json
-      # Repo root cwd: uv build puts artifacts in ./dist; uv publish looks for dist/* relative to cwd.
-      - name: 'Release Python Packages'
+      # Same bundle as `make release` / `build-bundle`: wheel must include ui/dist in src/agentevals/_static
+      # (see [tool.hatch.build] artifacts in pyproject.toml).
+      - name: Release Python package (wheel + sdist with bundled UI)
         env:
           VERSION: ${{ github.event.inputs.tag || github.ref_name }}
         run: |
           uv sync --package agentevals-cli --all-extras
           uv version "$VERSION" --package agentevals-cli
+          make build-ui
+          rm -rf src/agentevals/_static
+          cp -r ui/dist src/agentevals/_static
+          rm -rf dist
           uv build --package agentevals-cli
           uv publish dist/* --token ${{ secrets.PYPI_TOKEN }}
+          rm -rf src/agentevals/_static

agentevals_cli-0.6.0/PKG-INFO ADDED Viewed

@@ -0,0 +1,333 @@
+Metadata-Version: 2.4
+Name: agentevals-cli
+Version: 0.6.0
+Summary: Standalone framework to evaluate agent correctness based on portable OpenTelemetry traces
+License-File: LICENSE
+Requires-Python: >=3.11
+Requires-Dist: click>=8.0
+Requires-Dist: fastapi>=0.115.0
+Requires-Dist: google-adk[eval]>=1.25.0
+Requires-Dist: httpx>=0.27.0
+Requires-Dist: opentelemetry-proto>=1.36.0
+Requires-Dist: python-dotenv>=1.0.0
+Requires-Dist: python-multipart>=0.0.12
+Requires-Dist: pyyaml>=6.0
+Requires-Dist: tabulate>=0.9.0
+Requires-Dist: uvicorn[standard]>=0.32.0
+Provides-Extra: live
+Requires-Dist: httpx>=0.27.0; extra == 'live'
+Requires-Dist: mcp>=1.26.0; extra == 'live'
+Provides-Extra: openai
+Requires-Dist: openai>=2.0; extra == 'openai'
+Provides-Extra: streaming
+Requires-Dist: opentelemetry-sdk>=1.20.0; extra == 'streaming'
+Requires-Dist: websockets>=12.0; extra == 'streaming'
+Description-Content-Type: text/markdown
+<p align="center">
+  <picture>
+    <source media="(prefers-color-scheme: dark)" srcset="docs/assets/logo-color-on-transparent.svg">
+    <source media="(prefers-color-scheme: light)" srcset="docs/assets/logo-dark-on-transparent.svg">
+    <img src="docs/assets/logo-color-on-transparent.svg" alt="agentevals" width="420" />
+  </picture>
+</p>
+<h1 align="center">Ship Agents Reliably</h1>
+<p align="center">
+Benchmark your agents before they hit production.<br>
+agentevals scores performance and inference quality from OpenTelemetry traces — no re-runs, no guesswork.
+</p>
+<p align="center">
+  <a href="https://github.com/agentevals-dev/agentevals/stargazers"><img src="https://img.shields.io/github/stars/agentevals-dev/agentevals?style=social" alt="GitHub Stars"></a>
+  &nbsp;
+  <a href="https://discord.gg/cpveEn8Ah2"><img src="https://img.shields.io/discord/1435836734666707190?label=Discord&logo=discord&logoColor=white&color=5865F2" alt="Discord"></a>
+  &nbsp;
+  <a href="https://github.com/agentevals-dev/agentevals/releases"><img src="https://img.shields.io/github/v/release/agentevals-dev/agentevals?label=Release" alt="Release"></a>
+  &nbsp;
+  <a href="https://github.com/agentevals-dev/agentevals/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg" alt="License"></a>
+  &nbsp;
+  <a href="https://pypi.org/project/agentevals-cli/"><img src="https://img.shields.io/pypi/v/agentevals-cli?label=PyPI&color=blue" alt="PyPI"></a>
+</p>
+<p align="center">
+  <a href="#installation">Install</a> · <a href="#quick-start">Quick Start</a> · <a href="https://github.com/agentevals-dev/agentevals/releases">Releases</a> · <a href="CONTRIBUTING.md">Contributing</a> · <a href="https://discord.gg/cpveEn8Ah2">Discord</a>
+</p>
+---
+## What is agentevals?
+agentevals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want — no re-runs, no guesswork.
+It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others), supports Jaeger JSON and OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges.
+- **CLI** for scripting and CI pipelines
+- **Web UI** for visual inspection and local developer experience
+- **MCP server** so MCP clients can run evaluations from a conversation
+## Why agentevals?
+Most evaluation tools require you to **re-execute your agent** for every test — burning tokens, time, and money on duplicate LLM calls. agentevals takes a different approach:
+- **No re-execution** — score agents from existing traces without replaying expensive LLM calls
+- **Framework-agnostic** — works with any agent framework that emits OpenTelemetry spans
+- **Golden eval sets** — compare actual behavior against defined expected behaviors for deterministic pass/fail gating
+- **Custom evaluators** — write scoring logic in Python, JavaScript, or any language
+- **CI/CD ready** — gate deployments on quality thresholds directly in your pipeline
+- **Local-first** — no cloud dependency required; everything runs on your machine
+## How It Works
+agentevals follows three simple steps:
+1. **Collect traces** — Instrument your agent with OpenTelemetry (or export traces from your tracing backend). Point the OTLP exporter at the agentevals receiver, or load trace files directly.
+2. **Define eval sets** — Create golden evaluation sets that describe expected agent behavior: which tools should be called, in what order, and what the output should look like.
+3. **Run evaluations** — Use the CLI, Web UI, or MCP server to score traces against your eval sets. Get per-metric scores, pass/fail results, and detailed span-level breakdowns.
+> [!IMPORTANT]
+> This project is under active development. Expect breaking changes.
+## Contents
+- [Installation](#installation)
+- [Quick Start](#quick-start)
+- [Integration](#integration)
+- [CLI](#cli)
+- [Custom Evaluators](#custom-evaluators)
+- [Web UI](#web-ui)
+- [REST API Reference](#rest-api-reference)
+- [MCP Server](#mcp-server)
+- [Claude Code Skills](#claude-code-skills)
+- [Docs](#docs)
+- [Development](#development)
+- [FAQ](#faq)
+## Installation
+**From PyPI** (recommended): the published package includes the **CLI**, **REST API**, and **embedded web UI**.
+```bash
+pip install agentevals-cli
+```
+Optional extras:
+```bash
+pip install "agentevals-cli[live]"        # MCP server support
+pip install "agentevals-cli[openai]"      # OpenAI Evals API graders
+```
+**GitHub [releases](../../releases)** also ship **core** wheels (CLI and API only) and **bundle** wheels (with the embedded UI) if you need a specific version or offline `pip install ./path/to.whl`.
+**From source** with `uv` or Nix:
+```bash
+uv sync
+# or: nix develop .
+```
+See [DEVELOPMENT.md](DEVELOPMENT.md) for build instructions.
+## Quick Start
+Examples use `agentevals` on your PATH after `pip install agentevals-cli`. If you are working from a clone of this repo, use `uv run agentevals` instead.
+Run an evaluation against a sample trace:
+```bash
+agentevals run samples/helm.json \
+  --eval-set samples/eval_set_helm.json \
+  -m tool_trajectory_avg_score
+```
+List available evaluators:
+```bash
+agentevals evaluator list
+```
+## Integration
+### Zero-Code (Recommended)
+Point any OTel-instrumented agent at the receiver. No SDK, no code changes:
+```bash
+# Terminal 1
+agentevals serve --dev
+# Terminal 2
+export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
+export OTEL_RESOURCE_ATTRIBUTES="agentevals.session_name=my-agent"
+python your_agent.py
+```
+Traces stream to the UI in real-time. Works with LangChain, Strands, Google ADK, or any framework that emits OTel spans (`http/protobuf` and `http/json` supported). Sessions are auto-created and grouped by `agentevals.session_name`. Set `agentevals.eval_set_id` to associate traces with an eval set.
+See [examples/zero-code-examples/](examples/zero-code-examples/) for working examples.
+### SDK
+For programmatic session lifecycle and decorator API:
+```python
+from agentevals import AgentEvals
+app = AgentEvals()
+with app.session(eval_set_id="my-eval"):
+    agent.invoke("Roll a 20-sided die for me")
+```
+Requires `pip install "agentevals-cli[streaming]"`. See [examples/sdk_example/](examples/sdk_example/) for framework-specific patterns.
+## CLI
+```bash
+# Single trace
+agentevals run samples/helm.json \
+  --eval-set samples/eval_set_helm.json \
+  -m tool_trajectory_avg_score
+# Multiple traces
+agentevals run samples/helm.json samples/k8s.json \
+  --eval-set samples/eval_set_helm.json \
+  -m tool_trajectory_avg_score
+# JSON output
+agentevals run samples/helm.json \
+  --eval-set samples/eval_set_helm.json \
+  --output json
+# List available evaluators (builtin + community)
+agentevals evaluator list
+# List only builtin evaluators
+agentevals evaluator list --source builtin
+```
+## Custom Evaluators
+Beyond the built-in metrics, you can write your own evaluators in Python, JavaScript, or any language. An evaluator is any program that reads JSON from stdin and writes a score to stdout.
+```bash
+agentevals evaluator init my_evaluator
+```
+This scaffolds a directory with boilerplate and a manifest. You can also list supported runtimes and generate config snippets:
+```bash
+agentevals evaluator runtimes           # show supported languages
+agentevals evaluator config my_evaluator --path ./evaluators/my_evaluator.py
+```
+Implement your scoring logic, then reference it in an eval config:
+```yaml
+# eval_config.yaml
+evaluators:
+  - name: tool_trajectory_avg_score
+    type: builtin
+  - name: my_evaluator
+    type: code
+    path: ./evaluators/my_evaluator.py
+    threshold: 0.7
+```
+```bash
+agentevals run trace.json --config eval_config.yaml --eval-set eval_set.json
+```
+Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. You can also delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) using `type: openai_eval` (requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY`). See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
+## Web UI
+**Installed bundle** (port 8001):
+```bash
+agentevals serve
+```
+**From source** (two terminals):
+```bash
+uv run agentevals serve --dev    # Terminal 1
+cd ui && npm install && npm run dev             # Terminal 2 → http://localhost:5173
+```
+Upload traces and eval sets, select metrics, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID.
+## REST API Reference
+While the server is running, interactive API documentation is available at:
+| Endpoint | Description |
+|----------|-------------|
+| [`/docs`](http://localhost:8001/docs) | Swagger UI with interactive request builder |
+| [`/redoc`](http://localhost:8001/redoc) | ReDoc reference documentation |
+| [`/openapi.json`](http://localhost:8001/openapi.json) | Raw OpenAPI 3.x schema (for code generation or CI) |
+The OTLP receiver (port 4318) serves its own docs at `http://localhost:4318/docs`.
+## MCP Server
+Exposes evaluation tools to MCP clients. A `.mcp.json` at the project root lets Claude Code pick it up automatically.
+| Tool | Requires `serve` | Description |
+|------|:---:|-------------|
+| `list_metrics` | yes | List available metrics |
+| `evaluate_traces` | no | Evaluate local trace files (OTLP or Jaeger) |
+| `list_sessions` | yes | List streaming sessions |
+| `summarize_session` | yes | Structured summary of a session's tool calls |
+| `evaluate_sessions` | yes | Evaluate sessions against a golden reference |
+```bash
+# Custom server URL (requires pip install "agentevals-cli[live]")
+AGENTEVALS_SERVER_URL=http://localhost:9000 agentevals mcp
+```
+The React UI and MCP server share the same in-memory session state and can run simultaneously.
+## Claude Code Skills
+Two slash-command workflows in `.claude/skills/`, available automatically in this repo:
+| Skill | What it does |
+|-------|-------------|
+| `/eval` | Score traces or compare sessions against a golden reference |
+| `/inspect` | Turn-by-turn narrative of a live session with anomaly detection |
+## Docs
+| Guide | Description |
+|-------|-------------|
+| [Eval Set Format](docs/eval-set-format.md) | Schema, field reference, and examples for golden eval set JSON files |
+| [Custom Evaluators](docs/custom-evaluators.md) | Write your own scoring logic in Python, JavaScript, or any language |
+| [Live Streaming](docs/streaming.md) | Real-time trace streaming, dev server setup, and session management |
+| [OpenTelemetry Compatibility](docs/otel-compatibility.md) | Supported OTel conventions, message delivery mechanisms, and OTLP receiver |
+## Development
+```bash
+uv run pytest                      # run tests
+uv run agentevals serve --dev      # backend
+cd ui && npm run dev               # frontend (separate terminal)
+```
+See [DEVELOPMENT.md](DEVELOPMENT.md) for build tiers, Makefile targets, and Nix setup. To contribute, see [CONTRIBUTING.md](CONTRIBUTING.md).
+## FAQ
+**How does this compare to ADK's evaluations?**
+Unlike ADK's LocalEvalService, which couples agent execution with evaluation, agentevals only handles scoring: it takes pre-recorded traces and compares them against expected behavior using metrics like tool trajectory matching, response quality, and LLM-based judgments.
+However, if you're iterating on your agents locally, you can point your agents to agentevals and you will see rich runtime information in your browser. For more details, use the bundled wheel and explore the Local Development option in the UI.
+**How does this compare to Bedrock AgentCore's evaluation?**
+AgentCore's evaluation integration (via `strands-agents-evals`) also couples agent execution with evaluation. It re-invokes the agent for each test case, converts the resulting OTel spans to AWS's ADOT format, and scores them against 4 built-in evaluators (Helpfulness, Accuracy, Harmfulness, Relevance) via a cloud API call. This means you need an AWS account, valid credentials, and network access for every evaluation.
+agentevals takes a different approach: it scores pre-recorded traces locally without re-running anything. It works with standard Jaeger JSON and OTLP formats from any framework, supports open-ended metrics (tool trajectory matching, LLM-based judges, custom scorers), and ships with a CLI, web UI, and MCP server. No cloud dependency required, though we do include all ADK's GCP-based evals as of now.

{agentevals_cli-0.5.2 → agentevals_cli-0.6.0}/README.md RENAMED Viewed

@@ -1,15 +1,66 @@
 <p align="center">
-  <img src="docs/assets/logo-color.png" alt="agentevals" width="420" />
+  <picture>
+    <source media="(prefers-color-scheme: dark)" srcset="docs/assets/logo-color-on-transparent.svg">
+    <source media="(prefers-color-scheme: light)" srcset="docs/assets/logo-dark-on-transparent.svg">
+    <img src="docs/assets/logo-color-on-transparent.svg" alt="agentevals" width="420" />
+  </picture>
 </p>
-`agentevals` evaluates AI agent behavior from OpenTelemetry traces, without re-running the agent. Record once, score as many times as you want.
+<h1 align="center">Ship Agents Reliably</h1>
-Works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others). Supports Jaeger JSON and OTLP trace formats, built-in and custom evaluators, and LLM-based judges.
+<p align="center">
+Benchmark your agents before they hit production.<br>
+agentevals scores performance and inference quality from OpenTelemetry traces — no re-runs, no guesswork.
+</p>
+<p align="center">
+  <a href="https://github.com/agentevals-dev/agentevals/stargazers"><img src="https://img.shields.io/github/stars/agentevals-dev/agentevals?style=social" alt="GitHub Stars"></a>
+  &nbsp;
+  <a href="https://discord.gg/cpveEn8Ah2"><img src="https://img.shields.io/discord/1435836734666707190?label=Discord&logo=discord&logoColor=white&color=5865F2" alt="Discord"></a>
+  &nbsp;
+  <a href="https://github.com/agentevals-dev/agentevals/releases"><img src="https://img.shields.io/github/v/release/agentevals-dev/agentevals?label=Release" alt="Release"></a>
+  &nbsp;
+  <a href="https://github.com/agentevals-dev/agentevals/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg" alt="License"></a>
+  &nbsp;
+  <a href="https://pypi.org/project/agentevals-cli/"><img src="https://img.shields.io/pypi/v/agentevals-cli?label=PyPI&color=blue" alt="PyPI"></a>
+</p>
+<p align="center">
+  <a href="#installation">Install</a> · <a href="#quick-start">Quick Start</a> · <a href="https://github.com/agentevals-dev/agentevals/releases">Releases</a> · <a href="CONTRIBUTING.md">Contributing</a> · <a href="https://discord.gg/cpveEn8Ah2">Discord</a>
+</p>
+---
+## What is agentevals?
+agentevals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want — no re-runs, no guesswork.
+It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others), supports Jaeger JSON and OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges.
 - **CLI** for scripting and CI pipelines
 - **Web UI** for visual inspection and local developer experience
 - **MCP server** so MCP clients can run evaluations from a conversation
+## Why agentevals?
+Most evaluation tools require you to **re-execute your agent** for every test — burning tokens, time, and money on duplicate LLM calls. agentevals takes a different approach:
+- **No re-execution** — score agents from existing traces without replaying expensive LLM calls
+- **Framework-agnostic** — works with any agent framework that emits OpenTelemetry spans
+- **Golden eval sets** — compare actual behavior against defined expected behaviors for deterministic pass/fail gating
+- **Custom evaluators** — write scoring logic in Python, JavaScript, or any language
+- **CI/CD ready** — gate deployments on quality thresholds directly in your pipeline
+- **Local-first** — no cloud dependency required; everything runs on your machine
+## How It Works
+agentevals follows three simple steps:
+1. **Collect traces** — Instrument your agent with OpenTelemetry (or export traces from your tracing backend). Point the OTLP exporter at the agentevals receiver, or load trace files directly.
+2. **Define eval sets** — Create golden evaluation sets that describe expected agent behavior: which tools should be called, in what order, and what the output should look like.
+3. **Run evaluations** — Use the CLI, Web UI, or MCP server to score traces against your eval sets. Get per-metric scores, pass/fail results, and detailed span-level breakdowns.
 > [!IMPORTANT]
 > This project is under active development. Expect breaking changes.
@@ -30,15 +81,21 @@ Works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and
 ## Installation
-Grab a wheel from the [releases page](../../releases). The **core** wheel has the CLI and REST API. The **bundle** wheel adds streaming and the embedded web UI.
+**From PyPI** (recommended): the published package includes the **CLI**, **REST API**, and **embedded web UI**.
 ```bash
-pip install agentevals-<version>-py3-none-any.whl
+pip install agentevals-cli
+```
+Optional extras:
-# For MCP server support:
-pip install "agentevals-<version>-py3-none-any.whl[live]"
+```bash
+pip install "agentevals-cli[live]"        # MCP server support
+pip install "agentevals-cli[openai]"      # OpenAI Evals API graders
 ```
+**GitHub [releases](../../releases)** also ship **core** wheels (CLI and API only) and **bundle** wheels (with the embedded UI) if you need a specific version or offline `pip install ./path/to.whl`.
 **From source** with `uv` or Nix:
 ```bash
@@ -50,10 +107,12 @@ See [DEVELOPMENT.md](DEVELOPMENT.md) for build instructions.
 ## Quick Start
+Examples use `agentevals` on your PATH after `pip install agentevals-cli`. If you are working from a clone of this repo, use `uv run agentevals` instead.
 Run an evaluation against a sample trace:
 ```bash
-uv run agentevals run samples/helm.json \
+agentevals run samples/helm.json \
   --eval-set samples/eval_set_helm.json \
   -m tool_trajectory_avg_score
 ```
@@ -61,7 +120,7 @@ uv run agentevals run samples/helm.json \
 List available evaluators:
 ```bash
-uv run agentevals evaluator list
+agentevals evaluator list
 ```
 ## Integration
@@ -72,7 +131,7 @@ Point any OTel-instrumented agent at the receiver. No SDK, no code changes:
 ```bash
 # Terminal 1
-uv run agentevals serve --dev
+agentevals serve --dev
 # Terminal 2
 export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
@@ -97,31 +156,31 @@ with app.session(eval_set_id="my-eval"):
     agent.invoke("Roll a 20-sided die for me")
 ```
-Requires `pip install "agentevals[streaming]"`. See [examples/sdk_example/](examples/sdk_example/) for framework-specific patterns.
+Requires `pip install "agentevals-cli[streaming]"`. See [examples/sdk_example/](examples/sdk_example/) for framework-specific patterns.
 ## CLI
 ```bash
 # Single trace
-uv run agentevals run samples/helm.json \
+agentevals run samples/helm.json \
   --eval-set samples/eval_set_helm.json \
   -m tool_trajectory_avg_score
 # Multiple traces
-uv run agentevals run samples/helm.json samples/k8s.json \
+agentevals run samples/helm.json samples/k8s.json \
   --eval-set samples/eval_set_helm.json \
   -m tool_trajectory_avg_score
 # JSON output
-uv run agentevals run samples/helm.json \
+agentevals run samples/helm.json \
   --eval-set samples/eval_set_helm.json \
   --output json
 # List available evaluators (builtin + community)
-uv run agentevals evaluator list
+agentevals evaluator list
 # List only builtin evaluators
-uv run agentevals evaluator list --source builtin
+agentevals evaluator list --source builtin
 ```
 ## Custom Evaluators
@@ -157,7 +216,7 @@ evaluators:
 agentevals run trace.json --config eval_config.yaml --eval-set eval_set.json
 ```
-Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
+Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. You can also delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) using `type: openai_eval` (requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY`). See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
 ## Web UI
@@ -201,8 +260,8 @@ Exposes evaluation tools to MCP clients. A `.mcp.json` at the project root lets
 | `evaluate_sessions` | yes | Evaluate sessions against a golden reference |
 ```bash
-# Custom server URL
-AGENTEVALS_SERVER_URL=http://localhost:9000 uv run agentevals mcp
+# Custom server URL (requires pip install "agentevals-cli[live]")
+AGENTEVALS_SERVER_URL=http://localhost:9000 agentevals mcp
 ```
 The React UI and MCP server share the same in-memory session state and can run simultaneously.

agentevals_cli-0.6.0/docs/assets/logo-color-on-transparent.svg ADDED Viewed

@@ -0,0 +1,13 @@
+<svg width="3302" height="1066" viewBox="0 0 3302 1066" fill="none" xmlns="http://www.w3.org/2000/svg">
+<path d="M518.695 264C560.958 264 595.207 298.274 595.207 340.537C595.207 382.8 560.958 417.048 518.695 417.048C454.983 417.048 403.305 468.548 403 532.184V533.304C403.306 596.94 454.983 648.438 518.695 648.438H518.722C560.985 648.439 595.232 682.687 595.232 724.95C595.232 767.213 560.984 801.461 518.722 801.461C476.459 801.461 442.21 767.213 442.21 724.95V724.67C442.057 661.008 390.482 609.408 326.795 609.255H326.515C284.252 609.255 250.004 575.006 250.004 532.743C250.004 490.48 284.252 456.232 326.515 456.232H326.642C390.431 456.156 442.108 404.453 442.185 340.664V340.512C442.185 298.249 476.432 264 518.695 264ZM492.436 469.353C527.452 454.848 567.596 471.476 582.101 506.492C596.605 541.508 579.976 581.653 544.96 596.157C509.944 610.661 469.8 594.033 455.296 559.017C440.792 524.001 457.42 483.857 492.436 469.353Z" fill="#8023C3"/>
+<path d="M1029.16 401.476V655.084H982.736L976.321 616.93C956.253 644.357 928.75 658.054 893.878 658.054C870.849 658.054 850.254 652.839 832.16 642.443C814.066 632.046 799.887 617.029 789.688 597.359C779.49 577.721 774.391 554.683 774.391 528.247C774.391 501.81 779.589 479.796 789.952 460.125C800.315 440.487 814.56 425.305 832.654 414.546C850.748 403.819 871.178 398.439 893.878 398.439C912.301 398.439 928.454 401.839 942.271 408.605C956.089 415.371 967.274 424.711 975.861 436.593V401.41H1029.19L1029.16 401.476ZM902.76 612.97C924.802 612.97 942.567 605.214 956.089 589.701C969.577 574.189 976.321 554.023 976.321 529.27C976.321 504.516 969.577 483.294 956.089 467.617C942.6 451.94 924.802 444.085 902.76 444.085C880.718 444.085 862.92 451.94 849.432 467.617C835.944 483.294 829.199 503.526 829.199 528.28C829.199 553.033 835.944 573.76 849.432 589.47C862.92 605.148 880.685 613.003 902.76 613.003V612.97Z" fill="white"/>
+<path d="M2723.45 401.476V655.084H2677.03L2670.61 616.93C2650.54 644.357 2623.04 658.054 2588.17 658.054C2565.14 658.054 2544.54 652.839 2526.45 642.443C2508.36 632.046 2494.18 617.029 2483.98 597.359C2473.78 577.721 2468.68 554.683 2468.68 528.247C2468.68 501.81 2473.88 479.796 2484.24 460.125C2494.6 440.487 2508.85 425.305 2526.94 414.546C2545.04 403.819 2565.47 398.439 2588.17 398.439C2606.59 398.439 2622.74 401.839 2636.56 408.605C2650.38 415.371 2661.56 424.711 2670.15 436.593V401.41H2723.48L2723.45 401.476ZM2597.05 612.97C2619.09 612.97 2636.86 605.214 2650.38 589.701C2663.87 574.189 2670.61 554.023 2670.61 529.27C2670.61 504.516 2663.87 483.294 2650.38 467.617C2636.89 451.94 2619.09 444.085 2597.05 444.085C2575.01 444.085 2557.21 451.94 2543.72 467.617C2530.23 483.294 2523.49 503.526 2523.49 528.28C2523.49 553.033 2530.23 573.76 2543.72 589.47C2557.21 605.148 2574.97 613.003 2597.05 613.003V612.97Z" fill="white"/>
+<path d="M1308.72 401.47V644.682C1308.72 680.36 1298.19 707.985 1277.14 727.655C1256.08 747.293 1223.48 757.129 1179.36 757.129C1145.11 757.129 1117.32 749.439 1095.93 734.091C1074.51 718.744 1062.67 697.027 1060.37 668.94H1114.68C1117.97 683.132 1125.54 694.123 1137.38 701.879C1149.23 709.635 1164.52 713.529 1183.31 713.529C1231.7 713.529 1255.88 689.898 1255.88 642.701V614.482C1237.46 642.206 1209.96 656.101 1173.44 656.101C1150.41 656.101 1129.82 650.887 1111.72 640.49C1093.63 630.094 1079.45 615.242 1069.25 595.901C1059.05 576.593 1053.95 553.721 1053.95 527.284C1053.95 500.847 1059.15 479.394 1069.51 459.922C1079.88 440.449 1094.12 425.333 1112.22 414.606C1130.31 403.88 1150.74 398.5 1173.44 398.5C1192.52 398.5 1209 402.296 1222.82 409.887C1236.64 417.478 1247.82 427.907 1256.41 441.109L1262.33 401.47H1308.75H1308.72ZM1182.32 610.984C1204.36 610.984 1222.13 603.294 1235.65 587.947C1249.14 572.6 1255.88 552.698 1255.88 528.274C1255.88 503.851 1249.14 482.794 1235.65 467.084C1222.16 451.406 1204.36 443.551 1182.32 443.551C1160.28 443.551 1142.48 451.307 1128.99 466.82C1115.51 482.332 1108.76 502.498 1108.76 527.251C1108.76 552.005 1115.51 572.17 1128.99 587.683C1142.48 603.195 1160.25 610.951 1182.32 610.951V610.984Z" fill="white"/>
+<path d="M1324.01 528.769C1324.01 502.696 1329.21 479.823 1339.57 460.153C1349.94 440.515 1364.41 425.333 1383.03 414.573C1401.62 403.847 1422.94 398.467 1446.99 398.467C1471.03 398.467 1492.81 403.417 1511.43 413.319C1530.02 423.22 1544.66 437.28 1555.39 455.433C1566.08 473.585 1571.61 494.906 1571.93 519.33C1571.93 525.931 1571.44 532.697 1570.45 539.628H1379.87V542.598C1381.19 564.711 1388.1 582.237 1400.6 595.109C1413.1 607.98 1429.71 614.416 1450.47 614.416C1466.92 614.416 1480.74 610.555 1491.96 602.766C1503.14 595.01 1510.55 584.019 1514.16 569.827H1567.49C1562.89 595.571 1550.45 616.727 1530.22 633.229C1509.99 649.731 1484.72 657.982 1454.42 657.982C1428.07 657.982 1405.14 652.636 1385.53 641.876C1365.96 631.15 1350.79 616.034 1340.1 596.561C1329.41 577.088 1324.04 554.447 1324.04 528.703L1324.01 528.769ZM1517.55 500.517C1515.25 482.035 1507.91 467.579 1495.58 457.182C1483.24 446.786 1467.68 441.571 1448.93 441.571C1431.49 441.571 1416.42 446.951 1403.76 457.677C1391.09 468.404 1383.76 482.695 1381.78 500.517H1517.55Z" fill="white"/>
+<path d="M1587.2 401.47H1633.61L1639.54 434.673C1658.62 410.58 1685.63 398.5 1720.5 398.5C1735.3 398.5 1748.96 400.711 1761.49 405.2C1773.99 409.656 1784.78 416.686 1793.83 426.257C1802.88 435.828 1809.88 447.974 1814.82 462.661C1819.75 477.348 1822.22 494.94 1822.22 515.402V655.078H1768.4V518.373C1768.4 494.28 1763.3 475.929 1753.1 463.387C1742.9 450.845 1727.93 444.574 1708.16 444.574C1687.11 444.574 1670.56 451.935 1658.55 466.622C1646.54 481.309 1640.52 501.541 1640.52 527.317V655.111H1587.2V401.47Z" fill="white"/>
+<path d="M1845.75 401.47V330.609H1899.57V401.437H1960.3V448.502H1899.57V580.752C1899.57 590.653 1901.54 597.683 1905.49 601.809C1909.44 605.934 1916.18 608.014 1925.72 608.014H1966.22V655.078H1914.87C1890.85 655.078 1873.31 649.566 1862.29 638.477C1851.27 627.42 1845.75 609.994 1845.75 586.23V448.535" fill="white"/>
+<path d="M1976.12 528.769C1976.12 502.696 1981.32 479.823 1991.69 460.153C2002.05 440.515 2016.52 425.333 2035.14 414.573C2053.73 403.847 2075.05 398.467 2099.1 398.467C2123.15 398.467 2144.93 403.417 2163.51 413.319C2182.1 423.22 2196.77 437.28 2207.47 455.433C2218.16 473.585 2223.69 494.906 2224.01 519.33C2224.01 525.931 2223.52 532.697 2222.53 539.628H2031.95V542.598C2033.27 564.711 2040.18 582.237 2052.68 595.109C2065.18 607.98 2081.83 614.416 2102.55 614.416C2119 614.416 2132.85 610.555 2144.04 602.766C2155.22 595.01 2162.63 584.019 2166.24 569.827H2219.57C2214.97 595.571 2202.53 616.727 2182.3 633.229C2162.07 649.731 2136.8 657.982 2106.5 657.982C2080.15 657.982 2057.19 652.636 2037.61 641.876C2018.04 631.15 2002.87 616.034 1992.18 596.561C1981.49 577.088 1976.12 554.447 1976.12 528.703V528.769ZM2169.67 500.517C2167.36 482.035 2160.03 467.579 2147.69 457.182C2135.35 446.786 2119.79 441.571 2101.04 441.571C2083.6 441.571 2068.54 446.951 2055.87 457.677C2043.2 468.404 2035.87 482.695 2033.89 500.517H2169.67Z" fill="white"/>
+<path d="M2216.86 401.475H2274.14L2343.75 597.621L2412.38 401.475H2468.67L2375.33 655.082H2310.16L2216.83 401.475H2216.86Z" fill="white"/>
+<path d="M2754.43 308.332H2807.76V655.079H2754.43V308.332Z" fill="white"/>
+<path d="M2882.6 571.352C2883.59 584.554 2889.77 595.379 2901.12 603.796C2912.47 612.212 2927.21 616.436 2945.31 616.436C2961.43 616.436 2974.52 613.4 2984.55 607.261C2994.59 601.155 2999.62 592.97 2999.62 582.739C2999.62 574.157 2997.32 567.721 2992.71 563.431C2988.11 559.14 2981.92 556.104 2974.19 554.256C2966.46 552.44 2954.52 550.559 2938.4 548.546C2916.36 545.905 2898.16 542.341 2883.85 537.885C2869.54 533.43 2857.99 526.334 2849.28 516.597C2840.56 506.861 2836.18 493.725 2836.18 477.223C2836.18 461.711 2840.56 447.915 2849.28 435.868C2857.99 423.821 2870 414.481 2885.33 407.88C2900.63 401.279 2918 397.979 2937.41 397.979C2969.32 397.979 2995.25 405.075 3015.18 419.267C3035.09 433.459 3045.88 453.459 3047.52 479.203H2995.67C2994.36 467.651 2988.6 458.146 2978.4 450.72C2968.2 443.294 2955.37 439.564 2939.88 439.564C2924.38 439.564 2911.92 442.535 2902.34 448.476C2892.8 454.416 2888.03 462.503 2888.03 472.734C2888.03 480.325 2890.4 486.035 2895.2 489.83C2899.97 493.626 2905.99 496.266 2913.23 497.752C2920.47 499.237 2932.15 500.986 2948.3 502.966C2970.01 505.277 2988.31 508.841 3003.11 513.627C3017.91 518.413 3029.76 526.004 3038.67 536.4C3047.56 546.797 3052 560.923 3052 578.745C3052 594.587 3047.39 608.548 3038.18 620.595C3028.97 632.642 3016.27 641.883 3000.15 648.319C2984.03 654.755 2965.9 657.989 2945.83 657.989C2911.92 657.989 2884.51 650.299 2863.62 634.952C2842.73 619.605 2831.94 598.383 2831.28 571.286H2882.64L2882.6 571.352Z" fill="white"/>
+</svg>

agentevals_cli-0.6.0/docs/assets/logo-dark-on-transparent.svg ADDED Viewed

@@ -0,0 +1,13 @@
+<svg width="3302" height="1066" viewBox="0 0 3302 1066" fill="none" xmlns="http://www.w3.org/2000/svg">
+<path d="M518.695 264C560.958 264 595.207 298.274 595.207 340.537C595.207 382.8 560.958 417.048 518.695 417.048C454.983 417.048 403.305 468.548 403 532.184V533.304C403.306 596.94 454.983 648.438 518.695 648.438H518.722C560.985 648.439 595.232 682.687 595.232 724.95C595.232 767.213 560.984 801.461 518.722 801.461C476.459 801.461 442.21 767.213 442.21 724.95V724.67C442.057 661.008 390.482 609.408 326.795 609.255H326.515C284.252 609.255 250.004 575.006 250.004 532.743C250.004 490.48 284.252 456.232 326.515 456.232H326.642C390.431 456.156 442.108 404.453 442.185 340.664V340.512C442.185 298.249 476.432 264 518.695 264ZM492.436 469.353C527.452 454.848 567.596 471.476 582.101 506.492C596.605 541.508 579.976 581.653 544.96 596.157C509.944 610.661 469.8 594.033 455.296 559.017C440.792 524.001 457.42 483.857 492.436 469.353Z" fill="#151927"/>
+<path d="M1029.16 401.476V655.084H982.736L976.321 616.93C956.253 644.357 928.75 658.054 893.878 658.054C870.849 658.054 850.254 652.839 832.16 642.443C814.066 632.046 799.887 617.029 789.688 597.359C779.49 577.721 774.391 554.683 774.391 528.247C774.391 501.81 779.589 479.796 789.952 460.125C800.315 440.487 814.56 425.305 832.654 414.546C850.748 403.819 871.178 398.439 893.878 398.439C912.301 398.439 928.454 401.839 942.271 408.605C956.089 415.371 967.274 424.711 975.861 436.593V401.41H1029.19L1029.16 401.476ZM902.76 612.97C924.802 612.97 942.567 605.214 956.089 589.701C969.577 574.189 976.321 554.023 976.321 529.27C976.321 504.516 969.577 483.294 956.089 467.617C942.6 451.94 924.802 444.085 902.76 444.085C880.718 444.085 862.92 451.94 849.432 467.617C835.944 483.294 829.199 503.526 829.199 528.28C829.199 553.033 835.944 573.76 849.432 589.47C862.92 605.148 880.685 613.003 902.76 613.003V612.97Z" fill="#151927"/>
+<path d="M2723.45 401.476V655.084H2677.03L2670.61 616.93C2650.54 644.357 2623.04 658.054 2588.17 658.054C2565.14 658.054 2544.54 652.839 2526.45 642.443C2508.36 632.046 2494.18 617.029 2483.98 597.359C2473.78 577.721 2468.68 554.683 2468.68 528.247C2468.68 501.81 2473.88 479.796 2484.24 460.125C2494.6 440.487 2508.85 425.305 2526.94 414.546C2545.04 403.819 2565.47 398.439 2588.17 398.439C2606.59 398.439 2622.74 401.839 2636.56 408.605C2650.38 415.371 2661.56 424.711 2670.15 436.593V401.41H2723.48L2723.45 401.476ZM2597.05 612.97C2619.09 612.97 2636.86 605.214 2650.38 589.701C2663.87 574.189 2670.61 554.023 2670.61 529.27C2670.61 504.516 2663.87 483.294 2650.38 467.617C2636.89 451.94 2619.09 444.085 2597.05 444.085C2575.01 444.085 2557.21 451.94 2543.72 467.617C2530.23 483.294 2523.49 503.526 2523.49 528.28C2523.49 553.033 2530.23 573.76 2543.72 589.47C2557.21 605.148 2574.97 613.003 2597.05 613.003V612.97Z" fill="#151927"/>
+<path d="M1308.72 401.47V644.682C1308.72 680.36 1298.19 707.985 1277.14 727.655C1256.08 747.293 1223.48 757.129 1179.36 757.129C1145.11 757.129 1117.32 749.439 1095.93 734.091C1074.51 718.744 1062.67 697.027 1060.37 668.94H1114.68C1117.97 683.132 1125.54 694.123 1137.38 701.879C1149.23 709.635 1164.52 713.529 1183.31 713.529C1231.7 713.529 1255.88 689.898 1255.88 642.701V614.482C1237.46 642.206 1209.96 656.101 1173.44 656.101C1150.41 656.101 1129.82 650.887 1111.72 640.49C1093.63 630.094 1079.45 615.242 1069.25 595.901C1059.05 576.593 1053.95 553.721 1053.95 527.284C1053.95 500.847 1059.15 479.394 1069.51 459.922C1079.88 440.449 1094.12 425.333 1112.22 414.606C1130.31 403.88 1150.74 398.5 1173.44 398.5C1192.52 398.5 1209 402.296 1222.82 409.887C1236.64 417.478 1247.82 427.907 1256.41 441.109L1262.33 401.47H1308.75H1308.72ZM1182.32 610.984C1204.36 610.984 1222.13 603.294 1235.65 587.947C1249.14 572.6 1255.88 552.698 1255.88 528.274C1255.88 503.851 1249.14 482.794 1235.65 467.084C1222.16 451.406 1204.36 443.551 1182.32 443.551C1160.28 443.551 1142.48 451.307 1128.99 466.82C1115.51 482.332 1108.76 502.498 1108.76 527.251C1108.76 552.005 1115.51 572.17 1128.99 587.683C1142.48 603.195 1160.25 610.951 1182.32 610.951V610.984Z" fill="#151927"/>
+<path d="M1324.01 528.769C1324.01 502.696 1329.21 479.823 1339.57 460.153C1349.94 440.515 1364.41 425.333 1383.03 414.573C1401.62 403.847 1422.94 398.467 1446.99 398.467C1471.03 398.467 1492.81 403.417 1511.43 413.319C1530.02 423.22 1544.66 437.28 1555.39 455.433C1566.08 473.585 1571.61 494.906 1571.93 519.33C1571.93 525.931 1571.44 532.697 1570.45 539.628H1379.87V542.598C1381.19 564.711 1388.1 582.237 1400.6 595.109C1413.1 607.98 1429.71 614.416 1450.47 614.416C1466.92 614.416 1480.74 610.555 1491.96 602.766C1503.14 595.01 1510.55 584.019 1514.16 569.827H1567.49C1562.89 595.571 1550.45 616.727 1530.22 633.229C1509.99 649.731 1484.72 657.982 1454.42 657.982C1428.07 657.982 1405.14 652.636 1385.53 641.876C1365.96 631.15 1350.79 616.034 1340.1 596.561C1329.41 577.088 1324.04 554.447 1324.04 528.703L1324.01 528.769ZM1517.55 500.517C1515.25 482.035 1507.91 467.579 1495.58 457.182C1483.24 446.786 1467.68 441.571 1448.93 441.571C1431.49 441.571 1416.42 446.951 1403.76 457.677C1391.09 468.404 1383.76 482.695 1381.78 500.517H1517.55Z" fill="#151927"/>
+<path d="M1587.2 401.47H1633.61L1639.54 434.673C1658.62 410.58 1685.63 398.5 1720.5 398.5C1735.3 398.5 1748.96 400.711 1761.49 405.2C1773.99 409.656 1784.78 416.686 1793.83 426.257C1802.88 435.828 1809.88 447.974 1814.82 462.661C1819.75 477.348 1822.22 494.94 1822.22 515.402V655.078H1768.4V518.373C1768.4 494.28 1763.3 475.929 1753.1 463.387C1742.9 450.845 1727.93 444.574 1708.16 444.574C1687.11 444.574 1670.56 451.935 1658.55 466.622C1646.54 481.309 1640.52 501.541 1640.52 527.317V655.111H1587.2V401.47Z" fill="#151927"/>
+<path d="M1845.75 401.47V330.609H1899.57V401.437H1960.3V448.502H1899.57V580.752C1899.57 590.653 1901.54 597.683 1905.49 601.809C1909.44 605.934 1916.18 608.014 1925.72 608.014H1966.22V655.078H1914.87C1890.85 655.078 1873.31 649.566 1862.29 638.477C1851.27 627.42 1845.75 609.994 1845.75 586.23V448.535" fill="#151927"/>
+<path d="M1976.12 528.769C1976.12 502.696 1981.32 479.823 1991.69 460.153C2002.05 440.515 2016.52 425.333 2035.14 414.573C2053.73 403.847 2075.05 398.467 2099.1 398.467C2123.15 398.467 2144.93 403.417 2163.51 413.319C2182.1 423.22 2196.77 437.28 2207.47 455.433C2218.16 473.585 2223.69 494.906 2224.01 519.33C2224.01 525.931 2223.52 532.697 2222.53 539.628H2031.95V542.598C2033.27 564.711 2040.18 582.237 2052.68 595.109C2065.18 607.98 2081.83 614.416 2102.55 614.416C2119 614.416 2132.85 610.555 2144.04 602.766C2155.22 595.01 2162.63 584.019 2166.24 569.827H2219.57C2214.97 595.571 2202.53 616.727 2182.3 633.229C2162.07 649.731 2136.8 657.982 2106.5 657.982C2080.15 657.982 2057.19 652.636 2037.61 641.876C2018.04 631.15 2002.87 616.034 1992.18 596.561C1981.49 577.088 1976.12 554.447 1976.12 528.703V528.769ZM2169.67 500.517C2167.36 482.035 2160.03 467.579 2147.69 457.182C2135.35 446.786 2119.79 441.571 2101.04 441.571C2083.6 441.571 2068.54 446.951 2055.87 457.677C2043.2 468.404 2035.87 482.695 2033.89 500.517H2169.67Z" fill="#151927"/>
+<path d="M2216.86 401.475H2274.14L2343.75 597.621L2412.38 401.475H2468.67L2375.33 655.082H2310.16L2216.83 401.475H2216.86Z" fill="#151927"/>
+<path d="M2754.43 308.332H2807.76V655.079H2754.43V308.332Z" fill="#151927"/>
+<path d="M2882.6 571.352C2883.59 584.554 2889.77 595.379 2901.12 603.796C2912.47 612.212 2927.21 616.436 2945.31 616.436C2961.43 616.436 2974.52 613.4 2984.55 607.261C2994.59 601.155 2999.62 592.97 2999.62 582.739C2999.62 574.157 2997.32 567.721 2992.71 563.431C2988.11 559.14 2981.92 556.104 2974.19 554.256C2966.46 552.44 2954.52 550.559 2938.4 548.546C2916.36 545.905 2898.16 542.341 2883.85 537.885C2869.54 533.43 2857.99 526.334 2849.28 516.597C2840.56 506.861 2836.18 493.725 2836.18 477.223C2836.18 461.711 2840.56 447.915 2849.28 435.868C2857.99 423.821 2870 414.481 2885.33 407.88C2900.63 401.279 2918 397.979 2937.41 397.979C2969.32 397.979 2995.25 405.075 3015.18 419.267C3035.09 433.459 3045.88 453.459 3047.52 479.203H2995.67C2994.36 467.651 2988.6 458.146 2978.4 450.72C2968.2 443.294 2955.37 439.564 2939.88 439.564C2924.38 439.564 2911.92 442.535 2902.34 448.476C2892.8 454.416 2888.03 462.503 2888.03 472.734C2888.03 480.325 2890.4 486.035 2895.2 489.83C2899.97 493.626 2905.99 496.266 2913.23 497.752C2920.47 499.237 2932.15 500.986 2948.3 502.966C2970.01 505.277 2988.31 508.841 3003.11 513.627C3017.91 518.413 3029.76 526.004 3038.67 536.4C3047.56 546.797 3052 560.923 3052 578.745C3052 594.587 3047.39 608.548 3038.18 620.595C3028.97 632.642 3016.27 641.883 3000.15 648.319C2984.03 654.755 2965.9 657.989 2945.83 657.989C2911.92 657.989 2884.51 650.299 2863.62 634.952C2842.73 619.605 2831.94 598.383 2831.28 571.286H2882.64L2882.6 571.352Z" fill="#151927"/>
+</svg>

agentevals-cli 0.5.2__tar.gz → 0.6.0__tar.gz

agentevals-cli 0.5.2tar.gz → 0.6.0tar.gz