PyPI - agentevals-cli - Versions diffs - 0.9.4__tar.gz → 0.9.6__tar.gz - Mend

agentevals-cli 0.9.4__tar.gz → 0.9.6__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (293)

{agentevals_cli-0.9.4 → agentevals_cli-0.9.6}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: agentevals-cli
-Version: 0.9.4
+Version: 0.9.6
 Summary: Standalone framework to evaluate agent correctness based on portable OpenTelemetry traces
 License-File: LICENSE
 Requires-Python: >=3.11
@@ -280,7 +280,7 @@ See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protoc
 agentevals serve            # bundled UI on http://localhost:8001
 ```
-Upload traces and eval sets, select evaluators, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
+Upload traces and eval sets, select evaluators, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. With the Postgres backend enabled, the "Run History" tab persists every evaluation and lets you group and trend runs by eval set or agent over time; see the [Run History guide](docs/run-history.md). For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
 Interactive API docs are available at `/docs` (Swagger) and `/redoc` while the server is running. The OTLP receiver on port 4318 serves its own docs at `http://localhost:4318/docs`.
@@ -318,11 +318,12 @@ See the [Kubernetes example](examples/kubernetes/README.md) for an end-to-end wa
 #### Postgres backend (`/api/runs`)
-> **Preview.** Persistent run history backed by Postgres is under active
-> development. The `storage.*` and `database.postgres.*` chart values, the
-> `/api/runs` HTTP surface, and the database schema may change incompatibly
-> in upcoming releases. Operators evaluating this feature should plan to
-> recreate the agentevals schema when upgrading between minor versions.
+> **Preview.** Persisting evaluations and exploring them in the UI works end
+> to end (see the [Run History guide](docs/run-history.md)), but the storage
+> layer is still stabilizing. The `storage.*` and `database.postgres.*` chart
+> values, the `/api/runs` HTTP surface, and the database schema may change
+> incompatibly in upcoming releases. Operators evaluating this feature should
+> plan to recreate the agentevals schema when upgrading between minor versions.
 > Default in-memory mode is unaffected.
 By default the chart deploys agentevals with an in-memory backend; runs and results are not persisted. To enable the async `POST /api/runs` pipeline with durable Postgres-backed state:
@@ -341,6 +342,8 @@ helm install agentevals oci://ghcr.io/agentevals-dev/agentevals/helm/agentevals
 When `storage.backend=postgres` the app applies any pending schema migrations on startup (advisory-lock protected, safe across replicas) and starts an in-process worker that processes the run queue. Without `storage.backend=postgres` the `/api/runs` endpoints return 503 with a hint pointing at the env var.
+Persisted runs power the **Run History** view in the UI, where you can group and trend evaluations by eval set or agent and drill into per-run detail. See the [Run History guide](docs/run-history.md) for the full feature walkthrough and local-dev setup.
 ## MCP Server
 Exposes evaluation tools to MCP clients. A `.mcp.json` at the project root lets Claude Code pick it up automatically.
@@ -389,6 +392,7 @@ Working examples are in the [`examples/`](examples/) directory:
 | [Eval Set Format](docs/eval-set-format.md) | Schema, field reference, and examples for golden eval set JSON files |
 | [Custom Evaluators](docs/custom-evaluators.md) | Write your own scoring logic in Python, JavaScript, or any language |
 | [Live Streaming](docs/streaming.md) | Real-time trace streaming, dev server setup, and session management |
+| [Run History](docs/run-history.md) | Persisting evaluations to Postgres and exploring them over time in the UI |
 | [OpenTelemetry Compatibility](docs/otel-compatibility.md) | Supported OTel conventions, message delivery mechanisms, and OTLP receiver |
 ## Development

{agentevals_cli-0.9.4 → agentevals_cli-0.9.6}/README.md RENAMED

@@ -250,7 +250,7 @@ See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protoc
 agentevals serve            # bundled UI on http://localhost:8001
 ```
-Upload traces and eval sets, select evaluators, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
+Upload traces and eval sets, select evaluators, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. With the Postgres backend enabled, the "Run History" tab persists every evaluation and lets you group and trend runs by eval set or agent over time; see the [Run History guide](docs/run-history.md). For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
 Interactive API docs are available at `/docs` (Swagger) and `/redoc` while the server is running. The OTLP receiver on port 4318 serves its own docs at `http://localhost:4318/docs`.
@@ -288,11 +288,12 @@ See the [Kubernetes example](examples/kubernetes/README.md) for an end-to-end wa
 #### Postgres backend (`/api/runs`)
-> **Preview.** Persistent run history backed by Postgres is under active
-> development. The `storage.*` and `database.postgres.*` chart values, the
-> `/api/runs` HTTP surface, and the database schema may change incompatibly
-> in upcoming releases. Operators evaluating this feature should plan to
-> recreate the agentevals schema when upgrading between minor versions.
+> **Preview.** Persisting evaluations and exploring them in the UI works end
+> to end (see the [Run History guide](docs/run-history.md)), but the storage
+> layer is still stabilizing. The `storage.*` and `database.postgres.*` chart
+> values, the `/api/runs` HTTP surface, and the database schema may change
+> incompatibly in upcoming releases. Operators evaluating this feature should
+> plan to recreate the agentevals schema when upgrading between minor versions.
 > Default in-memory mode is unaffected.
 By default the chart deploys agentevals with an in-memory backend; runs and results are not persisted. To enable the async `POST /api/runs` pipeline with durable Postgres-backed state:
@@ -311,6 +312,8 @@ helm install agentevals oci://ghcr.io/agentevals-dev/agentevals/helm/agentevals
 When `storage.backend=postgres` the app applies any pending schema migrations on startup (advisory-lock protected, safe across replicas) and starts an in-process worker that processes the run queue. Without `storage.backend=postgres` the `/api/runs` endpoints return 503 with a hint pointing at the env var.
+Persisted runs power the **Run History** view in the UI, where you can group and trend evaluations by eval set or agent and drill into per-run detail. See the [Run History guide](docs/run-history.md) for the full feature walkthrough and local-dev setup.
 ## MCP Server
 Exposes evaluation tools to MCP clients. A `.mcp.json` at the project root lets Claude Code pick it up automatically.
@@ -359,6 +362,7 @@ Working examples are in the [`examples/`](examples/) directory:
 | [Eval Set Format](docs/eval-set-format.md) | Schema, field reference, and examples for golden eval set JSON files |
 | [Custom Evaluators](docs/custom-evaluators.md) | Write your own scoring logic in Python, JavaScript, or any language |
 | [Live Streaming](docs/streaming.md) | Real-time trace streaming, dev server setup, and session management |
+| [Run History](docs/run-history.md) | Persisting evaluations to Postgres and exploring them over time in the UI |
 | [OpenTelemetry Compatibility](docs/otel-compatibility.md) | Supported OTel conventions, message delivery mechanisms, and OTLP receiver |
 ## Development

agentevals_cli-0.9.6/docs/run-history.md ADDED

@@ -0,0 +1,105 @@
+# Run History
+Run history turns each evaluation into a durable record you can revisit, group, and trend over time. When agentevals runs with the Postgres storage backend, every evaluation (whether an uploaded trace file or a live streaming session) is persisted as a **run** with its per case scores, and the UI's **Run History** view lets you explore how an agent or eval set performs across many runs.
+Without the Postgres backend, agentevals is stateless: evaluations still work and results show on the dashboard, but nothing is persisted and the run-history endpoints return `503`.
+## Enabling durable storage
+Run history requires the Postgres storage backend. It is opt in.
+### Local development
+The quickest path uses the Makefile target, which starts a throwaway Postgres container, applies migrations, and serves the app wired to it:
+```bash
+make dev-backend-pg
+```
+That is equivalent to:
+```bash
+export AGENTEVALS_STORAGE_BACKEND=postgres
+export AGENTEVALS_DATABASE_URL=postgresql://agentevals:agentevals@localhost:5432/agentevals
+uv run agentevals migrate up          # apply schema migrations
+uv run agentevals serve --dev         # serve with the Postgres backend
+```
+Run the UI in a second terminal (`cd ui && npm run dev`) and open the **Run History** tab.
+> The `make pg-up` container runs with `--rm` and no volume, so its data is ephemeral: `make pg-down` (or a reboot) resets your run history. Point `AGENTEVALS_DATABASE_URL` at a persistent Postgres if you want runs to survive across sessions.
+### Configuration reference
+| Variable | Purpose |
+|----------|---------|
+| `AGENTEVALS_STORAGE_BACKEND` | `postgres` to enable durable storage; anything else (default) keeps the in-memory backend |
+| `AGENTEVALS_DATABASE_URL` | Postgres DSN, e.g. `postgresql://user:pass@host:5432/dbname` |
+| `AGENTEVALS_DATABASE_URL_FILE` | Path to a file containing the DSN (preferred over the inline variable; useful for mounted secrets) |
+| `AGENTEVALS_DATABASE_SCHEMA` | Schema name to use (default `agentevals`) |
+On startup with `storage.backend=postgres` the app applies any pending migrations (advisory-lock protected, safe across replicas). For deployment via Helm, see the [Postgres backend section of the README](../README.md#postgres-backend-apiruns).
+## How runs get persisted
+A run is created once per evaluation, best effort: if persistence fails the evaluation result is still returned to the caller. Both evaluation paths persist:
+- **Uploaded traces** (`POST /api/evaluate`): the run aggregates every uploaded trace as one evaluation.
+- **Live sessions** (streaming dev server): scoring sessions from the UI persists one run per "Evaluate" click, aggregating the sessions it scored.
+Each run stores a pre-aggregated `summary` plus one `result` row per (eval case, evaluator):
+```jsonc
+// run.summary
+{
+  "trace_count": 8,
+  "result_counts": { "passed": 6, "failed": 2, "errored": 0, "skipped": 0 },
+  "per_metric": {
+    "tool_trajectory_avg_score": { "passed": 7, "failed": 1, "errored": 0, "skipped": 0, "avg_score": 0.94 }
+  },
+  "agents": ["langchain-agent", "openai-agents-agent"],
+  "performance_metrics": { "models": ["gpt-4o"], /* tokens, latency, counts */ },
+  "errors": []
+}
+```
+## Exploring runs in the UI
+Open **Run History** from the sidebar. It reads from `GET /api/runs`, so it shows the same friendly notice if durable storage is not configured.
+- **Trends.** A pass-rate line and a per-metric average-score line plot across runs over time, so regressions and improvements are visible at a glance.
+- **Group by.** Toggle between grouping by **eval set** or by **agent**, then pick a specific group to isolate its runs and trends. The pass-rate chart draws one line per agent.
+- **History table.** Every run with its status, eval set, agent, trace count, pass/fail counts, pass-rate bar, duration, and models. Click a row to open the run detail.
+- **Run detail.** For a single run: the evaluator configuration (metrics, thresholds, judge model), the golden eval set it was scored against, and per eval case results. Tool-trajectory results expand to an expected vs actual diff per invocation, showing exactly where the run diverged from the reference.
+### What is and is not persisted
+Run detail is an *evaluation record*, not a full trace record. It faithfully shows the expected behavior, each metric's pass or fail, and (for trajectory metrics) where the actual tool calls diverged. It does not retain the raw trace spans or timeline, and text-similarity metrics keep only their score, not the actual response text. To replay a full trace, use the live inspector at evaluation time.
+## Agent identity and grouping
+Runs group by **agent** using the OpenTelemetry `service.name` resource attribute, the cross-framework identifier for a service. Set it on your agent with the standard `OTEL_SERVICE_NAME` environment variable:
+```bash
+OTEL_SERVICE_NAME=my-agent python my_agent.py
+```
+The zero-code examples set this for you (for example `service.name=langchain-agent`). When `service.name` is absent, agentevals falls back to the framework agent name (`gen_ai.agent.name`); it never falls back to a model or span operation name, so a group is always a real agent identity.
+## Golden reference handling
+When you score other agents against a golden session, the golden defines the eval set and therefore matches itself trivially. To keep scoring meaningful, the golden is excluded from pass or fail counts, the agent list, and the results table, but its latency and token usage are still plotted in the performance charts (labeled as the reference) so you can compare the scored agents against the baseline.
+## HTTP API
+All endpoints return `503` (with a hint pointing at `AGENTEVALS_STORAGE_BACKEND=postgres`) when durable storage is not configured.
+| Method + path | Description |
+|---------------|-------------|
+| `GET /api/runs` | List runs, newest first. Filter with `status`, `limit` (1-1000), and `before` (a `created_at` cursor for pagination) |
+| `GET /api/runs/{run_id}` | Fetch a single run (spec + summary) |
+| `GET /api/runs/{run_id}/results` | List the per (eval case, evaluator) result rows for a run |
+| `POST /api/runs` | Submit a run for asynchronous execution by the in-process worker; idempotent on `run_id` |
+| `POST /api/runs/{run_id}/cancel` | Request cancellation of a queued or running run (idempotent) |
+Interactive API docs are available at `/docs` (Swagger) and `/redoc` while the server is running.
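
The `limit` and `before` parameters described for `GET /api/runs` in the added `docs/run-history.md` suggest cursor-style pagination. The following is a minimal Python sketch of a client loop, not code from the package. It assumes the endpoint returns a bare JSON list of run objects that each carry a `created_at` field; the guide does not show the response shape, so the unwrapping may need adjusting. The names `iter_runs` and `http_get_json` are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator, Optional


def http_get_json(url: str):
    """Fetch a URL and decode its JSON body using only the standard library."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def iter_runs(
    base_url: str,
    fetch: Callable[[str], list] = http_get_json,
    page_size: int = 100,
    status: Optional[str] = None,
) -> Iterator[dict]:
    """Yield runs newest first, following the `before` cursor page by page.

    ASSUMPTION: each page is a JSON list of runs with a `created_at` field;
    the real response shape is not shown in the guide.
    """
    before = None
    while True:
        params = {"limit": page_size}
        if status:
            params["status"] = status
        if before:
            params["before"] = before
        page = fetch(f"{base_url}/api/runs?{urllib.parse.urlencode(params)}")
        if not page:
            return
        yield from page
        if len(page) < page_size:
            return  # a short page means there is nothing older to fetch
        # The oldest run on this page becomes the cursor for the next request.
        before = page[-1]["created_at"]
```

Passing `fetch` as a parameter keeps the loop testable without a running server. Against the bundled UI's address from the README it could be called as `iter_runs("http://localhost:8001")`; without the Postgres backend the request would fail with the documented `503`.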

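The `run.summary` document in the added guide carries enough information to derive a pass rate like the one the Run History view plots. A small sketch follows, under the assumption that pass rate is `passed / (passed + failed)` with errored and skipped results left out of the denominator; the guide does not state the exact formula, and `pass_rate` is a hypothetical helper.

```python
from typing import Optional


def pass_rate(summary: dict) -> Optional[float]:
    """Return passed / (passed + failed), or None when nothing was scored.

    ASSUMPTION: errored and skipped results are excluded from the
    denominator; the actual UI calculation may differ.
    """
    counts = summary.get("result_counts", {})
    passed = counts.get("passed", 0)
    failed = counts.get("failed", 0)
    scored = passed + failed
    return passed / scored if scored else None


# The result counts from the summary example in docs/run-history.md.
example_summary = {
    "trace_count": 8,
    "result_counts": {"passed": 6, "failed": 2, "errored": 0, "skipped": 0},
}
print(pass_rate(example_summary))  # → 0.75
```
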
{agentevals_cli-0.9.4 → agentevals_cli-0.9.6}/examples/dice_agent/agent.py RENAMED

@@ -56,7 +56,7 @@ def check_prime(nums: list[int]) -> dict:
 dice_agent = Agent(
     name="dice_agent",
     # model="gemini-2.5-flash",
-    model="gemini-2.5-flash-lite",
+    model="gemini-3-flash-preview",
     instruction="""You are a helpful assistant that can roll dice and check if numbers are prime.
 When a user asks you to roll a die, use the roll_die tool with the appropriate number of sides.

{agentevals_cli-0.9.4 → agentevals_cli-0.9.6}/examples/zero-code-examples/adk/run.py RENAMED

@@ -44,6 +44,7 @@ async def main():
     endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
     print(f"OTLP endpoint: {endpoint}")
+    os.environ.setdefault("OTEL_SERVICE_NAME", "adk-agent")
     os.environ.setdefault(
         "OTEL_RESOURCE_ATTRIBUTES",
         "agentevals.eval_set_id=dice_agent_eval,agentevals.session_name=adk-zero-code",

{agentevals_cli-0.9.4 → agentevals_cli-0.9.6}/examples/zero-code-examples/langchain/run.py RENAMED

@@ -48,6 +48,7 @@ def main():
     os.environ["OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"] = "true"
+    os.environ.setdefault("OTEL_SERVICE_NAME", "langchain-agent")
     os.environ.setdefault(
         "OTEL_RESOURCE_ATTRIBUTES",
         "agentevals.eval_set_id=langchain_agent_eval,agentevals.session_name=langchain-zero-code",

{agentevals_cli-0.9.4 → agentevals_cli-0.9.6}/examples/zero-code-examples/ollama/run.py RENAMED

@@ -112,6 +112,7 @@ def main():
     print(f"Local model: {model}")
     os.environ["OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT"] = "true"
+    os.environ.setdefault("OTEL_SERVICE_NAME", "ollama-agent")
     os.environ.setdefault(
         "OTEL_RESOURCE_ATTRIBUTES",
         "agentevals.eval_set_id=langchain_local_ollama_openai_eval,agentevals.session_name=langchain-ollama-openai-zero-code",

{agentevals_cli-0.9.4 → agentevals_cli-0.9.6}/examples/zero-code-examples/openai-agents/run.py RENAMED

@@ -58,6 +58,7 @@ def main():
     os.environ.setdefault("OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT", "span_and_event")
     os.environ.setdefault("OTEL_SEMCONV_STABILITY_OPT_IN", "gen_ai_latest_experimental")
+    os.environ.setdefault("OTEL_SERVICE_NAME", "openai-agents-agent")
     os.environ.setdefault(
         "OTEL_RESOURCE_ATTRIBUTES",
         "agentevals.eval_set_id=openai_agents_eval,agentevals.session_name=openai-agents-zero-code",

{agentevals_cli-0.9.4 → agentevals_cli-0.9.6}/examples/zero-code-examples/pydantic-ai/run.py RENAMED

@@ -54,6 +54,7 @@ def main():
     endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
     print(f"OTLP endpoint: {endpoint}")
+    os.environ.setdefault("OTEL_SERVICE_NAME", "pydantic-ai-agent")
     os.environ.setdefault(
         "OTEL_RESOURCE_ATTRIBUTES",
         "agentevals.eval_set_id=pydantic_ai_eval,agentevals.session_name=pydantic-ai-zero-code",
@@ -72,6 +73,7 @@ def main():
     agent = Agent(
         "openai:gpt-4o-mini",
+        # "openai:gpt-5.4-mini-2026-03-17",
         instructions="You are a helpful assistant. You can roll dice and check if numbers are prime.",
     )
     agent.tool_plain(roll_die)

{agentevals_cli-0.9.4 → agentevals_cli-0.9.6}/examples/zero-code-examples/strands/run.py RENAMED

@@ -40,6 +40,7 @@ def main():
     endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
     print(f"OTLP endpoint: {endpoint}")
+    os.environ.setdefault("OTEL_SERVICE_NAME", "strands-agent")
     os.environ.setdefault(
         "OTEL_RESOURCE_ATTRIBUTES",
         "agentevals.eval_set_id=strands_agent_eval,agentevals.session_name=strands-zero-code",
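
Each zero-code example in this diff sets `OTEL_SERVICE_NAME` through `os.environ.setdefault`, so a value already exported by the caller takes precedence over the example's default. The sketch below illustrates that precedence, together with the grouping fallback described in `docs/run-history.md` (`service.name` first, then `gen_ai.agent.name`). Both helper names are hypothetical, and treating `gen_ai.agent.name` as a span attribute is an assumption; this is not the package's resolution code.

```python
import os
from typing import Mapping, MutableMapping, Optional


def apply_default_service_name(
    default: str, env: Optional[MutableMapping[str, str]] = None
) -> str:
    """Set OTEL_SERVICE_NAME only if it is unset, then return the effective value."""
    env = os.environ if env is None else env
    # setdefault leaves an existing value untouched and returns whichever value wins.
    return env.setdefault("OTEL_SERVICE_NAME", default)


def agent_identity(
    resource_attrs: Mapping[str, str], span_attrs: Mapping[str, str]
) -> Optional[str]:
    """Prefer the service.name resource attribute, then gen_ai.agent.name.

    ASSUMPTION: gen_ai.agent.name is read from span attributes. No further
    fallback (model or span operation name) is attempted, per the guide.
    """
    return resource_attrs.get("service.name") or span_attrs.get("gen_ai.agent.name")


# A caller-provided name survives; the example default only fills a gap.
print(apply_default_service_name("strands-agent", {"OTEL_SERVICE_NAME": "my-agent"}))
# → my-agent
print(apply_default_service_name("strands-agent", {}))
# → strands-agent
```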
