PyPI - agentevals-cli - Versions diffs - 0.8.4__tar.gz → 0.9.1__tar.gz - Mend

agentevals-cli 0.8.4__tar.gz → 0.9.1__tar.gz


Files changed (278)

{agentevals_cli-0.8.4 → agentevals_cli-0.9.1}/DEVELOPMENT.md RENAMED

@@ -50,7 +50,7 @@ Once running, submit a run with:
 ```bash
 curl -X POST http://localhost:8001/api/runs \
     -H 'content-type: application/json' \
-    -d '{"spec": {"approach": "trace_replay", "target": {"kind": "inline", "inline": {...}}, "evalConfig": {"metrics": ["tool_trajectory_avg_score"]}}}'
+    -d '{"spec": {"approach": "trace_replay", "target": {"kind": "inline", "inline": {...}}, "evalConfig": {"evaluators": [{"name": "tool_trajectory_avg_score", "type": "builtin"}]}}}'
 ```
 Then poll `GET /api/runs/{runId}` and `GET /api/runs/{runId}/results`. Without `storage.backend=postgres`, the `/api/runs` endpoints return 503 with a hint pointing at the env var.
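
The payload change in this hunk replaces the flat `metrics` list under `evalConfig` with a list of `evaluators` objects. A minimal sketch of that conversion follows, assuming every legacy metric name maps to an evaluator of type `builtin` (the only mapping the diff shows). `migrate_eval_config` is a hypothetical helper for illustration, not part of agentevals.

```python
import copy
import json


def migrate_eval_config(eval_config: dict) -> dict:
    """Convert a 0.8.x-style evalConfig to the 0.9.x shape shown in the diff.

    Assumption: each legacy metric name becomes a {"name", "type": "builtin"}
    entry. The input dict is not modified.
    """
    migrated = copy.deepcopy(eval_config)
    metrics = migrated.pop("metrics", None)
    if metrics is None:
        return migrated  # nothing to convert

    evaluators = migrated.setdefault("evaluators", [])
    known = {entry["name"] for entry in evaluators}
    for name in metrics:
        if name not in known:  # skip names already listed as evaluators
            evaluators.append({"name": name, "type": "builtin"})
            known.add(name)
    return migrated


old = {"metrics": ["tool_trajectory_avg_score"]}
new = migrate_eval_config(old)
print(json.dumps(new))
```

The helper copies its input so an existing 0.8.x config can be kept around for comparison while the request body is updated.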

{agentevals_cli-0.8.4 → agentevals_cli-0.9.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: agentevals-cli
-Version: 0.8.4
+Version: 0.9.1
 Summary: Standalone framework to evaluate agent correctness based on portable OpenTelemetry traces
 License-File: LICENSE
 Requires-Python: >=3.11
@@ -278,7 +278,7 @@ See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protoc
 agentevals serve            # bundled UI on http://localhost:8001
 ```
-Upload traces and eval sets, select metrics, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
+Upload traces and eval sets, select evaluators, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
 Interactive API docs are available at `/docs` (Swagger) and `/redoc` while the server is running. The OTLP receiver on port 4318 serves its own docs at `http://localhost:4318/docs`.

{agentevals_cli-0.8.4 → agentevals_cli-0.9.1}/README.md RENAMED

@@ -250,7 +250,7 @@ See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protoc
 agentevals serve            # bundled UI on http://localhost:8001
 ```
-Upload traces and eval sets, select metrics, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
+Upload traces and eval sets, select evaluators, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
 Interactive API docs are available at `/docs` (Swagger) and `/redoc` while the server is running. The OTLP receiver on port 4318 serves its own docs at `http://localhost:4318/docs`.

{agentevals_cli-0.8.4 → agentevals_cli-0.9.1}/docs/custom-evaluators.md RENAMED

@@ -317,6 +317,26 @@ The `grader.evaluation_metric` field selects the similarity algorithm:
 | `rouge_1` through `rouge_5` | Unigram through 5-gram overlap (F-measure) |
 | `rouge_l` | Longest common subsequence overlap (F-measure) |
+### Label Model Grader
+Scores responses without a golden set. The model reads each response and assigns a label from a fixed list. Passing labels are defined in the config.
+```yaml
+evaluators:
+  - name: quality_check
+    type: openai_eval
+    grader:
+      type: label_model
+      model: gpt-4o-mini
+      input:
+        - role: user
+          content: "Rate this response: {{ item.actual_response }}"
+      labels: [good, bad]
+      passing_labels: [good]
+```
+The `threshold` field is not used for `label_model`. A response passes if its assigned label is in `passing_labels`.
 ### How it works
 Under the hood, agentevals creates an ephemeral eval on OpenAI, submits the actual and expected responses as JSONL items, polls for results, and cleans up. The agent's response and the golden reference are both placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
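
The pass rule stated for the `label_model` grader in this hunk (a response passes if its assigned label is in `passing_labels`, and `threshold` is ignored) can be sketched as below. The function name `label_passes` and the two validation checks are assumptions added for illustration; the diff only states the membership rule.

```python
def label_passes(label: str, labels: list[str], passing_labels: list[str]) -> bool:
    """Return True when the assigned label counts as a pass.

    Assumptions (not confirmed by the source): passing_labels should be a
    subset of labels, and the assigned label should come from labels.
    """
    if not set(passing_labels) <= set(labels):
        raise ValueError("passing_labels must be a subset of labels")
    if label not in labels:
        raise ValueError(f"unexpected label: {label!r}")
    # The documented rule: pass if the label is one of the passing labels.
    return label in passing_labels


# Values taken from the example config in the diff.
labels = ["good", "bad"]
passing = ["good"]
print(label_passes("good", labels, passing))
print(label_passes("bad", labels, passing))
```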

{agentevals_cli-0.8.4 → agentevals_cli-0.9.1}/docs/eval-set-format.md RENAMED

@@ -1,6 +1,6 @@
 # Eval Set Format
-An eval set is a JSON file containing golden reference data that metrics compare agent traces against. It follows the [Google ADK `EvalSet`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py) schema, which means eval sets are portable between agentevals and ADK tooling.
+An eval set is a JSON file containing golden reference data that evaluators compare agent traces against. It follows the [Google ADK `EvalSet`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py) schema, which means eval sets are portable between agentevals and ADK tooling.
 Most users will not need to author eval sets by hand. The web UI can generate them from live sessions (mark a session as golden, and the server builds the eval set automatically). This document is for users who want to create or edit eval sets directly, whether for CLI usage, CI pipelines, or version-controlled test suites.
@@ -203,9 +203,9 @@ The `parts` array can contain text, function calls, or function responses. Most
 Each `FunctionCall` has `name`, `args`, and `id`. Each `FunctionResponse` has `name`, `response`, and `id`. Match `id` values between calls and responses to pair them.
-## Which Metrics Use Eval Sets
+## Which Evaluators Use Eval Sets
-Not all metrics require an eval set. Use `agentevals list-metrics` to see which do:
+Not all evaluators require an eval set. Use `agentevals evaluator list --source builtin` to see which built-in evaluators do:
 | Metric | Needs Eval Set | What It Reads |
 |---|---|---|

{agentevals_cli-0.8.4 → agentevals_cli-0.9.1}/examples/custom_evaluators/eval_config.yaml RENAMED

@@ -32,3 +32,4 @@ evaluators:
     ref: evaluators/random_evaluator/random_evaluator.py
     threshold: 0.110
     executor: local

agentevals_cli-0.9.1/examples/custom_evaluators/eval_config_openai_eval.yaml ADDED

@@ -0,0 +1,18 @@
+# Eval config using OpenAI Evals API graders.
+# Requires OPENAI_API_KEY to be set.
+#
+# Run with:
+#   agentevals run samples/helm.json \
+#     --config examples/custom_evaluators/eval_config_openai_eval.yaml
+evaluators:
+  - name: quality_check
+    type: openai_eval
+    grader:
+      type: label_model
+      model: gpt-4o-mini
+      input:
+        - role: user
+          content: "Rate this response: {{ item.actual_response }}"
+      labels: [good, bad]
+      passing_labels: [good]

{agentevals_cli-0.8.4 → agentevals_cli-0.9.1}/examples/dice_agent/README.md RENAMED

@@ -149,7 +149,7 @@ Update `main.py` to test the new functionality.
 **After agent completes:**
 - Status changes to "EVALUATED"
 - Evaluation results appear as colored badges
-- Each metric shows: name and score (e.g., "tool_trajectory_avg_score: 1.00")
+- Each evaluator result shows: name and score (e.g., "tool_trajectory_avg_score: 1.00")
 **Multiple runs:**
 - Each run creates a new session with model name in ID

{agentevals_cli-0.8.4 → agentevals_cli-0.9.1}/examples/kubernetes/README.md RENAMED

@@ -221,7 +221,7 @@ This captures the GPT-5 session's tool trajectory and final responses as the gol
 2. Select both sessions (the `gpt-4.1-mini` session and the `gpt-5` session)
 3. Click **Evaluate**
 4. Select the `helm-agent-comparison` eval set
-5. Choose the metrics:
+5. Choose the evaluators:
    - **tool_trajectory_avg_score**: Did the agent call the correct tools in the correct order?
    - **response_match_score**: Did the agent produce responses consistent with the golden reference?
 6. Run the evaluation
@@ -241,7 +241,7 @@ Compare the two sessions in the results table:
 <img width="1914" height="1154" alt="image" src="https://github.com/user-attachments/assets/5939a8d4-3775-4cf1-9cf2-d3b6b4afd582" />
-You can also click an individual conversation and see a breakdown of each evaluators.
+You can also click an individual conversation and see a breakdown of each evaluator.
 <img width="1916" height="1348" alt="image" src="https://github.com/user-attachments/assets/984b3d29-8018-4fcb-9036-bb7c6e97d9ff" />

{agentevals_cli-0.8.4 → agentevals_cli-0.9.1}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "agentevals-cli"
-version = "0.8.4"
+version = "0.9.1"
 description = "Standalone framework to evaluate agent correctness based on portable OpenTelemetry traces"
 readme = "README.md"
 requires-python = ">=3.11"
