PyPI - agentops-accelerator - Versions diffs - 0.3.5__tar.gz → 0.3.7__tar.gz - Mend

agentops-accelerator 0.3.5tar.gz → 0.3.7tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (299)

{agentops_accelerator-0.3.5 → agentops_accelerator-0.3.7}/.claude-plugin/marketplace.json RENAMED

@@ -13,7 +13,7 @@
       "name": "agentops-accelerator",
       "source": "../../plugins/agentops",
       "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Toolkit and Microsoft Foundry agents.",
-      "version": "0.3.5",
+      "version": "0.3.7",
       "keywords": [
         "agentops",
         "evaluation",

{agentops_accelerator-0.3.5 → agentops_accelerator-0.3.7}/.github/plugin/marketplace.json RENAMED

@@ -13,7 +13,7 @@
       "name": "agentops-accelerator",
       "source": "../../plugins/agentops",
       "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Toolkit and Microsoft Foundry agents.",
-      "version": "0.3.5",
+      "version": "0.3.7",
       "keywords": [
         "agentops",
         "evaluation",

{agentops_accelerator-0.3.5 → agentops_accelerator-0.3.7}/CHANGELOG.md RENAMED

@@ -5,6 +5,44 @@ This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres
 ## [Unreleased]
+## [0.3.7] - 2026-06-01
+### Fixed
+- **RBAC preflight now covers Foundry/Azure AI managed identities, not only
+  the signed-in user.** Cloud evaluations run server-side and some agent or
+  grader calls authenticate as the managed identities on the backing AI
+  Services account and child Foundry project. Granting `Cognitive Services
+  OpenAI User` only to the user still allowed intermittent grader
+  `AuthenticationError` failures and the v0.3.6 execution warning. The
+  prompt-agent, hosted-agent, and end-to-end tutorials plus the
+  `agentops-eval` skill now assign the same data-plane role to every managed
+  identity in the Foundry resource group, preventing the warning/failure path
+  before `agentops eval run`.
+## [0.3.6] - 2026-06-01
+### Changed
+- **`agentops eval run` now distinguishes a grader *execution* failure from a
+  quality-gate failure.** When evaluator workers error out on a subset of rows
+  (auth/RBAC/timeout), no row has every grader return a score, so
+  `items_passed_all` is `0` and the run reports `Threshold status: FAILED` even
+  though every threshold that *could* be computed passed. The CLI now detects
+  this case (errored graders combined with all thresholds passing) and prints a
+  `Warning` explaining that this is an execution error, not a quality
+  regression, names the most common cause (data-plane RBAC granted moments
+  earlier that is still propagating to the evaluator workers), surfaces the
+  first underlying grader error, and advises waiting a few minutes before
+  re-running. The exit-code contract is unchanged. Added the
+  `_grader_error_summary` helper plus focused unit tests.
+- **Corrected the RBAC propagation guidance in the tutorials and the
+  `agentops-eval` skill.** Data-plane role assignments on Cognitive Services
+  accounts can take several minutes (not 30-120 seconds) to reach the
+  independent, per-row evaluator workers, which can produce an *intermittent*
+  `FAILED` with otherwise-green thresholds on the first run after granting
+  access. The prompt-agent, hosted-agent, and end-to-end tutorials and the
+  skill now describe this symptom and tell readers to wait and re-run rather
+  than lower thresholds.
 ## [0.3.5] - 2026-06-01
 ### Changed
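
The 0.3.6 entry above explains that a run can report `items_passed_all` as `0` while every computable threshold passes. A minimal sketch of that arithmetic follows. It assumes each threshold is checked against the mean of the scores that were actually returned, which this diff does not show. The function names and the sample rows are hypothetical and are not taken from the package.

```python
from statistics import mean
from typing import Optional

# Hypothetical rows: each maps a grader name to a score, or None when the
# grader errored at execution time (for example on AuthenticationError).
Row = dict[str, Optional[float]]


def items_passed_all(rows: list[Row], minimums: dict[str, float]) -> int:
    """Count rows where every grader returned a score that meets its minimum."""
    return sum(
        all(
            row.get(name) is not None and row[name] >= floor
            for name, floor in minimums.items()
        )
        for row in rows
    )


def thresholds_passed(rows: list[Row], minimums: dict[str, float]) -> dict[str, bool]:
    """Check each threshold against the mean of the scores that were computed.

    Assumption: errored graders are left out of the mean instead of counting
    as zero.
    """
    outcome: dict[str, bool] = {}
    for name, floor in minimums.items():
        scores = [row[name] for row in rows if row.get(name) is not None]
        if scores:  # a metric with no scores cannot be evaluated at all
            outcome[name] = mean(scores) >= floor
    return outcome


rows: list[Row] = [
    {"coherence": 5.0, "similarity": None},  # similarity grader errored
    {"coherence": None, "similarity": 4.0},  # coherence grader errored
]
minimums = {"coherence": 3.0, "similarity": 3.0}

print(items_passed_all(rows, minimums))   # 0
print(thresholds_passed(rows, minimums))  # {'coherence': True, 'similarity': True}
```

Under these assumptions no row is complete, so the row-level count is zero even though both thresholds hold. That is the combination the changelog entry says the CLI now reports as an execution error.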

{agentops_accelerator-0.3.5/src/agentops_accelerator.egg-info → agentops_accelerator-0.3.7}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: agentops-accelerator
-Version: 0.3.5
+Version: 0.3.7
 Summary: Release readiness gates and evidence for Microsoft Foundry agents
 License: MIT License

{agentops_accelerator-0.3.5 → agentops_accelerator-0.3.7}/docs/tutorial-end-to-end.md RENAMED

@@ -286,7 +286,7 @@ for creating agents, tools, tracing, evaluation, and red-team scans:
 https://github.com/Azure-Samples/microsoft-foundry-e2e-agent-observability-workshop/tree/2026-04-aie-europe
 ```
-### Grant your identity data-plane access to the AI Services account
+### Grant data-plane access to your identity and Foundry managed identities
 Both options above (prompt agent and hosted HTTP agent) eventually drive
 an `agentops eval run` that calls chat-completions on the AI Services
@@ -300,19 +300,43 @@ what causes the eval to fail later with `PermissionDenied` on
 `Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/
 completions/action`.
-Run the assignment once per resource group that hosts a Foundry account
-you will evaluate against. Replace `<your-objectId>`,
-`<subscription-id>`, and `<resource-group>` with your own values (use
-`az ad signed-in-user show --query id -o tsv` to get the object ID):
+Run these assignments once per resource group that hosts a Foundry account
+you will evaluate against. Cloud evaluations run server-side and some agent
+or grader calls may authenticate as Foundry/Azure AI managed identities, not
+only as your signed-in user. Assigning the role only to your user can still
+leave graders failing with `AuthenticationError`.
 ```powershell
+$subscriptionId = az account show --query id -o tsv
+$resourceGroup = "<resource-group>"
+$scope = "/subscriptions/$subscriptionId/resourceGroups/$resourceGroup"
+$userObjectId = az ad signed-in-user show --query id -o tsv
 az role assignment create `
-  --assignee <your-objectId> `
+  --assignee $userObjectId `
   --role "Cognitive Services OpenAI User" `
-  --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>
-```
-Propagation usually completes within 30–120 seconds.
+  --scope $scope
+az resource list -g $resourceGroup `
+  --query "[?identity.principalId!=null].identity.principalId" -o tsv |
+  ForEach-Object {
+    az role assignment create `
+      --assignee-object-id $_ `
+      --assignee-principal-type ServicePrincipal `
+      --role "Cognitive Services OpenAI User" `
+      --scope $scope
+  }
+```
+> **Give the assignment a few minutes to propagate.** Data-plane role
+> assignments on the AI Services account do **not** take effect
+> instantly — propagation to the evaluator workers can take several
+> minutes (occasionally up to ~15). Evaluators authenticate per call, so
+> the **first eval right after granting the role may show intermittent
+> `AuthenticationError` on a subset of graders and report
+> `Threshold status: FAILED` even when every threshold is green**. This
+> is a grader execution failure, not a quality regression — wait a few
+> minutes and re-run the eval.
 ## 2. Create the travel eval dataset
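
The propagation note in the hunk above tells readers to wait a few minutes and re-run the eval after a grader execution failure. One way to script that is sketched below. It is not part of the package: `rerun_until_graders_execute`, its `run_eval` callable, and the default wait and attempt values are all hypothetical. `run_eval` is assumed to perform one eval run and return `True` when any grader errored at execution time.

```python
import time
from typing import Callable


def rerun_until_graders_execute(
    run_eval: Callable[[], bool],
    wait_seconds: float = 180.0,
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Re-run an eval while graders keep hitting execution errors.

    Returns True once a run completes with no grader execution errors, and
    False if every attempt still had errored graders.
    """
    for attempt in range(1, max_attempts + 1):
        if not run_eval():
            return True
        if attempt < max_attempts:
            sleep(wait_seconds)  # give the role assignment time to propagate
    return False


# Simulated runs: two attempts with errored graders, then a clean one.
outcomes = iter([True, True, False])
waits: list[float] = []
ok = rerun_until_graders_execute(lambda: next(outcomes), sleep=waits.append)
print(ok, waits)  # True [180.0, 180.0]
```

The loop only retries on execution errors, so a genuine quality-gate failure is never retried or masked. The attempt cap stops it from waiting indefinitely when the role was never granted.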

{agentops_accelerator-0.3.5 → agentops_accelerator-0.3.7}/docs/tutorial-hosted-agent-quickstart.md RENAMED

@@ -310,7 +310,7 @@ If the deployed endpoint needs a bearer token:
 $env:HOSTED_AGENT_TOKEN = "<token>"
 ```
-### Grant your identity data-plane access to the AI Services account
+### Grant data-plane access to your identity and Foundry managed identities
 The local AI-assisted evaluators that AgentOps runs in step 8 call
 chat-completions on the AI Services account that backs your Foundry
@@ -322,19 +322,43 @@ but `dataActions: []`. Skipping this once causes the eval to fail with
 `PermissionDenied` on `Microsoft.CognitiveServices/accounts/OpenAI/
 deployments/chat/completions/action`.
-Run the assignment once per resource group hosting a Foundry account
-you will evaluate against (replace `<your-objectId>`,
-`<subscription-id>`, and `<resource-group>` with your values; get the
-object ID with `az ad signed-in-user show --query id -o tsv`):
+Run these assignments once per resource group hosting a Foundry account
+you will evaluate against. Local AI-assisted evaluators use your identity,
+while Foundry-hosted/server-side eval paths may use Azure AI managed
+identities from the same resource group. Assigning only the user can still
+leave server-side graders failing with `AuthenticationError`.
 ```powershell
+$subscriptionId = az account show --query id -o tsv
+$resourceGroup = "<resource-group>"
+$scope = "/subscriptions/$subscriptionId/resourceGroups/$resourceGroup"
+$userObjectId = az ad signed-in-user show --query id -o tsv
 az role assignment create `
-  --assignee <your-objectId> `
+  --assignee $userObjectId `
   --role "Cognitive Services OpenAI User" `
-  --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>
-```
-Propagation usually completes within 30–120 seconds.
+  --scope $scope
+az resource list -g $resourceGroup `
+  --query "[?identity.principalId!=null].identity.principalId" -o tsv |
+  ForEach-Object {
+    az role assignment create `
+      --assignee-object-id $_ `
+      --assignee-principal-type ServicePrincipal `
+      --role "Cognitive Services OpenAI User" `
+      --scope $scope
+  }
+```
+> **Give the assignment a few minutes to propagate.** Data-plane role
+> assignments on the AI Services account do **not** take effect
+> instantly — propagation to the local/Foundry evaluator workers can
+> take several minutes (occasionally up to ~15). Evaluators authenticate
+> per call, so the **first eval right after granting the role may show
+> intermittent `AuthenticationError` on a subset of graders and report
+> `Threshold status: FAILED` even when every threshold is green**. This
+> is a grader execution failure, not a quality regression — wait a few
+> minutes and re-run the eval.
 ## 5. Initialize AgentOps interactively

{agentops_accelerator-0.3.5 → agentops_accelerator-0.3.7}/docs/tutorial-prompt-agent-quickstart.md RENAMED

@@ -241,7 +241,7 @@ Show me the planned changes and the resulting endpoints before applying.
 If the skill is not available, use Path A.
-### Grant your identity data-plane access to the AI Services account
+### Grant data-plane access to your identity and Foundry managed identities
 Creating a project through the portal only assigns you `Foundry User` **at
 the project scope**. That role does not cover the OpenAI data-plane actions
@@ -257,23 +257,54 @@ Skipping this step is what causes the eval grader to fail later with::
     data action `Microsoft.CognitiveServices/accounts/OpenAI/deployments/
     chat/completions/action` to perform `POST /openai/deployments/...`
-Run the assignment once per resource group that hosts a Foundry account
-you will evaluate against. Replace `<your-objectId>`, `<subscription-id>`,
-and `<resource-group>` with your own values (you can get the object ID
-with `az ad signed-in-user show --query id -o tsv`):
+Run these assignments once per resource group that hosts a Foundry account
+you will evaluate against. Cloud evaluations run server-side: the agent call
+and graders may authenticate as Foundry/Azure AI managed identities, not only
+as your signed-in user. Assigning the role only to your user can still leave
+some graders failing with `AuthenticationError`.
 ```powershell
+$subscriptionId = az account show --query id -o tsv
+$resourceGroup = "<resource-group>"
+$scope = "/subscriptions/$subscriptionId/resourceGroups/$resourceGroup"
+$userObjectId = az ad signed-in-user show --query id -o tsv
+# User running local commands / creating cloud evals.
 az role assignment create `
-  --assignee <your-objectId> `
+  --assignee $userObjectId `
   --role "Cognitive Services OpenAI User" `
-  --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>
+  --scope $scope
+# Foundry/Azure AI managed identities used by server-side agent/evaluator calls.
+az resource list -g $resourceGroup `
+  --query "[?identity.principalId!=null].identity.principalId" -o tsv |
+  ForEach-Object {
+    az role assignment create `
+      --assignee-object-id $_ `
+      --assignee-principal-type ServicePrincipal `
+      --role "Cognitive Services OpenAI User" `
+      --scope $scope
+  }
 ```
 Repeat the command with the `travel-agent-dev` resource group if the dev
-project lives in a different RG. The assignment usually propagates within
-30–120 seconds. AgentOps Doctor will detect the missing assignment in a
-future release, but until then this is a manual one-time setup step per
-new environment.
+project lives in a different RG.
+> **Give the assignment a few minutes to propagate.** Data-plane role
+> assignments on the AI Services account do **not** take effect
+> instantly — propagation to the Foundry evaluator workers can take
+> several minutes (occasionally up to ~15). The cloud eval runs each
+> grader as an independent worker that authenticates separately, so the
+> **first run right after granting the role may show intermittent
+> `AuthenticationError` on a subset of graders and report
+> `Threshold status: FAILED` even when every threshold is green** (no
+> single row had all graders succeed). This is a grader execution
+> failure, not a quality regression. Wait a few minutes and re-run
+> `agentops eval run` — once propagation finishes, every grader scores
+> and the gate passes.
+AgentOps Doctor will detect the missing assignment in a future release,
+but until then this is a manual one-time setup step per new environment.
 ## 4. Seed `travel-agent` in the sandbox project
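
The PowerShell loop in the hunk above issues one `az role assignment create` call per managed-identity principal ID. The sketch below expresses the same fan-out in Python so the generated commands can be reviewed before anything runs. It only builds argument lists and never invokes `az`. The flags are copied from the tutorial commands above, while `role_assignment_commands`, the sample principal IDs, and the sample scope are hypothetical placeholders.

```python
def role_assignment_commands(principal_ids_tsv: str, scope: str) -> list[list[str]]:
    """Build one `az role assignment create` argv per principal ID.

    `principal_ids_tsv` is the newline-separated text that the tutorial's
    `az resource list ... -o tsv` query prints. Blank lines are skipped.
    """
    commands: list[list[str]] = []
    for line in principal_ids_tsv.splitlines():
        principal_id = line.strip()
        if not principal_id:
            continue
        commands.append([
            "az", "role", "assignment", "create",
            "--assignee-object-id", principal_id,
            "--assignee-principal-type", "ServicePrincipal",
            "--role", "Cognitive Services OpenAI User",
            "--scope", scope,
        ])
    return commands


# Placeholder values for illustration only.
sample_tsv = (
    "11111111-1111-1111-1111-111111111111\n"
    "\n"
    "22222222-2222-2222-2222-222222222222\n"
)
sample_scope = (
    "/subscriptions/00000000-0000-0000-0000-000000000000"
    "/resourceGroups/example-rg"
)

for argv in role_assignment_commands(sample_tsv, sample_scope):
    print(" ".join(argv[:6]), "...")
```

Keeping command construction separate from execution makes a dry run straightforward: print the lists first, then pass each one to `subprocess.run` once the scope and principals look right.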

{agentops_accelerator-0.3.5 → agentops_accelerator-0.3.7}/plugins/agentops/package.json RENAMED

@@ -2,7 +2,7 @@
   "name": "agentops-accelerator",
   "displayName": "AgentOps Accelerator — Skills for GitHub Copilot",
   "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Accelerator and Microsoft Foundry agents.",
-  "version": "0.3.5",
+  "version": "0.3.7",
   "publisher": "AgentOpsAccelerator",
   "icon": "icon.png",
   "license": "MIT",

{agentops_accelerator-0.3.5 → agentops_accelerator-0.3.7}/plugins/agentops/plugin.json RENAMED

@@ -1,7 +1,7 @@
 {
   "name": "agentops-accelerator",
   "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Accelerator and Microsoft Foundry agents.",
-  "version": "0.3.5",
+  "version": "0.3.7",
   "author": {
     "name": "AgentOps Accelerator",
     "url": "https://github.com/Azure/agentops"

{agentops_accelerator-0.3.5/src/agentops/templates → agentops_accelerator-0.3.7/plugins/agentops}/skills/agentops-eval/SKILL.md RENAMED

@@ -41,8 +41,12 @@ PermissionDenied … lacks the required data action
 'Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action'
 ```
-Run this preflight before Step 1 - it is idempotent (Azure returns
-`RoleAssignmentExists` if already granted) and takes ~5 seconds:
+Run this preflight before Step 1. It must grant the role to the signed-in
+user **and** to the Foundry/Azure AI managed identities in the resource
+group. Cloud evaluations run server-side and some graders authenticate as
+those managed identities, so assigning only the user can still produce
+intermittent `AuthenticationError` grader failures. The commands are
+idempotent (`RoleAssignmentExists` means the role was already granted):
 ```bash
 # 1. Resolve the AI Services account from agentops.yaml / .azure/<env>/.env
@@ -55,11 +59,23 @@ SUB_ID=$(az account show --query id -o tsv)
 RG=$(az cognitiveservices account list --subscription "$SUB_ID" --query "[?name=='$ACCOUNT_NAME'].resourceGroup | [0]" -o tsv)
 OBJ_ID=$(az ad signed-in-user show --query id -o tsv)
-# 3. Grant data-plane access at the RG scope (covers sandbox + future evals)
+# 3. Grant the user data-plane access at RG scope.
 az role assignment create \
   --assignee "$OBJ_ID" \
   --role "Cognitive Services OpenAI User" \
   --scope "/subscriptions/$SUB_ID/resourceGroups/$RG"
+# 4. Grant the same data-plane role to Foundry/Azure AI managed identities.
+az resource list -g "$RG" \
+  --query "[?identity.principalId!=null].identity.principalId" -o tsv |
+while read -r PRINCIPAL_ID; do
+  [ -z "$PRINCIPAL_ID" ] && continue
+  az role assignment create \
+    --assignee-object-id "$PRINCIPAL_ID" \
+    --assignee-principal-type ServicePrincipal \
+    --role "Cognitive Services OpenAI User" \
+    --scope "/subscriptions/$SUB_ID/resourceGroups/$RG"
+done
 ```
 PowerShell equivalent: replace `$(...)` with the PowerShell variable
@@ -73,6 +89,17 @@ Skip this step only if the user explicitly says the role is already
 assigned, or if a previous `agentops eval run` succeeded against the
 same Foundry account.
+**Propagation:** data-plane role assignments do not take effect
+instantly — allow several minutes (occasionally up to ~15) before the
+first eval. The cloud/local graders authenticate per call, so if the
+user runs an eval immediately after this preflight and sees intermittent
+`AuthenticationError` on a subset of graders plus
+`Threshold status: FAILED` while the visible thresholds are green, that
+is propagation lag (a grader **execution** failure), not a quality
+regression. Tell the user to wait a few minutes and re-run
+`agentops eval run`; do not treat it as a failing gate or start changing
+thresholds.
 ## Step 1 - Analyze evaluation setup
 Run the deterministic local triage first:

{agentops_accelerator-0.3.5 → agentops_accelerator-0.3.7}/src/agentops/cli/app.py RENAMED

@@ -2055,10 +2055,57 @@ def _run_flat_schema_eval(
     if result.summary.overall_passed:
         typer.echo(f"{_cli_label('Threshold status')}: {style('PASSED', 'bold', 'green')}")
         return
+    # Distinguish a genuine quality-gate failure from grader *execution*
+    # errors. When evaluator workers error (auth/RBAC/timeout) on a subset of
+    # rows, no row has every grader succeed, so `items_passed_all` is 0 and the
+    # gate reports FAILED even though every threshold that *could* be computed
+    # passed. Surfacing this prevents users from chasing a phantom quality
+    # regression - the most common cause is data-plane RBAC granted moments
+    # earlier that is still propagating to the evaluator workers.
+    errored, total, first_error = _grader_error_summary(result)
+    all_thresholds_passed = (
+        result.summary.thresholds_total > 0
+        and result.summary.thresholds_passed == result.summary.thresholds_total
+    )
+    if errored and all_thresholds_passed:
+        typer.echo(
+            f"{_cli_warn('Warning')}: {errored} of {total} grader execution(s) "
+            "errored, so no dataset row had every grader return a score. This is "
+            "a grader execution failure, not a quality regression - every "
+            "threshold that could be computed passed. The most common cause is "
+            "data-plane RBAC granted recently that is still propagating to the "
+            "evaluator workers; wait a few minutes and re-run `agentops eval run`.",
+            err=True,
+        )
+        if first_error:
+            typer.echo(f"{_cli_warn('Warning')}: first grader error: {first_error}", err=True)
     typer.echo(f"{_cli_label('Threshold status')}: {style('FAILED', 'bold', 'red')}")
     raise typer.Exit(code=exit_code_from(result))
+def _grader_error_summary(result) -> tuple[int, int, Optional[str]]:
+    """Return ``(errored_metric_count, total_metric_count, first_error)``.
+    Walks every per-row metric in the run so the CLI can tell a grader
+    *execution* failure (auth/RBAC/timeout) apart from a quality-gate failure.
+    The first non-empty error string is lifted out as the actionable cause.
+    """
+    errored = 0
+    total = 0
+    first_error: Optional[str] = None
+    for row in result.rows:
+        for metric in row.metrics:
+            total += 1
+            err = getattr(metric, "error", None)
+            if isinstance(err, str) and err.strip():
+                errored += 1
+                if first_error is None:
+                    first_error = err.strip()
+    return errored, total, first_error
 def _default_flat_output_dir(config_path: Path) -> Path:
     base = config_path.parent / ".agentops" / "results"
     timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")

{agentops_accelerator-0.3.5/plugins/agentops → agentops_accelerator-0.3.7/src/agentops/templates}/skills/agentops-eval/SKILL.md RENAMED

@@ -41,8 +41,12 @@ PermissionDenied … lacks the required data action
 'Microsoft.CognitiveServices/accounts/OpenAI/deployments/chat/completions/action'
 ```
-Run this preflight before Step 1 - it is idempotent (Azure returns
-`RoleAssignmentExists` if already granted) and takes ~5 seconds:
+Run this preflight before Step 1. It must grant the role to the signed-in
+user **and** to the Foundry/Azure AI managed identities in the resource
+group. Cloud evaluations run server-side and some graders authenticate as
+those managed identities, so assigning only the user can still produce
+intermittent `AuthenticationError` grader failures. The commands are
+idempotent (`RoleAssignmentExists` means the role was already granted):
 ```bash
 # 1. Resolve the AI Services account from agentops.yaml / .azure/<env>/.env
@@ -55,11 +59,23 @@ SUB_ID=$(az account show --query id -o tsv)
 RG=$(az cognitiveservices account list --subscription "$SUB_ID" --query "[?name=='$ACCOUNT_NAME'].resourceGroup | [0]" -o tsv)
 OBJ_ID=$(az ad signed-in-user show --query id -o tsv)
-# 3. Grant data-plane access at the RG scope (covers sandbox + future evals)
+# 3. Grant the user data-plane access at RG scope.
 az role assignment create \
   --assignee "$OBJ_ID" \
   --role "Cognitive Services OpenAI User" \
   --scope "/subscriptions/$SUB_ID/resourceGroups/$RG"
+# 4. Grant the same data-plane role to Foundry/Azure AI managed identities.
+az resource list -g "$RG" \
+  --query "[?identity.principalId!=null].identity.principalId" -o tsv |
+while read -r PRINCIPAL_ID; do
+  [ -z "$PRINCIPAL_ID" ] && continue
+  az role assignment create \
+    --assignee-object-id "$PRINCIPAL_ID" \
+    --assignee-principal-type ServicePrincipal \
+    --role "Cognitive Services OpenAI User" \
+    --scope "/subscriptions/$SUB_ID/resourceGroups/$RG"
+done
 ```
 PowerShell equivalent: replace `$(...)` with the PowerShell variable
@@ -73,6 +89,17 @@ Skip this step only if the user explicitly says the role is already
 assigned, or if a previous `agentops eval run` succeeded against the
 same Foundry account.
+**Propagation:** data-plane role assignments do not take effect
+instantly — allow several minutes (occasionally up to ~15) before the
+first eval. The cloud/local graders authenticate per call, so if the
+user runs an eval immediately after this preflight and sees intermittent
+`AuthenticationError` on a subset of graders plus
+`Threshold status: FAILED` while the visible thresholds are green, that
+is propagation lag (a grader **execution** failure), not a quality
+regression. Tell the user to wait a few minutes and re-run
+`agentops eval run`; do not treat it as a failing gate or start changing
+thresholds.
 ## Step 1 - Analyze evaluation setup
 Run the deterministic local triage first:

{agentops_accelerator-0.3.5 → agentops_accelerator-0.3.7/src/agentops_accelerator.egg-info}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: agentops-accelerator
-Version: 0.3.5
+Version: 0.3.7
 Summary: Release readiness gates and evidence for Microsoft Foundry agents
 License: MIT License

{agentops_accelerator-0.3.5 → agentops_accelerator-0.3.7}/src/agentops_accelerator.egg-info/SOURCES.txt RENAMED

@@ -268,6 +268,7 @@ tests/unit/test_doctor_cli_explain.py
 tests/unit/test_dotenv_loader.py
 tests/unit/test_e2e_render.py
 tests/unit/test_eval_analysis.py
+tests/unit/test_eval_run_grader_errors.py
 tests/unit/test_evaluators.py
 tests/unit/test_foundry_discovery.py
 tests/unit/test_init_command.py

agentops_accelerator-0.3.7/tests/unit/test_eval_run_grader_errors.py ADDED

@@ -0,0 +1,150 @@
+"""CLI behaviour when graders *execute* but a subset errors out.
+A grader execution error (auth/RBAC/timeout) is not a quality regression, but
+because ``items_passed_all`` requires every grader on a row to succeed, a single
+errored grader flips ``overall_passed`` to ``False`` and the run reports
+``Threshold status: FAILED`` even though every computable threshold passed.
+The CLI must surface that distinction loudly so users (the most common trigger
+is data-plane RBAC that is still propagating) do not chase a phantom quality
+failure or start lowering thresholds.
+"""
+from __future__ import annotations
+import json
+from pathlib import Path
+from typer.testing import CliRunner
+from agentops.cli.app import _grader_error_summary, app
+from agentops.core.results import (
+    RowMetric,
+    RowResult,
+    RunResult,
+    RunSummary,
+    TargetInfo,
+    ThresholdEvaluation,
+)
+runner = CliRunner()
+_AUTH_ERROR = (
+    "FAILED_EXECUTION: (UserError) OpenAI API hits AuthenticationError: "
+    "Principal does not have access to API/Operation."
+)
+def _result_with_partial_grader_errors() -> RunResult:
+    """One row where coherence scored but similarity errored on auth."""
+    row = RowResult(
+        row_index=0,
+        input="plan a trip",
+        expected="an itinerary",
+        response="here is an itinerary",
+        metrics=[
+            RowMetric(name="coherence", value=5.0),
+            RowMetric(name="similarity", value=None, error=_AUTH_ERROR),
+        ],
+    )
+    summary = RunSummary(
+        items_total=1,
+        items_passed_all=0,  # the errored grader means no row passed all
+        items_pass_rate=0.0,
+        thresholds_total=1,
+        thresholds_passed=1,  # every computable threshold passed
+        threshold_pass_rate=1.0,
+        overall_passed=False,
+    )
+    return RunResult(
+        started_at="2026-06-01T00:00:00+00:00",
+        finished_at="2026-06-01T00:01:00+00:00",
+        duration_seconds=60.0,
+        target=TargetInfo(kind="foundry_prompt", raw="travel-agent:2"),
+        dataset_path="dataset.jsonl",
+        evaluators=["CoherenceEvaluator", "SimilarityEvaluator"],
+        rows=[row],
+        aggregate_metrics={"coherence": 5.0},
+        thresholds=[
+            ThresholdEvaluation(
+                metric="coherence",
+                criteria=">=",
+                expected=">=3",
+                actual="5",
+                passed=True,
+            )
+        ],
+        summary=summary,
+    )
+def test_grader_error_summary_counts_and_lifts_first_error() -> None:
+    errored, total, first_error = _grader_error_summary(
+        _result_with_partial_grader_errors()
+    )
+    assert (errored, total) == (1, 2)
+    assert first_error is not None
+    assert "AuthenticationError" in first_error
+def _write_minimal_config(tmp_path: Path) -> Path:
+    dataset = tmp_path / "dataset.jsonl"
+    dataset.write_text(json.dumps({"input": "hi", "expected": "hi"}), encoding="utf-8")
+    config = tmp_path / "agentops.yaml"
+    config.write_text(
+        json.dumps(
+            {"version": 1, "agent": "model:gpt-4o", "dataset": str(dataset)}
+        ),
+        encoding="utf-8",
+    )
+    return config
+def test_eval_run_warns_on_partial_grader_errors(tmp_path, monkeypatch) -> None:
+    config = _write_minimal_config(tmp_path)
+    output = tmp_path / "out"
+    output.mkdir()
+    crafted = _result_with_partial_grader_errors()
+    import agentops.pipeline.orchestrator as orch
+    monkeypatch.setattr(orch, "run_evaluation", lambda *a, **k: crafted)
+    result = runner.invoke(
+        app,
+        ["eval", "run", "--config", str(config), "--output", str(output)],
+    )
+    # A grader-execution failure keeps the gate-failed exit code...
+    assert result.exit_code == 2, result.output
+    # ...but the user is told it is an execution error, not a quality failure.
+    assert "grader execution(s) errored" in result.output
+    assert "propagating" in result.output
+    assert "AuthenticationError" in result.output
+    assert "FAILED" in result.output
+def test_eval_run_no_warning_when_no_grader_errors(tmp_path, monkeypatch) -> None:
+    config = _write_minimal_config(tmp_path)
+    output = tmp_path / "out"
+    output.mkdir()
+    clean = _result_with_partial_grader_errors()
+    # Drop the errored grader so the row is clean and the gate genuinely passes.
+    clean.rows[0].metrics = [RowMetric(name="coherence", value=5.0)]
+    clean.summary.items_passed_all = 1
+    clean.summary.items_pass_rate = 1.0
+    clean.summary.overall_passed = True
+    import agentops.pipeline.orchestrator as orch
+    monkeypatch.setattr(orch, "run_evaluation", lambda *a, **k: clean)
+    result = runner.invoke(
+        app,
+        ["eval", "run", "--config", str(config), "--output", str(output)],
+    )
+    assert result.exit_code == 0, result.output
+    assert "PASSED" in result.output
+    assert "grader execution(s) errored" not in result.output