PyPI - agentevals-cli - Versions diffs - 0.5.3__tar.gz → 0.6.0__tar.gz - Mend

agentevals-cli 0.5.3 (tar.gz) → 0.6.0 (tar.gz)


Files changed (211)

{agentevals_cli-0.5.3 → agentevals_cli-0.6.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: agentevals-cli
-Version: 0.5.3
+Version: 0.6.0
 Summary: Standalone framework to evaluate agent correctness based on portable OpenTelemetry traces
 License-File: LICENSE
 Requires-Python: >=3.11
@@ -17,23 +17,76 @@ Requires-Dist: uvicorn[standard]>=0.32.0
 Provides-Extra: live
 Requires-Dist: httpx>=0.27.0; extra == 'live'
 Requires-Dist: mcp>=1.26.0; extra == 'live'
+Provides-Extra: openai
+Requires-Dist: openai>=2.0; extra == 'openai'
 Provides-Extra: streaming
 Requires-Dist: opentelemetry-sdk>=1.20.0; extra == 'streaming'
 Requires-Dist: websockets>=12.0; extra == 'streaming'
 Description-Content-Type: text/markdown
 <p align="center">
-  <img src="docs/assets/logo-color.png" alt="agentevals" width="420" />
+  <picture>
+    <source media="(prefers-color-scheme: dark)" srcset="docs/assets/logo-color-on-transparent.svg">
+    <source media="(prefers-color-scheme: light)" srcset="docs/assets/logo-dark-on-transparent.svg">
+    <img src="docs/assets/logo-color-on-transparent.svg" alt="agentevals" width="420" />
+  </picture>
 </p>
-`agentevals` evaluates AI agent behavior from OpenTelemetry traces, without re-running the agent. Record once, score as many times as you want.
+<h1 align="center">Ship Agents Reliably</h1>
-Works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others). Supports Jaeger JSON and OTLP trace formats, built-in and custom evaluators, and LLM-based judges.
+<p align="center">
+Benchmark your agents before they hit production.<br>
+agentevals scores performance and inference quality from OpenTelemetry traces — no re-runs, no guesswork.
+</p>
+<p align="center">
+  <a href="https://github.com/agentevals-dev/agentevals/stargazers"><img src="https://img.shields.io/github/stars/agentevals-dev/agentevals?style=social" alt="GitHub Stars"></a>
+  &nbsp;
+  <a href="https://discord.gg/cpveEn8Ah2"><img src="https://img.shields.io/discord/1435836734666707190?label=Discord&logo=discord&logoColor=white&color=5865F2" alt="Discord"></a>
+  &nbsp;
+  <a href="https://github.com/agentevals-dev/agentevals/releases"><img src="https://img.shields.io/github/v/release/agentevals-dev/agentevals?label=Release" alt="Release"></a>
+  &nbsp;
+  <a href="https://github.com/agentevals-dev/agentevals/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg" alt="License"></a>
+  &nbsp;
+  <a href="https://pypi.org/project/agentevals-cli/"><img src="https://img.shields.io/pypi/v/agentevals-cli?label=PyPI&color=blue" alt="PyPI"></a>
+</p>
+<p align="center">
+  <a href="#installation">Install</a> · <a href="#quick-start">Quick Start</a> · <a href="https://github.com/agentevals-dev/agentevals/releases">Releases</a> · <a href="CONTRIBUTING.md">Contributing</a> · <a href="https://discord.gg/cpveEn8Ah2">Discord</a>
+</p>
+---
+## What is agentevals?
+agentevals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want — no re-runs, no guesswork.
+It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others), supports Jaeger JSON and OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges.
 - **CLI** for scripting and CI pipelines
 - **Web UI** for visual inspection and local developer experience
 - **MCP server** so MCP clients can run evaluations from a conversation
+## Why agentevals?
+Most evaluation tools require you to **re-execute your agent** for every test — burning tokens, time, and money on duplicate LLM calls. agentevals takes a different approach:
+- **No re-execution** — score agents from existing traces without replaying expensive LLM calls
+- **Framework-agnostic** — works with any agent framework that emits OpenTelemetry spans
+- **Golden eval sets** — compare actual behavior against defined expected behaviors for deterministic pass/fail gating
+- **Custom evaluators** — write scoring logic in Python, JavaScript, or any language
+- **CI/CD ready** — gate deployments on quality thresholds directly in your pipeline
+- **Local-first** — no cloud dependency required; everything runs on your machine
+## How It Works
+agentevals follows three simple steps:
+1. **Collect traces** — Instrument your agent with OpenTelemetry (or export traces from your tracing backend). Point the OTLP exporter at the agentevals receiver, or load trace files directly.
+2. **Define eval sets** — Create golden evaluation sets that describe expected agent behavior: which tools should be called, in what order, and what the output should look like.
+3. **Run evaluations** — Use the CLI, Web UI, or MCP server to score traces against your eval sets. Get per-metric scores, pass/fail results, and detailed span-level breakdowns.
 > [!IMPORTANT]
 > This project is under active development. Expect breaking changes.
@@ -64,6 +117,7 @@ Optional extras:
 ```bash
 pip install "agentevals-cli[live]"        # MCP server support
+pip install "agentevals-cli[openai]"      # OpenAI Evals API graders
 ```
 **GitHub [releases](../../releases)** also ship **core** wheels (CLI and API only) and **bundle** wheels (with the embedded UI) if you need a specific version or offline `pip install ./path/to.whl`.
@@ -188,7 +242,7 @@ evaluators:
 agentevals run trace.json --config eval_config.yaml --eval-set eval_set.json
 ```
-Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
+Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. You can also delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) using `type: openai_eval` (requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY`). See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
 ## Web UI
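
Note on the PKG-INFO hunks above: 0.6.0 adds an `openai` extra alongside `live` and `streaming`. As a rough illustration of how those `Provides-Extra` / `Requires-Dist` fields pair up, the sketch below parses an excerpt of the metadata with Python's standard `email` parser (PKG-INFO uses RFC 822 style headers). The helper name `requirements_by_extra` is hypothetical and is not part of agentevals; the marker matching is deliberately simplistic and only handles the exact `extra == '...'` form seen in this diff.

```python
from email import message_from_string

# Excerpt of the 0.6.0 PKG-INFO fields, copied from the diff above.
PKG_INFO = """\
Metadata-Version: 2.4
Name: agentevals-cli
Version: 0.6.0
Provides-Extra: live
Requires-Dist: httpx>=0.27.0; extra == 'live'
Requires-Dist: mcp>=1.26.0; extra == 'live'
Provides-Extra: openai
Requires-Dist: openai>=2.0; extra == 'openai'
Provides-Extra: streaming
Requires-Dist: opentelemetry-sdk>=1.20.0; extra == 'streaming'
Requires-Dist: websockets>=12.0; extra == 'streaming'
"""


def requirements_by_extra(pkg_info_text):
    """Group Requires-Dist entries under the extra named in their marker.

    Hypothetical helper for illustration only: it matches the literal
    marker form `extra == 'name'` and ignores anything more complex.
    """
    msg = message_from_string(pkg_info_text)
    grouped = {extra: [] for extra in msg.get_all("Provides-Extra", [])}
    for line in msg.get_all("Requires-Dist", []):
        requirement, _, marker = line.partition(";")
        marker = marker.strip()
        for extra in grouped:
            if marker in (f"extra == '{extra}'", f'extra == "{extra}"'):
                grouped[extra].append(requirement.strip())
    return grouped


extras = requirements_by_extra(PKG_INFO)
```

For real tooling, a proper requirement parser would be a safer choice than string matching on markers; this is only meant to make the new extra visible at a glance.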

{agentevals_cli-0.5.3 → agentevals_cli-0.6.0}/README.md RENAMED

@@ -1,15 +1,66 @@
 <p align="center">
-  <img src="docs/assets/logo-color.png" alt="agentevals" width="420" />
+  <picture>
+    <source media="(prefers-color-scheme: dark)" srcset="docs/assets/logo-color-on-transparent.svg">
+    <source media="(prefers-color-scheme: light)" srcset="docs/assets/logo-dark-on-transparent.svg">
+    <img src="docs/assets/logo-color-on-transparent.svg" alt="agentevals" width="420" />
+  </picture>
 </p>
-`agentevals` evaluates AI agent behavior from OpenTelemetry traces, without re-running the agent. Record once, score as many times as you want.
+<h1 align="center">Ship Agents Reliably</h1>
-Works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others). Supports Jaeger JSON and OTLP trace formats, built-in and custom evaluators, and LLM-based judges.
+<p align="center">
+Benchmark your agents before they hit production.<br>
+agentevals scores performance and inference quality from OpenTelemetry traces — no re-runs, no guesswork.
+</p>
+<p align="center">
+  <a href="https://github.com/agentevals-dev/agentevals/stargazers"><img src="https://img.shields.io/github/stars/agentevals-dev/agentevals?style=social" alt="GitHub Stars"></a>
+  &nbsp;
+  <a href="https://discord.gg/cpveEn8Ah2"><img src="https://img.shields.io/discord/1435836734666707190?label=Discord&logo=discord&logoColor=white&color=5865F2" alt="Discord"></a>
+  &nbsp;
+  <a href="https://github.com/agentevals-dev/agentevals/releases"><img src="https://img.shields.io/github/v/release/agentevals-dev/agentevals?label=Release" alt="Release"></a>
+  &nbsp;
+  <a href="https://github.com/agentevals-dev/agentevals/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg" alt="License"></a>
+  &nbsp;
+  <a href="https://pypi.org/project/agentevals-cli/"><img src="https://img.shields.io/pypi/v/agentevals-cli?label=PyPI&color=blue" alt="PyPI"></a>
+</p>
+<p align="center">
+  <a href="#installation">Install</a> · <a href="#quick-start">Quick Start</a> · <a href="https://github.com/agentevals-dev/agentevals/releases">Releases</a> · <a href="CONTRIBUTING.md">Contributing</a> · <a href="https://discord.gg/cpveEn8Ah2">Discord</a>
+</p>
+---
+## What is agentevals?
+agentevals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want — no re-runs, no guesswork.
+It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, and others), supports Jaeger JSON and OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges.
 - **CLI** for scripting and CI pipelines
 - **Web UI** for visual inspection and local developer experience
 - **MCP server** so MCP clients can run evaluations from a conversation
+## Why agentevals?
+Most evaluation tools require you to **re-execute your agent** for every test — burning tokens, time, and money on duplicate LLM calls. agentevals takes a different approach:
+- **No re-execution** — score agents from existing traces without replaying expensive LLM calls
+- **Framework-agnostic** — works with any agent framework that emits OpenTelemetry spans
+- **Golden eval sets** — compare actual behavior against defined expected behaviors for deterministic pass/fail gating
+- **Custom evaluators** — write scoring logic in Python, JavaScript, or any language
+- **CI/CD ready** — gate deployments on quality thresholds directly in your pipeline
+- **Local-first** — no cloud dependency required; everything runs on your machine
+## How It Works
+agentevals follows three simple steps:
+1. **Collect traces** — Instrument your agent with OpenTelemetry (or export traces from your tracing backend). Point the OTLP exporter at the agentevals receiver, or load trace files directly.
+2. **Define eval sets** — Create golden evaluation sets that describe expected agent behavior: which tools should be called, in what order, and what the output should look like.
+3. **Run evaluations** — Use the CLI, Web UI, or MCP server to score traces against your eval sets. Get per-metric scores, pass/fail results, and detailed span-level breakdowns.
 > [!IMPORTANT]
 > This project is under active development. Expect breaking changes.
@@ -40,6 +91,7 @@ Optional extras:
 ```bash
 pip install "agentevals-cli[live]"        # MCP server support
+pip install "agentevals-cli[openai]"      # OpenAI Evals API graders
 ```
 **GitHub [releases](../../releases)** also ship **core** wheels (CLI and API only) and **bundle** wheels (with the embedded UI) if you need a specific version or offline `pip install ./path/to.whl`.
@@ -164,7 +216,7 @@ evaluators:
 agentevals run trace.json --config eval_config.yaml --eval-set eval_set.json
 ```
-Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
+Community evaluators can be referenced directly from a shared GitHub repository using `type: remote`. You can also delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) using `type: openai_eval` (requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY`). See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK usage, and how to contribute evaluators.
 ## Web UI
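
Note on the README hunks above: the new text states that `type: openai_eval` requires the `openai` extra and an `OPENAI_API_KEY`. A minimal preflight check along those lines might look like the sketch below. The function name `openai_eval_preflight` is an assumption made for illustration; the diff does not show how (or whether) agentevals performs such a check itself.

```python
import importlib.util
import os


def openai_eval_preflight(env=None):
    """Return a list of problems; an empty list means both prerequisites are met.

    Hypothetical helper, not part of agentevals. `env` defaults to the
    process environment and can be overridden for testing.
    """
    env = os.environ if env is None else env
    problems = []
    # The README says the grader needs the optional `openai` dependency.
    if importlib.util.find_spec("openai") is None:
        problems.append(
            'openai package not found; try: pip install "agentevals-cli[openai]"'
        )
    # The README also says OPENAI_API_KEY must be available.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems
```

Running something like this in CI before `agentevals run` could surface a missing key as a readable message, though the exact error agentevals reports in that case is not shown in this diff.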

agentevals_cli-0.6.0/docs/assets/logo-color-on-transparent.svg ADDED

@@ -0,0 +1,13 @@
+<svg width="3302" height="1066" viewBox="0 0 3302 1066" fill="none" xmlns="http://www.w3.org/2000/svg">
+<path d="M518.695 264C560.958 264 595.207 298.274 595.207 340.537C595.207 382.8 560.958 417.048 518.695 417.048C454.983 417.048 403.305 468.548 403 532.184V533.304C403.306 596.94 454.983 648.438 518.695 648.438H518.722C560.985 648.439 595.232 682.687 595.232 724.95C595.232 767.213 560.984 801.461 518.722 801.461C476.459 801.461 442.21 767.213 442.21 724.95V724.67C442.057 661.008 390.482 609.408 326.795 609.255H326.515C284.252 609.255 250.004 575.006 250.004 532.743C250.004 490.48 284.252 456.232 326.515 456.232H326.642C390.431 456.156 442.108 404.453 442.185 340.664V340.512C442.185 298.249 476.432 264 518.695 264ZM492.436 469.353C527.452 454.848 567.596 471.476 582.101 506.492C596.605 541.508 579.976 581.653 544.96 596.157C509.944 610.661 469.8 594.033 455.296 559.017C440.792 524.001 457.42 483.857 492.436 469.353Z" fill="#8023C3"/>
+<path d="M1029.16 401.476V655.084H982.736L976.321 616.93C956.253 644.357 928.75 658.054 893.878 658.054C870.849 658.054 850.254 652.839 832.16 642.443C814.066 632.046 799.887 617.029 789.688 597.359C779.49 577.721 774.391 554.683 774.391 528.247C774.391 501.81 779.589 479.796 789.952 460.125C800.315 440.487 814.56 425.305 832.654 414.546C850.748 403.819 871.178 398.439 893.878 398.439C912.301 398.439 928.454 401.839 942.271 408.605C956.089 415.371 967.274 424.711 975.861 436.593V401.41H1029.19L1029.16 401.476ZM902.76 612.97C924.802 612.97 942.567 605.214 956.089 589.701C969.577 574.189 976.321 554.023 976.321 529.27C976.321 504.516 969.577 483.294 956.089 467.617C942.6 451.94 924.802 444.085 902.76 444.085C880.718 444.085 862.92 451.94 849.432 467.617C835.944 483.294 829.199 503.526 829.199 528.28C829.199 553.033 835.944 573.76 849.432 589.47C862.92 605.148 880.685 613.003 902.76 613.003V612.97Z" fill="white"/>
+<path d="M2723.45 401.476V655.084H2677.03L2670.61 616.93C2650.54 644.357 2623.04 658.054 2588.17 658.054C2565.14 658.054 2544.54 652.839 2526.45 642.443C2508.36 632.046 2494.18 617.029 2483.98 597.359C2473.78 577.721 2468.68 554.683 2468.68 528.247C2468.68 501.81 2473.88 479.796 2484.24 460.125C2494.6 440.487 2508.85 425.305 2526.94 414.546C2545.04 403.819 2565.47 398.439 2588.17 398.439C2606.59 398.439 2622.74 401.839 2636.56 408.605C2650.38 415.371 2661.56 424.711 2670.15 436.593V401.41H2723.48L2723.45 401.476ZM2597.05 612.97C2619.09 612.97 2636.86 605.214 2650.38 589.701C2663.87 574.189 2670.61 554.023 2670.61 529.27C2670.61 504.516 2663.87 483.294 2650.38 467.617C2636.89 451.94 2619.09 444.085 2597.05 444.085C2575.01 444.085 2557.21 451.94 2543.72 467.617C2530.23 483.294 2523.49 503.526 2523.49 528.28C2523.49 553.033 2530.23 573.76 2543.72 589.47C2557.21 605.148 2574.97 613.003 2597.05 613.003V612.97Z" fill="white"/>
+<path d="M1308.72 401.47V644.682C1308.72 680.36 1298.19 707.985 1277.14 727.655C1256.08 747.293 1223.48 757.129 1179.36 757.129C1145.11 757.129 1117.32 749.439 1095.93 734.091C1074.51 718.744 1062.67 697.027 1060.37 668.94H1114.68C1117.97 683.132 1125.54 694.123 1137.38 701.879C1149.23 709.635 1164.52 713.529 1183.31 713.529C1231.7 713.529 1255.88 689.898 1255.88 642.701V614.482C1237.46 642.206 1209.96 656.101 1173.44 656.101C1150.41 656.101 1129.82 650.887 1111.72 640.49C1093.63 630.094 1079.45 615.242 1069.25 595.901C1059.05 576.593 1053.95 553.721 1053.95 527.284C1053.95 500.847 1059.15 479.394 1069.51 459.922C1079.88 440.449 1094.12 425.333 1112.22 414.606C1130.31 403.88 1150.74 398.5 1173.44 398.5C1192.52 398.5 1209 402.296 1222.82 409.887C1236.64 417.478 1247.82 427.907 1256.41 441.109L1262.33 401.47H1308.75H1308.72ZM1182.32 610.984C1204.36 610.984 1222.13 603.294 1235.65 587.947C1249.14 572.6 1255.88 552.698 1255.88 528.274C1255.88 503.851 1249.14 482.794 1235.65 467.084C1222.16 451.406 1204.36 443.551 1182.32 443.551C1160.28 443.551 1142.48 451.307 1128.99 466.82C1115.51 482.332 1108.76 502.498 1108.76 527.251C1108.76 552.005 1115.51 572.17 1128.99 587.683C1142.48 603.195 1160.25 610.951 1182.32 610.951V610.984Z" fill="white"/>
+<path d="M1324.01 528.769C1324.01 502.696 1329.21 479.823 1339.57 460.153C1349.94 440.515 1364.41 425.333 1383.03 414.573C1401.62 403.847 1422.94 398.467 1446.99 398.467C1471.03 398.467 1492.81 403.417 1511.43 413.319C1530.02 423.22 1544.66 437.28 1555.39 455.433C1566.08 473.585 1571.61 494.906 1571.93 519.33C1571.93 525.931 1571.44 532.697 1570.45 539.628H1379.87V542.598C1381.19 564.711 1388.1 582.237 1400.6 595.109C1413.1 607.98 1429.71 614.416 1450.47 614.416C1466.92 614.416 1480.74 610.555 1491.96 602.766C1503.14 595.01 1510.55 584.019 1514.16 569.827H1567.49C1562.89 595.571 1550.45 616.727 1530.22 633.229C1509.99 649.731 1484.72 657.982 1454.42 657.982C1428.07 657.982 1405.14 652.636 1385.53 641.876C1365.96 631.15 1350.79 616.034 1340.1 596.561C1329.41 577.088 1324.04 554.447 1324.04 528.703L1324.01 528.769ZM1517.55 500.517C1515.25 482.035 1507.91 467.579 1495.58 457.182C1483.24 446.786 1467.68 441.571 1448.93 441.571C1431.49 441.571 1416.42 446.951 1403.76 457.677C1391.09 468.404 1383.76 482.695 1381.78 500.517H1517.55Z" fill="white"/>
+<path d="M1587.2 401.47H1633.61L1639.54 434.673C1658.62 410.58 1685.63 398.5 1720.5 398.5C1735.3 398.5 1748.96 400.711 1761.49 405.2C1773.99 409.656 1784.78 416.686 1793.83 426.257C1802.88 435.828 1809.88 447.974 1814.82 462.661C1819.75 477.348 1822.22 494.94 1822.22 515.402V655.078H1768.4V518.373C1768.4 494.28 1763.3 475.929 1753.1 463.387C1742.9 450.845 1727.93 444.574 1708.16 444.574C1687.11 444.574 1670.56 451.935 1658.55 466.622C1646.54 481.309 1640.52 501.541 1640.52 527.317V655.111H1587.2V401.47Z" fill="white"/>
+<path d="M1845.75 401.47V330.609H1899.57V401.437H1960.3V448.502H1899.57V580.752C1899.57 590.653 1901.54 597.683 1905.49 601.809C1909.44 605.934 1916.18 608.014 1925.72 608.014H1966.22V655.078H1914.87C1890.85 655.078 1873.31 649.566 1862.29 638.477C1851.27 627.42 1845.75 609.994 1845.75 586.23V448.535" fill="white"/>
+<path d="M1976.12 528.769C1976.12 502.696 1981.32 479.823 1991.69 460.153C2002.05 440.515 2016.52 425.333 2035.14 414.573C2053.73 403.847 2075.05 398.467 2099.1 398.467C2123.15 398.467 2144.93 403.417 2163.51 413.319C2182.1 423.22 2196.77 437.28 2207.47 455.433C2218.16 473.585 2223.69 494.906 2224.01 519.33C2224.01 525.931 2223.52 532.697 2222.53 539.628H2031.95V542.598C2033.27 564.711 2040.18 582.237 2052.68 595.109C2065.18 607.98 2081.83 614.416 2102.55 614.416C2119 614.416 2132.85 610.555 2144.04 602.766C2155.22 595.01 2162.63 584.019 2166.24 569.827H2219.57C2214.97 595.571 2202.53 616.727 2182.3 633.229C2162.07 649.731 2136.8 657.982 2106.5 657.982C2080.15 657.982 2057.19 652.636 2037.61 641.876C2018.04 631.15 2002.87 616.034 1992.18 596.561C1981.49 577.088 1976.12 554.447 1976.12 528.703V528.769ZM2169.67 500.517C2167.36 482.035 2160.03 467.579 2147.69 457.182C2135.35 446.786 2119.79 441.571 2101.04 441.571C2083.6 441.571 2068.54 446.951 2055.87 457.677C2043.2 468.404 2035.87 482.695 2033.89 500.517H2169.67Z" fill="white"/>
+<path d="M2216.86 401.475H2274.14L2343.75 597.621L2412.38 401.475H2468.67L2375.33 655.082H2310.16L2216.83 401.475H2216.86Z" fill="white"/>
+<path d="M2754.43 308.332H2807.76V655.079H2754.43V308.332Z" fill="white"/>
+<path d="M2882.6 571.352C2883.59 584.554 2889.77 595.379 2901.12 603.796C2912.47 612.212 2927.21 616.436 2945.31 616.436C2961.43 616.436 2974.52 613.4 2984.55 607.261C2994.59 601.155 2999.62 592.97 2999.62 582.739C2999.62 574.157 2997.32 567.721 2992.71 563.431C2988.11 559.14 2981.92 556.104 2974.19 554.256C2966.46 552.44 2954.52 550.559 2938.4 548.546C2916.36 545.905 2898.16 542.341 2883.85 537.885C2869.54 533.43 2857.99 526.334 2849.28 516.597C2840.56 506.861 2836.18 493.725 2836.18 477.223C2836.18 461.711 2840.56 447.915 2849.28 435.868C2857.99 423.821 2870 414.481 2885.33 407.88C2900.63 401.279 2918 397.979 2937.41 397.979C2969.32 397.979 2995.25 405.075 3015.18 419.267C3035.09 433.459 3045.88 453.459 3047.52 479.203H2995.67C2994.36 467.651 2988.6 458.146 2978.4 450.72C2968.2 443.294 2955.37 439.564 2939.88 439.564C2924.38 439.564 2911.92 442.535 2902.34 448.476C2892.8 454.416 2888.03 462.503 2888.03 472.734C2888.03 480.325 2890.4 486.035 2895.2 489.83C2899.97 493.626 2905.99 496.266 2913.23 497.752C2920.47 499.237 2932.15 500.986 2948.3 502.966C2970.01 505.277 2988.31 508.841 3003.11 513.627C3017.91 518.413 3029.76 526.004 3038.67 536.4C3047.56 546.797 3052 560.923 3052 578.745C3052 594.587 3047.39 608.548 3038.18 620.595C3028.97 632.642 3016.27 641.883 3000.15 648.319C2984.03 654.755 2965.9 657.989 2945.83 657.989C2911.92 657.989 2884.51 650.299 2863.62 634.952C2842.73 619.605 2831.94 598.383 2831.28 571.286H2882.64L2882.6 571.352Z" fill="white"/>
+</svg>

agentevals_cli-0.6.0/docs/assets/logo-dark-on-transparent.svg ADDED

@@ -0,0 +1,13 @@
+<svg width="3302" height="1066" viewBox="0 0 3302 1066" fill="none" xmlns="http://www.w3.org/2000/svg">
+<path d="M518.695 264C560.958 264 595.207 298.274 595.207 340.537C595.207 382.8 560.958 417.048 518.695 417.048C454.983 417.048 403.305 468.548 403 532.184V533.304C403.306 596.94 454.983 648.438 518.695 648.438H518.722C560.985 648.439 595.232 682.687 595.232 724.95C595.232 767.213 560.984 801.461 518.722 801.461C476.459 801.461 442.21 767.213 442.21 724.95V724.67C442.057 661.008 390.482 609.408 326.795 609.255H326.515C284.252 609.255 250.004 575.006 250.004 532.743C250.004 490.48 284.252 456.232 326.515 456.232H326.642C390.431 456.156 442.108 404.453 442.185 340.664V340.512C442.185 298.249 476.432 264 518.695 264ZM492.436 469.353C527.452 454.848 567.596 471.476 582.101 506.492C596.605 541.508 579.976 581.653 544.96 596.157C509.944 610.661 469.8 594.033 455.296 559.017C440.792 524.001 457.42 483.857 492.436 469.353Z" fill="#151927"/>
+<path d="M1029.16 401.476V655.084H982.736L976.321 616.93C956.253 644.357 928.75 658.054 893.878 658.054C870.849 658.054 850.254 652.839 832.16 642.443C814.066 632.046 799.887 617.029 789.688 597.359C779.49 577.721 774.391 554.683 774.391 528.247C774.391 501.81 779.589 479.796 789.952 460.125C800.315 440.487 814.56 425.305 832.654 414.546C850.748 403.819 871.178 398.439 893.878 398.439C912.301 398.439 928.454 401.839 942.271 408.605C956.089 415.371 967.274 424.711 975.861 436.593V401.41H1029.19L1029.16 401.476ZM902.76 612.97C924.802 612.97 942.567 605.214 956.089 589.701C969.577 574.189 976.321 554.023 976.321 529.27C976.321 504.516 969.577 483.294 956.089 467.617C942.6 451.94 924.802 444.085 902.76 444.085C880.718 444.085 862.92 451.94 849.432 467.617C835.944 483.294 829.199 503.526 829.199 528.28C829.199 553.033 835.944 573.76 849.432 589.47C862.92 605.148 880.685 613.003 902.76 613.003V612.97Z" fill="#151927"/>
+<path d="M2723.45 401.476V655.084H2677.03L2670.61 616.93C2650.54 644.357 2623.04 658.054 2588.17 658.054C2565.14 658.054 2544.54 652.839 2526.45 642.443C2508.36 632.046 2494.18 617.029 2483.98 597.359C2473.78 577.721 2468.68 554.683 2468.68 528.247C2468.68 501.81 2473.88 479.796 2484.24 460.125C2494.6 440.487 2508.85 425.305 2526.94 414.546C2545.04 403.819 2565.47 398.439 2588.17 398.439C2606.59 398.439 2622.74 401.839 2636.56 408.605C2650.38 415.371 2661.56 424.711 2670.15 436.593V401.41H2723.48L2723.45 401.476ZM2597.05 612.97C2619.09 612.97 2636.86 605.214 2650.38 589.701C2663.87 574.189 2670.61 554.023 2670.61 529.27C2670.61 504.516 2663.87 483.294 2650.38 467.617C2636.89 451.94 2619.09 444.085 2597.05 444.085C2575.01 444.085 2557.21 451.94 2543.72 467.617C2530.23 483.294 2523.49 503.526 2523.49 528.28C2523.49 553.033 2530.23 573.76 2543.72 589.47C2557.21 605.148 2574.97 613.003 2597.05 613.003V612.97Z" fill="#151927"/>
+<path d="M1308.72 401.47V644.682C1308.72 680.36 1298.19 707.985 1277.14 727.655C1256.08 747.293 1223.48 757.129 1179.36 757.129C1145.11 757.129 1117.32 749.439 1095.93 734.091C1074.51 718.744 1062.67 697.027 1060.37 668.94H1114.68C1117.97 683.132 1125.54 694.123 1137.38 701.879C1149.23 709.635 1164.52 713.529 1183.31 713.529C1231.7 713.529 1255.88 689.898 1255.88 642.701V614.482C1237.46 642.206 1209.96 656.101 1173.44 656.101C1150.41 656.101 1129.82 650.887 1111.72 640.49C1093.63 630.094 1079.45 615.242 1069.25 595.901C1059.05 576.593 1053.95 553.721 1053.95 527.284C1053.95 500.847 1059.15 479.394 1069.51 459.922C1079.88 440.449 1094.12 425.333 1112.22 414.606C1130.31 403.88 1150.74 398.5 1173.44 398.5C1192.52 398.5 1209 402.296 1222.82 409.887C1236.64 417.478 1247.82 427.907 1256.41 441.109L1262.33 401.47H1308.75H1308.72ZM1182.32 610.984C1204.36 610.984 1222.13 603.294 1235.65 587.947C1249.14 572.6 1255.88 552.698 1255.88 528.274C1255.88 503.851 1249.14 482.794 1235.65 467.084C1222.16 451.406 1204.36 443.551 1182.32 443.551C1160.28 443.551 1142.48 451.307 1128.99 466.82C1115.51 482.332 1108.76 502.498 1108.76 527.251C1108.76 552.005 1115.51 572.17 1128.99 587.683C1142.48 603.195 1160.25 610.951 1182.32 610.951V610.984Z" fill="#151927"/>
+<path d="M1324.01 528.769C1324.01 502.696 1329.21 479.823 1339.57 460.153C1349.94 440.515 1364.41 425.333 1383.03 414.573C1401.62 403.847 1422.94 398.467 1446.99 398.467C1471.03 398.467 1492.81 403.417 1511.43 413.319C1530.02 423.22 1544.66 437.28 1555.39 455.433C1566.08 473.585 1571.61 494.906 1571.93 519.33C1571.93 525.931 1571.44 532.697 1570.45 539.628H1379.87V542.598C1381.19 564.711 1388.1 582.237 1400.6 595.109C1413.1 607.98 1429.71 614.416 1450.47 614.416C1466.92 614.416 1480.74 610.555 1491.96 602.766C1503.14 595.01 1510.55 584.019 1514.16 569.827H1567.49C1562.89 595.571 1550.45 616.727 1530.22 633.229C1509.99 649.731 1484.72 657.982 1454.42 657.982C1428.07 657.982 1405.14 652.636 1385.53 641.876C1365.96 631.15 1350.79 616.034 1340.1 596.561C1329.41 577.088 1324.04 554.447 1324.04 528.703L1324.01 528.769ZM1517.55 500.517C1515.25 482.035 1507.91 467.579 1495.58 457.182C1483.24 446.786 1467.68 441.571 1448.93 441.571C1431.49 441.571 1416.42 446.951 1403.76 457.677C1391.09 468.404 1383.76 482.695 1381.78 500.517H1517.55Z" fill="#151927"/>
+<path d="M1587.2 401.47H1633.61L1639.54 434.673C1658.62 410.58 1685.63 398.5 1720.5 398.5C1735.3 398.5 1748.96 400.711 1761.49 405.2C1773.99 409.656 1784.78 416.686 1793.83 426.257C1802.88 435.828 1809.88 447.974 1814.82 462.661C1819.75 477.348 1822.22 494.94 1822.22 515.402V655.078H1768.4V518.373C1768.4 494.28 1763.3 475.929 1753.1 463.387C1742.9 450.845 1727.93 444.574 1708.16 444.574C1687.11 444.574 1670.56 451.935 1658.55 466.622C1646.54 481.309 1640.52 501.541 1640.52 527.317V655.111H1587.2V401.47Z" fill="#151927"/>
+<path d="M1845.75 401.47V330.609H1899.57V401.437H1960.3V448.502H1899.57V580.752C1899.57 590.653 1901.54 597.683 1905.49 601.809C1909.44 605.934 1916.18 608.014 1925.72 608.014H1966.22V655.078H1914.87C1890.85 655.078 1873.31 649.566 1862.29 638.477C1851.27 627.42 1845.75 609.994 1845.75 586.23V448.535" fill="#151927"/>
+<path d="M1976.12 528.769C1976.12 502.696 1981.32 479.823 1991.69 460.153C2002.05 440.515 2016.52 425.333 2035.14 414.573C2053.73 403.847 2075.05 398.467 2099.1 398.467C2123.15 398.467 2144.93 403.417 2163.51 413.319C2182.1 423.22 2196.77 437.28 2207.47 455.433C2218.16 473.585 2223.69 494.906 2224.01 519.33C2224.01 525.931 2223.52 532.697 2222.53 539.628H2031.95V542.598C2033.27 564.711 2040.18 582.237 2052.68 595.109C2065.18 607.98 2081.83 614.416 2102.55 614.416C2119 614.416 2132.85 610.555 2144.04 602.766C2155.22 595.01 2162.63 584.019 2166.24 569.827H2219.57C2214.97 595.571 2202.53 616.727 2182.3 633.229C2162.07 649.731 2136.8 657.982 2106.5 657.982C2080.15 657.982 2057.19 652.636 2037.61 641.876C2018.04 631.15 2002.87 616.034 1992.18 596.561C1981.49 577.088 1976.12 554.447 1976.12 528.703V528.769ZM2169.67 500.517C2167.36 482.035 2160.03 467.579 2147.69 457.182C2135.35 446.786 2119.79 441.571 2101.04 441.571C2083.6 441.571 2068.54 446.951 2055.87 457.677C2043.2 468.404 2035.87 482.695 2033.89 500.517H2169.67Z" fill="#151927"/>
+<path d="M2216.86 401.475H2274.14L2343.75 597.621L2412.38 401.475H2468.67L2375.33 655.082H2310.16L2216.83 401.475H2216.86Z" fill="#151927"/>
+<path d="M2754.43 308.332H2807.76V655.079H2754.43V308.332Z" fill="#151927"/>
+<path d="M2882.6 571.352C2883.59 584.554 2889.77 595.379 2901.12 603.796C2912.47 612.212 2927.21 616.436 2945.31 616.436C2961.43 616.436 2974.52 613.4 2984.55 607.261C2994.59 601.155 2999.62 592.97 2999.62 582.739C2999.62 574.157 2997.32 567.721 2992.71 563.431C2988.11 559.14 2981.92 556.104 2974.19 554.256C2966.46 552.44 2954.52 550.559 2938.4 548.546C2916.36 545.905 2898.16 542.341 2883.85 537.885C2869.54 533.43 2857.99 526.334 2849.28 516.597C2840.56 506.861 2836.18 493.725 2836.18 477.223C2836.18 461.711 2840.56 447.915 2849.28 435.868C2857.99 423.821 2870 414.481 2885.33 407.88C2900.63 401.279 2918 397.979 2937.41 397.979C2969.32 397.979 2995.25 405.075 3015.18 419.267C3035.09 433.459 3045.88 453.459 3047.52 479.203H2995.67C2994.36 467.651 2988.6 458.146 2978.4 450.72C2968.2 443.294 2955.37 439.564 2939.88 439.564C2924.38 439.564 2911.92 442.535 2902.34 448.476C2892.8 454.416 2888.03 462.503 2888.03 472.734C2888.03 480.325 2890.4 486.035 2895.2 489.83C2899.97 493.626 2905.99 496.266 2913.23 497.752C2920.47 499.237 2932.15 500.986 2948.3 502.966C2970.01 505.277 2988.31 508.841 3003.11 513.627C3017.91 518.413 3029.76 526.004 3038.67 536.4C3047.56 546.797 3052 560.923 3052 578.745C3052 594.587 3047.39 608.548 3038.18 620.595C3028.97 632.642 3016.27 641.883 3000.15 648.319C2984.03 654.755 2965.9 657.989 2945.83 657.989C2911.92 657.989 2884.51 650.299 2863.62 634.952C2842.73 619.605 2831.94 598.383 2831.28 571.286H2882.64L2882.6 571.352Z" fill="#151927"/>
+</svg>

{agentevals_cli-0.5.3 → agentevals_cli-0.6.0}/docs/custom-evaluators.md RENAMED

@@ -85,7 +85,9 @@ agentevals run traces/my_trace.json \
 ## Eval Config Reference
-Each evaluator entry in the `evaluators` list uses the following fields:
+Each evaluator entry in the `evaluators` list uses the following fields. The `type` field determines which other fields are valid.
+### `type: code` (local scripts)
 | Field | Required | Default | Description |
 |---|---|---|---|
@@ -96,6 +98,16 @@ Each evaluator entry in the `evaluators` list uses the following fields:
 | `timeout` | no | `30` | Subprocess timeout in seconds |
 | `config` | no | `{}` | Arbitrary key-value pairs passed to the evaluator |
+### `type: openai_eval` (OpenAI Evals API)
+| Field | Required | Default | Description |
+|---|---|---|---|
+| `name` | yes | | Unique name for the evaluator (used in output) |
+| `type` | yes | | `openai_eval` for OpenAI Evals API graders |
+| `threshold` | no | `0.5` | Maps to `pass_threshold` in the OpenAI grader |
+| `timeout` | no | `120` | Max seconds to wait for the OpenAI eval run |
+| `grader` | yes | | OpenAI grader config (see [OpenAI Evals Graders](#openai-evals-api-graders)) |
 ## Protocol
 Every evaluator — regardless of language — communicates via the same JSON protocol over stdin/stdout.
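
Note on the hunk above: the new `type: openai_eval` table lists `name`, `type` and `grader` as required, with `threshold` defaulting to `0.5` and `timeout` to `120`. The sketch below shows one way those rules could be applied to a parsed config entry. It is an illustration of the documented table only; `normalize_openai_eval` is a hypothetical name and the real loader in agentevals may behave differently.

```python
# Defaults and required fields as documented in the table added by this diff.
OPENAI_EVAL_DEFAULTS = {"threshold": 0.5, "timeout": 120}
OPENAI_EVAL_REQUIRED = ("name", "type", "grader")


def normalize_openai_eval(entry):
    """Check required fields and fill in documented defaults.

    Hypothetical helper: `entry` is one item of the `evaluators` list
    after YAML parsing, i.e. a plain dict.
    """
    missing = [field for field in OPENAI_EVAL_REQUIRED if field not in entry]
    if missing:
        raise ValueError(f"missing required field(s): {', '.join(missing)}")
    if entry["type"] != "openai_eval":
        raise ValueError(f"expected type 'openai_eval', got {entry['type']!r}")
    # Explicit values in the entry win over the defaults.
    return {**OPENAI_EVAL_DEFAULTS, **entry}


normalized = normalize_openai_eval(
    {
        "name": "response_similarity",
        "type": "openai_eval",
        "threshold": 0.8,
        "grader": {"type": "text_similarity", "evaluation_metric": "fuzzy_match"},
    }
)
```

Here the explicit `threshold: 0.8` is kept while `timeout` falls back to `120`, matching the defaults column in the table.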
@@ -275,6 +287,40 @@ evaluators:
 Remote evaluators are cached in `~/.cache/agentevals/evaluators/`. To force a re-download, delete the cached file.
+## OpenAI Evals API Graders
+You can delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) instead of running scoring logic locally. This requires `pip install "agentevals-cli[openai]"` and `OPENAI_API_KEY` to be set.
+### Text Similarity Grader
+Compares the agent's response against a golden reference using text similarity metrics. Requires an eval set.
+```yaml
+evaluators:
+  - name: response_similarity
+    type: openai_eval
+    threshold: 0.8
+    grader:
+      type: text_similarity
+      evaluation_metric: fuzzy_match
+```
+The `grader.evaluation_metric` field selects the similarity algorithm:
+| Metric | Description |
+|---|---|
+| `fuzzy_match` | Approximate string matching using edit distance |
+| `bleu` | N-gram overlap score, commonly used for translation quality |
+| `gleu` | Google's variant of BLEU with sentence-level scoring |
+| `meteor` | Alignment-based metric considering synonyms and paraphrases |
+| `cosine` | Cosine similarity on vectorized text |
+| `rouge_1` through `rouge_5` | Unigram through 5-gram overlap (F-measure) |
+| `rouge_l` | Longest common subsequence overlap (F-measure) |
+### How it works
+Under the hood, agentevals creates an ephemeral eval on OpenAI, submits the actual and expected responses as JSONL items, polls for results, and cleans up. The agent's response and the golden reference are both placed in the `item` namespace (with `include_sample_schema: false`), so OpenAI only grades the provided text without generating any model outputs.
 ### Configuring the GitHub source
 By default, evaluators are fetched from the official community repository. Override with environment variables:
@@ -303,42 +349,43 @@ The community repo uses per-evaluator manifests. A CI workflow compiles all `eva
 Custom evaluators use a layered architecture designed for extensibility.
 ```
-┌─────────────────────────────────────────┐
-│  Eval Config (YAML)                     │
-│  type: code | remote                    │
-└──────────────┬──────────────────────────┘
-               │
-               ▼
-┌─────────────────────────────────────────┐
-│  EvaluatorResolver                      │
-│  Downloads remote → local cache         │
-│  (passthrough for type: code)           │
-└──────────────┬──────────────────────────┘
-               │
-               ▼
-┌─────────────────────────────────────────┐
-│  CustomEvaluatorRunner                  │
-│  ADK Evaluator adapter                  │
-│  Invocation ↔ EvalInput/EvalResult      │
-└──────────────┬──────────────────────────┘
-               │
-               ▼
-┌─────────────────────────────────────────┐
-│  EvaluatorBackend (ABC) — executor factory │
-│  async run(EvalInput) → EvalResult      │
-├─────────────────────────────────────────┤
-│  "local"  → SubprocessBackend           │
-│  "docker" → DockerBackend (future)      │
-└──────────────┬──────────────────────────┘
-               │
-               ▼
-┌─────────────────────────────────────────┐
-│  Runtime registry                       │
-│  PythonRuntime (.py)                    │
-│  NodeRuntime (.js, .ts)                 │
-└─────────────────────────────────────────┘
+┌─────────────────────────────────────────────┐
+│  Eval Config (YAML)                         │
+│  type: code | remote | openai_eval          │
+└──────────────┬─────────────┬────────────────┘
+               │             │
+     code/remote         openai_eval
+               │             │
+               ▼             ▼
+┌──────────────────────┐  ┌──────────────────────┐
+│  EvaluatorResolver   │  │  OpenAI Evals API    │
+│  remote → local      │  │  create eval + run   │
+│  (passthrough: code) │  │  poll → get results  │
+└──────────┬───────────┘  └──────────────────────┘
+           │
+           ▼
+┌──────────────────────────┐
+│  CustomEvaluatorRunner   │
+│  ADK Evaluator adapter   │
+│  Invocation ↔ EvalInput  │
+└──────────┬───────────────┘
+           │
+           ▼
+┌──────────────────────────┐
+│  EvaluatorBackend (ABC)  │
+│  "local"  → Subprocess   │
+│  "docker" → (future)     │
+└──────────┬───────────────┘
+           │
+           ▼
+┌──────────────────────────┐
+│  Runtime registry        │
+│  PythonRuntime (.py)     │
+│  NodeRuntime (.js, .ts)  │
+└──────────────────────────┘
 ```
+- **`type: openai_eval`** takes a separate path: it calls the OpenAI Evals API directly (create eval, create run, poll, collect results) and returns a `MetricResult`. It does not go through the subprocess/backend stack.
 - **`EvaluatorSource`** is the registry abstraction. Implementations (`BuiltinEvaluatorSource`, `GitHubEvaluatorSource`) list and fetch evaluators from different registries.
 - **`EvaluatorResolver`** downloads remote evaluators and converts `RemoteEvaluatorDef` to `CodeEvaluatorDef` with a local cached path.
 - **`EvaluatorBackend`** is the execution abstraction. The `executor` field in config selects which factory to use (`"local"` → `SubprocessBackend`). New executors (e.g. `DockerBackend`) register via `register_executor()`.
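
The "How it works" paragraph in this file describes an ephemeral eval whose items carry both the actual and the expected text. A minimal sketch of the payloads such a flow might build is shown below. It is an illustration under stated assumptions, not the code shipped in agentevals: the helper name `build_eval_payloads` and the item field names `actual_response` / `expected_response` are hypothetical, and the `input` / `reference` template fields reflect my understanding of the OpenAI text-similarity grader schema. Only `include_sample_schema: false`, the `item` namespace, and the `threshold` → `pass_threshold` mapping come from the documentation above.

```python
import json
from typing import Any


def build_eval_payloads(
    name: str,
    grader: dict[str, Any],
    threshold: float,
    pairs: list[tuple[str, str]],
) -> tuple[dict[str, Any], list[str]]:
    """Return (eval definition, JSONL lines) for a text-similarity eval.

    The field names under ``item`` are hypothetical; agentevals may use others.
    """
    eval_definition = {
        "name": name,
        "data_source_config": {
            "type": "custom",
            "item_schema": {
                "type": "object",
                "properties": {
                    "actual_response": {"type": "string"},
                    "expected_response": {"type": "string"},
                },
                "required": ["actual_response", "expected_response"],
            },
            # No model sampling: both texts are supplied as item fields.
            "include_sample_schema": False,
        },
        "testing_criteria": [
            {
                **grader,  # "type" and "evaluation_metric" from the YAML config
                "name": name,
                "input": "{{item.actual_response}}",
                "reference": "{{item.expected_response}}",
                "pass_threshold": threshold,
            }
        ],
    }
    # One JSON object per line, each wrapping its fields in the item namespace.
    jsonl_lines = [
        json.dumps({"item": {"actual_response": a, "expected_response": e}})
        for a, e in pairs
    ]
    return eval_definition, jsonl_lines


definition, lines = build_eval_payloads(
    "response_similarity",
    {"type": "text_similarity", "evaluation_metric": "fuzzy_match"},
    0.8,
    [("Paris is the capital.", "The capital is Paris.")],
)
print(json.dumps(definition["testing_criteria"][0], indent=2))
print(lines[0])
```

Keeping payload construction separate from the network calls (create eval, create run, poll, delete) makes the mapping from YAML config to request body easy to inspect without an API key.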

{agentevals_cli-0.5.3 → agentevals_cli-0.6.0}/examples/custom_evaluators/eval_config.yaml RENAMED

@@ -32,4 +32,3 @@ evaluators:
     ref: evaluators/random_evaluator/random_evaluator.py
     threshold: 0.110
     executor: local

{agentevals_cli-0.5.3 → agentevals_cli-0.6.0}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "agentevals-cli"
-version = "0.5.3"
+version = "0.6.0"
 description = "Standalone framework to evaluate agent correctness based on portable OpenTelemetry traces"
 readme = "README.md"
 requires-python = ">=3.11"
@@ -30,6 +30,9 @@ streaming = [
     "opentelemetry-sdk>=1.20.0",
     "websockets>=12.0",
 ]
+openai = [
+    "openai>=2.0",
+]
 [project.scripts]
 agentevals = "agentevals.cli:main"

{agentevals_cli-0.5.3 → agentevals_cli-0.6.0}/src/agentevals/api/routes.py RENAMED

@@ -22,6 +22,7 @@ from ..config import (
     CodeEvaluatorDef,
     CustomEvaluatorDef,
     EvalRunConfig,
+    OpenAIEvalDef,
 )
 from ..extraction import get_extractor
 from ..runner import RunResult, get_loader, load_eval_set, run_evaluation
@@ -58,6 +59,7 @@ router = APIRouter()
 _TYPE_TO_MODEL = {
     "builtin": BuiltinMetricDef,
     "code": CodeEvaluatorDef,
+    "openai_eval": OpenAIEvalDef,
 }

{agentevals_cli-0.5.3 → agentevals_cli-0.6.0}/src/agentevals/config.py RENAMED

@@ -53,8 +53,48 @@ class RemoteEvaluatorDef(BaseEvaluatorDef):
     ref: str = Field(description="Source-specific reference (e.g. path within the repo).")
+_VALID_SIMILARITY_METRICS = frozenset(
+    {
+        "fuzzy_match",
+        "bleu",
+        "gleu",
+        "meteor",
+        "cosine",
+        "rouge_1",
+        "rouge_2",
+        "rouge_3",
+        "rouge_4",
+        "rouge_5",
+        "rouge_l",
+    }
+)
+class OpenAIEvalDef(BaseModel):
+    """An evaluator that delegates grading to the OpenAI Evals API."""
+    type: Literal["openai_eval"] = "openai_eval"
+    name: str
+    threshold: float = 0.5
+    timeout: int = Field(default=120, description="Max seconds to wait for the OpenAI eval run to complete.")
+    grader: dict[str, Any] = Field(description="OpenAI grader config passed to testing_criteria.")
+    @field_validator("grader")
+    @classmethod
+    def _validate_grader(cls, v: dict[str, Any]) -> dict[str, Any]:
+        grader_type = v.get("type")
+        if grader_type != "text_similarity":
+            raise ValueError(f"Only 'text_similarity' grader type is currently supported, got '{grader_type}'")
+        metric = v.get("evaluation_metric")
+        if not metric:
+            raise ValueError("'evaluation_metric' is required for text_similarity grader")
+        if metric not in _VALID_SIMILARITY_METRICS:
+            raise ValueError(f"Unknown evaluation_metric '{metric}'. Valid: {sorted(_VALID_SIMILARITY_METRICS)}")
+        return v
 CustomEvaluatorDef = Annotated[
-    BuiltinMetricDef | CodeEvaluatorDef | RemoteEvaluatorDef,
+    BuiltinMetricDef | CodeEvaluatorDef | RemoteEvaluatorDef | OpenAIEvalDef,
     Field(discriminator="type"),
 ]
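
To see which grader configs the new `_validate_grader` check accepts, the rules can be restated as a plain function. This is a stdlib-only sketch that mirrors the validator logic in the hunk above; the names `validate_grader` and `check` are hypothetical, and in the real `OpenAIEvalDef` model the `ValueError` would surface wrapped in a pydantic `ValidationError`.

```python
from typing import Any

# Same metric names as _VALID_SIMILARITY_METRICS in the diff above.
VALID_SIMILARITY_METRICS = frozenset(
    {"fuzzy_match", "bleu", "gleu", "meteor", "cosine"}
    | {f"rouge_{n}" for n in (1, 2, 3, 4, 5, "l")}
)


def validate_grader(v: dict[str, Any]) -> dict[str, Any]:
    """Apply the three checks in the same order as the validator."""
    grader_type = v.get("type")
    if grader_type != "text_similarity":
        raise ValueError(
            f"Only 'text_similarity' grader type is currently supported, got '{grader_type}'"
        )
    metric = v.get("evaluation_metric")
    if not metric:
        raise ValueError("'evaluation_metric' is required for text_similarity grader")
    if metric not in VALID_SIMILARITY_METRICS:
        raise ValueError(
            f"Unknown evaluation_metric '{metric}'. Valid: {sorted(VALID_SIMILARITY_METRICS)}"
        )
    return v


def check(grader: dict[str, Any]) -> str:
    """Return 'ok' or the rejection reason for a grader config."""
    try:
        validate_grader(grader)
        return "ok"
    except ValueError as exc:
        return f"rejected: {exc}"


print(check({"type": "text_similarity", "evaluation_metric": "bleu"}))
print(check({"type": "string_check"}))
print(check({"type": "text_similarity"}))
print(check({"type": "text_similarity", "evaluation_metric": "rouge_6"}))
```

Only the first config passes; the other three trip the type check, the missing-metric check, and the unknown-metric check respectively.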

{agentevals_cli-0.5.3 → agentevals_cli-0.6.0}/src/agentevals/custom_evaluators.py RENAMED

@@ -81,6 +81,9 @@ class Runtime(abc.ABC):
 class PythonRuntime(Runtime):
+    def __init__(self, python_path: Path | None = None):
+        self._exe = str(python_path) if python_path else sys.executable
     @property
     def name(self) -> str:
         return "Python"
@@ -90,13 +93,16 @@ class PythonRuntime(Runtime):
         return (".py",)
     def build_command(self, path: Path) -> list[str]:
-        return [sys.executable, str(path)]
+        return [self._exe, str(path)]
     def is_available(self) -> bool:
         return True
 class NodeRuntime(Runtime):
+    def __init__(self) -> None:
+        self._exe = shutil.which("node")
     @property
     def name(self) -> str:
         return "Node.js"
@@ -106,10 +112,12 @@ class NodeRuntime(Runtime):
         return (".js", ".ts")
     def build_command(self, path: Path) -> list[str]:
-        node = shutil.which("node")
-        if not node:
+        if not self._exe:
             raise RuntimeError("Node.js not found on PATH (required for .js/.ts evaluators)")
-        return [node, str(path)]
+        return [self._exe, str(path)]
+    def is_available(self) -> bool:
+        return self._exe is not None
 _RUNTIMES: list[Runtime] = [
@@ -203,12 +211,13 @@ class SubprocessBackend(EvaluatorBackend):
     """Runs a local code file (.py, .js, .ts, …) as a subprocess.
     The correct interpreter is resolved from the file extension via the
-    :data:`_RUNTIMES` registry.
+    :data:`_RUNTIMES` registry.  Pass a pre-configured *runtime* to override
+    the default (e.g. a :class:`PythonRuntime` with a venv interpreter).
     """
-    def __init__(self, path: Path, timeout: int = 30):
+    def __init__(self, path: Path, timeout: int = 30, runtime: Runtime | None = None):
         self._path = path.resolve()
-        self._runtime = _resolve_runtime(self._path)
+        self._runtime = runtime or _resolve_runtime(self._path)
         self._timeout = timeout
         if not self._path.exists():
@@ -223,7 +232,7 @@ class SubprocessBackend(EvaluatorBackend):
 # Executor factory
 # ---------------------------------------------------------------------------
-_EXECUTOR_FACTORIES: dict[str, Callable[[Path, int], EvaluatorBackend]] = {
+_EXECUTOR_FACTORIES: dict[str, Callable[..., EvaluatorBackend]] = {
     "local": lambda path, timeout: SubprocessBackend(path, timeout),
 }
@@ -236,7 +245,7 @@ def create_executor(executor_name: str, path: Path, timeout: int = 30) -> Evalua
     return factory(path, timeout)
-def register_executor(name: str, factory: Callable[[Path, int], EvaluatorBackend]) -> None:
+def register_executor(name: str, factory: Callable[..., EvaluatorBackend]) -> None:
     """Register a new executor factory (e.g. for Docker support)."""
     _EXECUTOR_FACTORIES[name] = factory
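
The loosened `Callable[..., EvaluatorBackend]` annotation lets a factory accept more than `(path, timeout)`, although `create_executor` as shown still calls it positionally with those two arguments. The sketch below illustrates that registry pattern in isolation. `FakeDockerBackend`, its `image` parameter, and the unknown-executor error message are hypothetical stand-ins, since the diff does not show a Docker backend or the lookup code.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable


class FakeDockerBackend:
    """Hypothetical backend used only to demonstrate registration."""

    def __init__(self, path: Path, timeout: int, image: str):
        self.path = path
        self.timeout = timeout
        self.image = image


# Registry keyed by executor name, mirroring _EXECUTOR_FACTORIES.
_EXECUTOR_FACTORIES: dict[str, Callable[..., object]] = {}


def register_executor(name: str, factory: Callable[..., object]) -> None:
    """Register a new executor factory under the given name."""
    _EXECUTOR_FACTORIES[name] = factory


def create_executor(executor_name: str, path: Path, timeout: int = 30) -> object:
    """Look up the factory and call it with (path, timeout)."""
    factory = _EXECUTOR_FACTORIES.get(executor_name)
    if factory is None:  # assumed error handling; not shown in the diff
        raise ValueError(f"Unknown executor '{executor_name}'")
    return factory(path, timeout)


# The extra keyword has a default, so the positional call still works.
register_executor(
    "docker",
    lambda path, timeout, image="python:3.12-slim": FakeDockerBackend(path, timeout, image),
)

backend = create_executor("docker", Path("evaluator.py"), 10)
print(type(backend).__name__, backend.timeout, backend.image)
```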
@@ -416,16 +425,41 @@ async def evaluate_custom_evaluator(
     """
     import inspect as _inspect
-    from .config import CodeEvaluatorDef, RemoteEvaluatorDef
+    from .config import CodeEvaluatorDef, OpenAIEvalDef, RemoteEvaluatorDef
     from .runner import MetricResult
+    if isinstance(evaluator_def, OpenAIEvalDef):
+        from .openai_eval_backend import evaluate_openai_eval
+        return await evaluate_openai_eval(evaluator_def, actual_invocations, expected_invocations)
     if isinstance(evaluator_def, RemoteEvaluatorDef):
         from .evaluator.resolver import get_default_resolver
         evaluator_def = await get_default_resolver().resolve(evaluator_def)
     if isinstance(evaluator_def, CodeEvaluatorDef):
-        backend = create_executor(evaluator_def.executor, Path(evaluator_def.path), evaluator_def.timeout)
+        evaluator_path = Path(evaluator_def.path)
+        runtime: Runtime | None = None
+        if evaluator_path.suffix == ".py":
+            from .evaluator.venv import ensure_venv_async
+            try:
+                venv_python = await ensure_venv_async(evaluator_path)
+            except Exception as exc:
+                logger.error("Failed to set up venv for '%s': %s", evaluator_def.name, exc)
+                return MetricResult(
+                    metric_name=evaluator_def.name,
+                    error=f"Dependency installation failed: {exc}",
+                )
+            if venv_python:
+                runtime = PythonRuntime(python_path=venv_python)
+        if runtime is not None:
+            backend = SubprocessBackend(evaluator_path, evaluator_def.timeout, runtime=runtime)
+        else:
+            backend = create_executor(evaluator_def.executor, evaluator_path, evaluator_def.timeout)
     else:
         raise ValueError(f"Unsupported custom evaluator type: {type(evaluator_def).__name__}")
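
The effect of the new `python_path` argument is easiest to see by comparing the commands the runtimes build. The following sketch reimplements the two runtime classes without the abstract base class, using the bodies shown in the hunks above; the venv path is a made-up example, and `ensure_venv_async` is not reproduced because its implementation is not part of this diff.

```python
from __future__ import annotations

import shutil
import sys
from pathlib import Path


class PythonRuntime:
    """Runs .py evaluators with a chosen interpreter (default: the current one)."""

    def __init__(self, python_path: Path | None = None):
        self._exe = str(python_path) if python_path else sys.executable

    def build_command(self, path: Path) -> list[str]:
        return [self._exe, str(path)]


class NodeRuntime:
    """Resolves the node binary once, at construction time."""

    def __init__(self) -> None:
        self._exe = shutil.which("node")

    def is_available(self) -> bool:
        return self._exe is not None

    def build_command(self, path: Path) -> list[str]:
        if not self._exe:
            raise RuntimeError("Node.js not found on PATH (required for .js/.ts evaluators)")
        return [self._exe, str(path)]


script = Path("evaluator.py")

# Default: the interpreter running agentevals itself.
default_cmd = PythonRuntime().build_command(script)

# Override: a per-evaluator venv interpreter (hypothetical path).
venv_cmd = PythonRuntime(Path("/opt/venvs/my_eval/bin/python")).build_command(script)

print(default_cmd)
print(venv_cmd)
print("node available:", NodeRuntime().is_available())
```

Because `SubprocessBackend` now accepts a `runtime` argument, passing the second runtime is what lets an evaluator run against its own installed dependencies rather than those of the host process.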

{agentevals_cli-0.5.3 → agentevals_cli-0.6.0}/src/agentevals/eval_config_loader.py RENAMED

@@ -13,6 +13,7 @@ from .config import (
     CodeEvaluatorDef,
     CustomEvaluatorDef,
     EvalRunConfig,
+    OpenAIEvalDef,
     RemoteEvaluatorDef,
 )
@@ -22,6 +23,7 @@ _TYPE_TO_MODEL = {
     "builtin": BuiltinMetricDef,
     "code": CodeEvaluatorDef,
     "remote": RemoteEvaluatorDef,
+    "openai_eval": OpenAIEvalDef,
 }
@@ -42,7 +44,7 @@ def _parse_evaluator_entry(entry: dict[str, Any]) -> tuple[str | None, CustomEva
     evaluator_type = entry.get("type")
     if not evaluator_type:
-        raise ValueError(f"Evaluator entry '{name}' must have a 'type' field (builtin, code, or remote)")
+        raise ValueError(f"Evaluator entry '{name}' must have a 'type' field ({', '.join(_TYPE_TO_MODEL)})")
     if evaluator_type not in _TYPE_TO_MODEL:
         raise ValueError(
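
Deriving the hint from `_TYPE_TO_MODEL` keeps the error message in sync with the registered types. A small sketch of the resulting text follows, with `object` standing in for the real model classes and `missing_type_message` as a hypothetical helper name.

```python
# Placeholder values; the real mapping points at pydantic model classes.
_TYPE_TO_MODEL = {
    "builtin": object,
    "code": object,
    "remote": object,
    "openai_eval": object,
}


def missing_type_message(name: str) -> str:
    """Build the message raised when an evaluator entry has no 'type'."""
    # Joining a dict iterates its keys in insertion order.
    return f"Evaluator entry '{name}' must have a 'type' field ({', '.join(_TYPE_TO_MODEL)})"


message = missing_type_message("my_eval")
print(message)
# Evaluator entry 'my_eval' must have a 'type' field (builtin, code, remote, openai_eval)
```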
