agentevals-cli 0.6.3__tar.gz → 0.7.0__tar.gz

Files changed (241)
  1. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/Dockerfile +2 -2
  2. agentevals_cli-0.7.0/PKG-INFO +419 -0
  3. agentevals_cli-0.7.0/README.md +393 -0
  4. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/charts/agentevals/Chart.yaml +1 -1
  5. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/charts/agentevals/templates/NOTES.txt +3 -2
  6. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/charts/agentevals/templates/deployment.yaml +3 -0
  7. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/charts/agentevals/templates/service.yaml +4 -0
  8. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/charts/agentevals/values.yaml +4 -0
  9. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/docs/otel-compatibility.md +107 -6
  10. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/README.md +17 -1
  11. agentevals_cli-0.7.0/examples/kubernetes/README.md +257 -0
  12. agentevals_cli-0.7.0/examples/zero-code-examples/ollama/requirements.txt +7 -0
  13. agentevals_cli-0.7.0/examples/zero-code-examples/ollama/run.py +194 -0
  14. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/zero-code-examples/openai-agents/requirements.txt +2 -1
  15. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/zero-code-examples/openai-agents/run.py +1 -0
  16. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/flake.lock +0 -21
  17. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/flake.nix +21 -13
  18. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/packages/evaluator-sdk-py/src/agentevals_evaluator_sdk/types.py +2 -0
  19. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/pyproject.toml +14 -2
  20. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/_protocol.py +2 -0
  21. agentevals_cli-0.7.0/src/agentevals/_static/assets/index-7YPfPT4N.js +342 -0
  22. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/_static/index.html +1 -1
  23. agentevals_cli-0.7.0/src/agentevals/api/app.py +152 -0
  24. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/api/dependencies.py +7 -2
  25. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/api/models.py +3 -0
  26. agentevals_cli-0.7.0/src/agentevals/api/otlp_app.py +27 -0
  27. agentevals_cli-0.7.0/src/agentevals/api/otlp_grpc.py +98 -0
  28. agentevals_cli-0.6.3/src/agentevals/api/otlp_routes.py → agentevals_cli-0.7.0/src/agentevals/api/otlp_processing.py +44 -106
  29. agentevals_cli-0.7.0/src/agentevals/api/otlp_routes.py +69 -0
  30. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/api/routes.py +35 -0
  31. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/api/streaming_routes.py +3 -1
  32. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/builtin_metrics.py +135 -4
  33. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/cli.py +83 -32
  34. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/config.py +13 -0
  35. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/custom_evaluators.py +24 -7
  36. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/eval_config_loader.py +4 -0
  37. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/extraction.py +86 -5
  38. agentevals_cli-0.7.0/src/agentevals/mcp_server.py +473 -0
  39. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/output.py +77 -19
  40. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/runner.py +43 -3
  41. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/streaming/ws_server.py +47 -2
  42. agentevals_cli-0.7.0/src/agentevals/trace_attrs.py +72 -0
  43. agentevals_cli-0.7.0/src/agentevals/trace_metrics.py +205 -0
  44. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/integration/conftest.py +14 -20
  45. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/integration/test_live_agents.py +34 -34
  46. agentevals_cli-0.7.0/tests/integration/test_otlp_grpc_receiver.py +128 -0
  47. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/test_api.py +94 -5
  48. agentevals_cli-0.7.0/tests/test_cli.py +87 -0
  49. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/test_extraction.py +208 -2
  50. agentevals_cli-0.7.0/tests/test_mcp_server.py +342 -0
  51. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/test_otlp_receiver.py +187 -50
  52. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/test_runner.py +133 -1
  53. agentevals_cli-0.7.0/tests/test_trace_metrics.py +519 -0
  54. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/package-lock.json +15 -15
  55. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/api/client.ts +3 -0
  56. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/inspector/ComparisonPanel.tsx +10 -1
  57. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/inspector/InspectorView.tsx +10 -0
  58. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/inspector/PerformanceSection.tsx +44 -1
  59. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/streaming/SessionCard.tsx +42 -0
  60. agentevals_cli-0.7.0/ui/src/components/streaming/SessionMetadata.tsx +95 -0
  61. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/upload/MetricSelector.tsx +46 -31
  62. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/upload/UploadView.tsx +20 -0
  63. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/context/TraceContext.tsx +2 -0
  64. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/context/TraceProvider.tsx +9 -1
  65. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/lib/types.ts +40 -0
  66. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/uv.lock +767 -30
  67. agentevals_cli-0.6.3/PKG-INFO +0 -333
  68. agentevals_cli-0.6.3/README.md +0 -307
  69. agentevals_cli-0.6.3/src/agentevals/_static/assets/index-CMANliTZ.js +0 -342
  70. agentevals_cli-0.6.3/src/agentevals/api/app.py +0 -133
  71. agentevals_cli-0.6.3/src/agentevals/api/otlp_app.py +0 -25
  72. agentevals_cli-0.6.3/src/agentevals/mcp_server.py +0 -237
  73. agentevals_cli-0.6.3/src/agentevals/trace_attrs.py +0 -33
  74. agentevals_cli-0.6.3/src/agentevals/trace_metrics.py +0 -126
  75. agentevals_cli-0.6.3/ui/src/components/streaming/SessionMetadata.tsx +0 -78
  76. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/.claude/skills/eval/SKILL.md +0 -0
  77. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/.claude/skills/eval/evals/evals.json +0 -0
  78. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/.claude/skills/inspect/SKILL.md +0 -0
  79. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/.claude/skills/inspect/evals/evals.json +0 -0
  80. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/.dockerignore +0 -0
  81. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/.github/ISSUE_TEMPLATE/bug_report.yml +0 -0
  82. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/.github/ISSUE_TEMPLATE/config.yml +0 -0
  83. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/.github/ISSUE_TEMPLATE/feature_request.yml +0 -0
  84. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/.github/workflows/ci.yml +0 -0
  85. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/.github/workflows/publish-evaluator-sdk.yml +0 -0
  86. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/.github/workflows/release.yml +0 -0
  87. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/.gitignore +0 -0
  88. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/.mcp.json +0 -0
  89. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/CONTRIBUTING.md +0 -0
  90. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/DEVELOPMENT.md +0 -0
  91. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/LICENSE +0 -0
  92. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/Makefile +0 -0
  93. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/charts/agentevals/templates/_helpers.tpl +0 -0
  94. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/charts/agentevals/templates/serviceaccount.yaml +0 -0
  95. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/docs/assets/logo-color-on-transparent.svg +0 -0
  96. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/docs/assets/logo-color.png +0 -0
  97. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/docs/assets/logo-dark-on-transparent.svg +0 -0
  98. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/docs/custom-evaluators.md +0 -0
  99. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/docs/eval-set-format.md +0 -0
  100. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/docs/streaming.md +0 -0
  101. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/custom_evaluators/eval_config.yaml +0 -0
  102. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/custom_evaluators/response_quality.py +0 -0
  103. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/custom_evaluators/tool_call_checker.py +0 -0
  104. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/dice_agent/README.md +0 -0
  105. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/dice_agent/agent.py +0 -0
  106. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/dice_agent/eval_set.json +0 -0
  107. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/dice_agent/main.py +0 -0
  108. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/dice_agent/test_streaming.py +0 -0
  109. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/langchain_agent/README.md +0 -0
  110. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/langchain_agent/agent.py +0 -0
  111. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/langchain_agent/eval_set.json +0 -0
  112. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/langchain_agent/main.py +0 -0
  113. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/langchain_agent/requirements.txt +0 -0
  114. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/langchain_agent/test_streaming.py +0 -0
  115. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/sdk_example/async_example.py +0 -0
  116. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/sdk_example/context_manager_example.py +0 -0
  117. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/sdk_example/decorator_example.py +0 -0
  118. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/sdk_example/requirements.txt +0 -0
  119. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/strands_agent/agent.py +0 -0
  120. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/strands_agent/eval_set.json +0 -0
  121. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/strands_agent/main.py +0 -0
  122. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/strands_agent/requirements.txt +0 -0
  123. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/zero-code-examples/adk/requirements.txt +0 -0
  124. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/zero-code-examples/adk/run.py +0 -0
  125. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/zero-code-examples/langchain/requirements.txt +0 -0
  126. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/zero-code-examples/langchain/run.py +0 -0
  127. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/zero-code-examples/strands/requirements.txt +0 -0
  128. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/examples/zero-code-examples/strands/run.py +0 -0
  129. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/packages/evaluator-sdk-py/README.md +0 -0
  130. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/packages/evaluator-sdk-py/pyproject.toml +0 -0
  131. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/packages/evaluator-sdk-py/src/agentevals_evaluator_sdk/__init__.py +0 -0
  132. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/packages/evaluator-sdk-py/src/agentevals_evaluator_sdk/decorator.py +0 -0
  133. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/samples/eval_set_helm.json +0 -0
  134. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/samples/evalset_helm_3_2026-02-23.json +0 -0
  135. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/samples/evalset_k8s_2026-02-20.json +0 -0
  136. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/samples/helm.json +0 -0
  137. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/samples/helm_2.json +0 -0
  138. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/samples/helm_3.json +0 -0
  139. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/samples/k8s.json +0 -0
  140. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/__init__.py +0 -0
  141. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/_static/assets/index-BqibLiHO.css +0 -0
  142. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/_static/logo.svg +0 -0
  143. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/_static/vite.svg +0 -0
  144. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/api/__init__.py +0 -0
  145. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/api/debug_routes.py +0 -0
  146. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/converter.py +0 -0
  147. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/evaluator/__init__.py +0 -0
  148. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/evaluator/resolver.py +0 -0
  149. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/evaluator/sources.py +0 -0
  150. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/evaluator/templates.py +0 -0
  151. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/evaluator/venv.py +0 -0
  152. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/genai_converter.py +0 -0
  153. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/loader/__init__.py +0 -0
  154. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/loader/base.py +0 -0
  155. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/loader/jaeger.py +0 -0
  156. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/loader/otlp.py +0 -0
  157. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/openai_eval_backend.py +0 -0
  158. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/sdk.py +0 -0
  159. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/streaming/__init__.py +0 -0
  160. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/streaming/incremental_processor.py +0 -0
  161. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/streaming/processor.py +0 -0
  162. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/streaming/session.py +0 -0
  163. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/utils/__init__.py +0 -0
  164. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/utils/genai_messages.py +0 -0
  165. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/utils/log_buffer.py +0 -0
  166. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/src/agentevals/utils/log_enrichment.py +0 -0
  167. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/integration/__init__.py +0 -0
  168. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/integration/test_evaluation_pipeline.py +0 -0
  169. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/integration/test_session_grouping.py +0 -0
  170. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/integration/test_timing_stress.py +0 -0
  171. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/test_converter.py +0 -0
  172. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/test_genai_converter.py +0 -0
  173. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/test_jaeger_loader.py +0 -0
  174. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/test_log_enrichment.py +0 -0
  175. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/test_otlp_loader.py +0 -0
  176. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/test_output.py +0 -0
  177. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/test_protocol.py +0 -0
  178. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/tests/test_sdk.py +0 -0
  179. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/.gitignore +0 -0
  180. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/README.md +0 -0
  181. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/eslint.config.js +0 -0
  182. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/index.html +0 -0
  183. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/package.json +0 -0
  184. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/public/logo.svg +0 -0
  185. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/public/vite.svg +0 -0
  186. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/App.css +0 -0
  187. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/App.tsx +0 -0
  188. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/assets/react.svg +0 -0
  189. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/annotation-queue/AnnotationDetailPanel.tsx +0 -0
  190. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/annotation-queue/AnnotationQueueView.tsx +0 -0
  191. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/annotation-queue/AnnotationTable.tsx +0 -0
  192. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/bug-report/BugReportModal.tsx +0 -0
  193. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/builder/BuilderHeader.tsx +0 -0
  194. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/builder/BuilderView.tsx +0 -0
  195. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/builder/EvalCaseCard.tsx +0 -0
  196. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/builder/EvalCasesList.tsx +0 -0
  197. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/builder/InvocationEditor.tsx +0 -0
  198. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/builder/JsonPreview.tsx +0 -0
  199. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/builder/MetadataEditor.tsx +0 -0
  200. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/builder/TraceUploadZone.tsx +0 -0
  201. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/builder/index.ts +0 -0
  202. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/dashboard/DashboardView.tsx +0 -0
  203. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/dashboard/MetricScoreCard.tsx +0 -0
  204. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/dashboard/PerformanceCard.tsx +0 -0
  205. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/dashboard/PerformanceCharts.tsx +0 -0
  206. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/dashboard/SummaryStats.tsx +0 -0
  207. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/dashboard/TraceCard.tsx +0 -0
  208. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/dashboard/TraceTable.tsx +0 -0
  209. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/inspector/DataSection.tsx +0 -0
  210. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/inspector/InspectorHeader.tsx +0 -0
  211. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/inspector/InspectorLayout.tsx +0 -0
  212. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/inspector/InvocationCard.tsx +0 -0
  213. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/inspector/InvocationSummaryPanel.tsx +0 -0
  214. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/inspector/MetricResultsSection.tsx +0 -0
  215. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/inspector/MetricsComparisonSection.tsx +0 -0
  216. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/inspector/ToolCallList.tsx +0 -0
  217. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/inspector/TrajectoryComparisonDetails.tsx +0 -0
  218. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/sidebar/Sidebar.tsx +0 -0
  219. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/streaming/LiveConversationPanel.tsx +0 -0
  220. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/streaming/LiveMessage.tsx +0 -0
  221. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/streaming/LiveStreamingView.tsx +0 -0
  222. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/upload/EvalSetEditorDrawer.tsx +0 -0
  223. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/upload/FileDropZone.tsx +0 -0
  224. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/upload/RawJsonPreview.tsx +0 -0
  225. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/upload/TraceEditorDrawer.tsx +0 -0
  226. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/components/welcome/WelcomeView.tsx +0 -0
  227. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/config.ts +0 -0
  228. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/index.css +0 -0
  229. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/lib/console-capture.ts +0 -0
  230. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/lib/evalset-builder.ts +0 -0
  231. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/lib/network-capture.ts +0 -0
  232. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/lib/trace-helpers.ts +0 -0
  233. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/lib/trace-loader.ts +0 -0
  234. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/lib/trace-metadata.ts +0 -0
  235. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/lib/trace-patcher.ts +0 -0
  236. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/lib/utils.ts +0 -0
  237. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/src/main.tsx +0 -0
  238. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/tsconfig.app.json +0 -0
  239. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/tsconfig.json +0 -0
  240. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/tsconfig.node.json +0 -0
  241. {agentevals_cli-0.6.3 → agentevals_cli-0.7.0}/ui/vite.config.ts +0 -0
@@ -33,6 +33,6 @@ USER app
  ENV PATH="/app/.venv/bin:$PATH"
  ENV AGENTEVALS_SERVER_URL=http://127.0.0.1:8001
 
- EXPOSE 8001 4318 8080
+ EXPOSE 8001 4318 4317 8080
 
- CMD ["agentevals", "serve", "--host", "0.0.0.0", "--port", "8001", "--otlp-port", "4318", "--mcp-port", "8080"]
+ CMD ["agentevals", "serve", "--host", "0.0.0.0", "--port", "8001", "--otlp-http-port", "4318", "--otlp-grpc-port", "4317", "--mcp-port", "8080"]
@@ -0,0 +1,419 @@
+ Metadata-Version: 2.4
+ Name: agentevals-cli
+ Version: 0.7.0
+ Summary: Standalone framework to evaluate agent correctness based on portable OpenTelemetry traces
+ License-File: LICENSE
+ Requires-Python: >=3.11
+ Requires-Dist: click>=8.0
+ Requires-Dist: fastapi>=0.115.0
+ Requires-Dist: google-adk[eval]>=1.30.0
+ Requires-Dist: httpx>=0.27.0
+ Requires-Dist: opentelemetry-proto>=1.36.0
+ Requires-Dist: python-dotenv>=1.0.0
+ Requires-Dist: python-multipart>=0.0.12
+ Requires-Dist: pyyaml>=6.0
+ Requires-Dist: tabulate>=0.9.0
+ Requires-Dist: uvicorn[standard]>=0.32.0
+ Provides-Extra: live
+ Requires-Dist: httpx>=0.27.0; extra == 'live'
+ Requires-Dist: mcp>=1.26.0; extra == 'live'
+ Provides-Extra: openai
+ Requires-Dist: openai>=2.0; extra == 'openai'
+ Provides-Extra: streaming
+ Requires-Dist: opentelemetry-sdk>=1.20.0; extra == 'streaming'
+ Requires-Dist: websockets>=12.0; extra == 'streaming'
+ Description-Content-Type: text/markdown
+
+ <p align="center">
+ <picture>
+ <source media="(prefers-color-scheme: dark)" srcset="docs/assets/logo-color-on-transparent.svg">
+ <source media="(prefers-color-scheme: light)" srcset="docs/assets/logo-dark-on-transparent.svg">
+ <img src="docs/assets/logo-color-on-transparent.svg" alt="agentevals" width="420" />
+ </picture>
+ </p>
+
+ <h1 align="center">Ship Agents Reliably</h1>
+
+ <p align="center">
+ Benchmark your agents before they hit production.<br>
+ agentevals scores performance and inference quality from OpenTelemetry traces. No re-runs, no guesswork.
+ </p>
+
+ <p align="center">
+ <a href="https://github.com/agentevals-dev/agentevals/stargazers"><img src="https://img.shields.io/github/stars/agentevals-dev/agentevals?style=social" alt="GitHub Stars"></a>
+ &nbsp;
+ <a href="https://discord.gg/cpveEn8Ah2"><img src="https://img.shields.io/discord/1435836734666707190?label=Discord&logo=discord&logoColor=white&color=5865F2" alt="Discord"></a>
+ &nbsp;
+ <a href="https://github.com/agentevals-dev/agentevals/releases"><img src="https://img.shields.io/github/v/release/agentevals-dev/agentevals?label=Release" alt="Release"></a>
+ &nbsp;
+ <a href="https://github.com/agentevals-dev/agentevals/blob/main/LICENSE"><img src="https://img.shields.io/badge/License-Apache%202.0-green.svg" alt="License"></a>
+ &nbsp;
+ <a href="https://pypi.org/project/agentevals-cli/"><img src="https://img.shields.io/pypi/v/agentevals-cli?label=PyPI&color=blue" alt="PyPI"></a>
+ </p>
+
+ <p align="center">
+ <a href="#installation">Install</a> · <a href="#quick-start">Quick Start</a> · <a href="https://github.com/agentevals-dev/agentevals/releases">Releases</a> · <a href="CONTRIBUTING.md">Contributing</a> · <a href="https://discord.gg/cpveEn8Ah2">Discord</a>
+ </p>
+
+ ---
+
+ ## What is agentevals?
+
+ agentevals is a framework-agnostic evaluation solution that scores AI agent behavior directly from [OpenTelemetry](https://opentelemetry.io/) traces. Record your agent's actions once, then evaluate as many times as you want without re-executing or burning extra tokens.
+
+ It works with any OTel-instrumented framework (LangChain, Strands, Google ADK, OpenAI Agents SDK, and others), supports Jaeger JSON and native OTLP trace formats, and ships with built-in evaluators, custom evaluator support, and LLM-based judges.
+
+ - **No re-execution**: score agents from existing traces without replaying expensive LLM calls
+ - **Golden eval sets**: compare actual behavior against defined expected behaviors for deterministic pass/fail gating
+ - **Custom evaluators**: write scoring logic in Python, JavaScript, or any language, or offload scoring to OpenAI Eval API
+ - **CI/CD ready**: gate deployments on quality thresholds directly in your pipeline
+ - **Local-first**: no cloud dependency required; everything runs on your machine
+ - **Multiple interfaces**: CLI for scripting and CI, Web UI for visual inspection, MCP server for conversational evaluation, Helm chart for Kubernetes environments
+
+ > [!IMPORTANT]
+ > This project is under active development. Expect breaking changes.
+
+ ## Contents
+
+ - [Installation](#installation)
+ - [Quick Start](#quick-start)
+ - [Use-cases and Integrations](#use-cases-and-integrations)
+ - [CLI](#cli)
+ - [Custom Evaluators](#custom-evaluators)
+ - [Web UI](#web-ui)
+ - [Deployment](#deployment)
+ - [MCP Server](#mcp-server)
+ - [Claude Code Skills](#claude-code-skills)
+ - [Examples](#examples)
+ - [Docs](#docs)
+ - [Development](#development)
+ - [FAQ](#faq)
+
+ ## Installation
+
+ **From PyPI** (recommended): the published package includes the **CLI**, **REST API**, and **embedded web UI**.
+
+ ```bash
+ pip install agentevals-cli
+ ```
+
+ Optional extras:
+
+ ```bash
+ pip install "agentevals-cli[live]" # MCP server support
+ pip install "agentevals-cli[openai]" # OpenAI Evals API graders
+ ```
+
+ **GitHub [releases](../../releases)** also ship **core** wheels (CLI and API only) and **bundle** wheels (with the embedded UI) if you need a specific version or offline `pip install ./path/to.whl`.
+
+ **From source** with `uv` or Nix:
+
+ ```bash
+ uv sync
+ # or: nix develop .
+ ```
+
+ See [DEVELOPMENT.md](DEVELOPMENT.md) for build instructions.
+
+ ## Quick Start
+
+ Examples use `agentevals` on your PATH after `pip install agentevals-cli`. If you are working from a clone of this repo, use `uv run agentevals` instead.
+
+ The `samples/` directory includes real traces from a Kubernetes Helm agent and matching eval sets that define expected behavior (which tools should be called, what the response should contain).
+
+ **Score a trace against an eval set:**
+
+ ```bash
+ agentevals run samples/helm.json \
+ --eval-set samples/eval_set_helm.json \
+ -m tool_trajectory_avg_score
+ ```
+
+ The agent was asked to list Helm releases. The eval set expects a call to `helm_list_releases`. It matches:
+
+ ```
+ Trace: 3e289017fe03ffd7c4145316d2eb3d0d
+ Invocations: 1
+         Metric                       Score  Status      Per-Invocation  Time
+ ------  -------------------------  -------  --------  ----------------  ------
+ [PASS]  tool_trajectory_avg_score        1  PASSED                   1  0ms
+ ```
+
+ **Catch a mismatch.** Run a different trace against the same eval set:
+
+ ```bash
+ agentevals run samples/k8s.json \
+ --eval-set samples/eval_set_helm.json \
+ -m tool_trajectory_avg_score
+ ```
+
+ This trace is from a different agent session that never called the expected tool. The evaluation fails:
+
+ ```
+ [FAIL]  tool_trajectory_avg_score        0  FAILED                   0  0ms
+ Invocation 1 trajectory mismatch:
+ Expected:
+ - helm_list_releases({})
+ Actual:
+ (none)
+ ```
+
+ **Evaluate multiple dimensions at once:**
+
+ ```bash
+ agentevals run samples/helm_3.json \
+ --eval-set samples/evalset_helm_3_2026-02-23.json \
+ -m tool_trajectory_avg_score \
+ -m response_match_score
+ ```
+
+ `tool_trajectory_avg_score` checks whether the right tools were called. `response_match_score` checks whether the agent's final answer matches the expected response.
+
+ **Explore visually.** Launch the Web UI and upload traces from the browser:
+
+ ```bash
+ agentevals serve
+ # opens http://localhost:8001
+ ```
+
+ You can also point any OTel-instrumented agent directly at the built-in receiver (`OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318`). The UI streams tool calls, inputs, and outputs live as your agent runs. For production setups, the same receiver slots into a Kubernetes OTel Collector pipeline as an exporter destination. See [Use-cases and Integrations](#use-cases-and-integrations) and the [Kubernetes example](examples/kubernetes/README.md) for walkthroughs.
+
+ **Next steps:**
+
+ - `agentevals evaluator list` to see all built-in and community evaluators
+ - [Custom Evaluators](#custom-evaluators) to write your own scoring logic
+
186
+ ## Use-cases and Integrations
187
+
188
+ ### Zero-Code (Recommended)
189
+
190
+ Point any OTel-instrumented agent at the agentevals receiver. No SDK, no code changes:
191
+
192
+ ```bash
193
+ # Terminal 1
194
+ agentevals serve --dev
195
+
196
+ # Terminal 2
197
+ export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
198
+ export OTEL_RESOURCE_ATTRIBUTES="agentevals.session_name=my-agent"
199
+ python your_agent.py
200
+ ```
201
+
202
+ For OTLP/gRPC exporters, use:
203
+
204
+ ```bash
205
+ export OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317
206
+ export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
207
+ ```
208
+
209
+ Traces stream to the UI in real-time. Works with LangChain, Strands, Google ADK, OpenAI Agents SDK, or any framework that emits OTel spans (`http/protobuf`, `http/json`, and OTLP/gRPC supported). Sessions are auto-created and grouped by `agentevals.session_name`. Set `agentevals.eval_set_id` to associate traces with an eval set.
210
+
211
+ See [examples/zero-code-examples/](examples/zero-code-examples/) for working examples.
212
+
213
+ ### AgentEvals SDK
214
+
215
+ For programmatic control of the session lifecycle and a decorator API:
216
+
217
+ ```python
+ from agentevals import AgentEvals
+
+ app = AgentEvals()
+
+ with app.session(eval_set_id="my-eval"):
+     agent.invoke("Roll a 20-sided die for me")
+ ```
225
+
226
+ Requires `pip install "agentevals-cli[streaming]"`. See [examples/sdk_example/](examples/sdk_example/) for framework-specific patterns.
227
+
228
+ ## CLI
229
+
230
+ ```bash
+ # Multiple traces, JSON output
+ agentevals run samples/helm.json samples/k8s.json \
+   --eval-set samples/eval_set_helm.json \
+   -m tool_trajectory_avg_score \
+   --output json
+
+ # List available evaluators
+ agentevals evaluator list
+
+ # Flexible trajectory matching (EXACT | IN_ORDER | ANY_ORDER)
+ agentevals run trace.json \
+   --eval-set eval_set.json \
+   -m tool_trajectory_avg_score \
+   --trajectory-match-type IN_ORDER
+ ```
246
+
247
+ Run `agentevals run --help` for all options.
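The three `--trajectory-match-type` values can be read as progressively looser comparisons between the expected and the actual tool calls. The sketch below illustrates one plausible reading of them. It is an assumption for illustration only, not the implementation agentevals uses, and the tool names in the usage note are made up.

```python
from collections import Counter


def trajectory_matches(expected, actual, match_type="EXACT"):
    """Return True if the actual tool names satisfy the expected ones."""
    if match_type == "EXACT":
        # Same tools, same order, nothing extra.
        return actual == expected
    if match_type == "IN_ORDER":
        # Expected tools must appear in order; extra calls in between are allowed.
        remaining = iter(actual)
        return all(name in remaining for name in expected)
    if match_type == "ANY_ORDER":
        # Every expected tool must appear at least as often as expected.
        missing = Counter(expected) - Counter(actual)
        return not missing
    raise ValueError(f"unknown match type: {match_type}")


def average_score(pairs, match_type="EXACT"):
    """Average of per-invocation results (1.0 for a match, 0.0 otherwise)."""
    if not pairs:
        return 0.0
    hits = sum(1.0 for expected, actual in pairs
               if trajectory_matches(expected, actual, match_type))
    return hits / len(pairs)
```

Under this reading, an expected trajectory of `["helm_list", "helm_get"]` against an actual `["helm_list", "kubectl_get", "helm_get"]` fails `EXACT` but passes `IN_ORDER` and `ANY_ORDER`.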
248
+
249
+ ## Custom Evaluators
250
+
251
+ Write scoring logic in Python, JavaScript, or any language. Scaffold a new evaluator with:
252
+
253
+ ```bash
254
+ agentevals evaluator init my_evaluator
255
+ ```
256
+
257
+ Reference it alongside built-in metrics in an eval config:
258
+
259
+ ```yaml
+ evaluators:
+   - name: tool_trajectory_avg_score
+     type: builtin
+   - name: my_evaluator
+     type: code
+     path: ./evaluators/my_evaluator.py
+     threshold: 0.7
+ ```
268
+
269
+ Evaluators with a `requirements.txt` get automatic virtual environment management. You can also use `type: remote` for community evaluators from GitHub, or `type: openai_eval` to delegate grading to the [OpenAI Evals API](https://developers.openai.com/api/reference/resources/evals/methods/create) (requires `pip install "agentevals-cli[openai]"`).
270
+
271
+ See the [Custom Evaluators guide](docs/custom-evaluators.md) for the full protocol reference, SDK helpers, and how to contribute evaluators.
272
+
273
+ ## Web UI
274
+
275
+ ```bash
276
+ agentevals serve # bundled UI on http://localhost:8001
277
+ ```
278
+
279
+ Upload traces and eval sets, select metrics, and view results with interactive span trees. Live-streamed traces appear in the "Local Dev" tab, grouped by session ID. For running from source, see [DEVELOPMENT.md](DEVELOPMENT.md).
280
+
281
+ Interactive API docs are available at `/docs` (Swagger) and `/redoc` while the server is running. The OTLP receiver on port 4318 serves its own docs at `http://localhost:4318/docs`.
282
+
283
+ ## Deployment
284
+
285
+ ### Docker
286
+
287
+ A `Dockerfile` is included at the project root. The image bundles the API, web UI, and OTLP receiver:
288
+
289
+ ```bash
290
+ docker build -t agentevals .
291
+ docker run -p 8001:8001 -p 4317:4317 -p 4318:4318 agentevals
292
+ ```
293
+
294
+ | Port | Purpose |
295
+ |------|---------|
296
+ | 8001 | Web UI and REST API |
297
+ | 4317 | OTLP gRPC receiver (traces and logs) |
298
+ | 4318 | OTLP HTTP receiver (traces and logs) |
299
+ | 8080 | MCP (Streamable HTTP) |
300
+
301
+ ### Helm
302
+
303
+ A Helm chart is available in [`charts/agentevals/`](charts/agentevals/):
304
+
305
+ ```bash
306
+ helm install agentevals ./charts/agentevals
307
+ ```
308
+
309
+ See the [Kubernetes example](examples/kubernetes/README.md) for an end-to-end walkthrough deploying agentevals alongside kagent and an OTel Collector on Kubernetes.
310
+
311
+ ## MCP Server
312
+
313
+ Exposes evaluation tools to MCP clients. A `.mcp.json` at the project root lets Claude Code pick it up automatically.
314
+
315
+ | Tool | Requires `serve` | Description |
316
+ |------|:---:|-------------|
317
+ | `list_metrics` | yes | List available metrics |
318
+ | `evaluate_traces` | no | Evaluate local trace files (OTLP or Jaeger) |
319
+ | `list_sessions` | yes | List streaming sessions |
320
+ | `summarize_session` | yes | Structured summary of a session's tool calls |
321
+ | `evaluate_sessions` | yes | Evaluate sessions against a golden reference |
322
+
323
+ ```bash
324
+ # Custom server URL (requires pip install "agentevals-cli[live]")
325
+ AGENTEVALS_SERVER_URL=http://localhost:9000 agentevals mcp
326
+ ```
327
+
328
+ The React UI and MCP server share the same in-memory session state and can run simultaneously.
329
+
330
+ ## Claude Code Skills
331
+
332
+ Two slash-command workflows in `.claude/skills/`, available automatically in this repo:
333
+
334
+ | Skill | What it does |
335
+ |-------|-------------|
336
+ | `/eval` | Score traces or compare sessions against a golden reference |
337
+ | `/inspect` | Turn-by-turn narrative of a live session with anomaly detection |
338
+
339
+ ## Examples
340
+
341
+ Working examples are in the [`examples/`](examples/) directory:
342
+
343
+ | Example | Description |
344
+ |---------|-------------|
345
+ | [ADK](examples/zero-code-examples/adk/) | Google ADK agent with zero-code OTel export |
346
+ | [LangChain](examples/zero-code-examples/langchain/) | LangChain agent with zero-code OTel export |
347
+ | [Strands](examples/zero-code-examples/strands/) | Strands SDK agent with zero-code OTel export |
348
+ | [OpenAI Agents](examples/zero-code-examples/openai-agents/) | OpenAI Agents SDK with zero-code OTel export |
349
+ | [Ollama](examples/zero-code-examples/ollama/) | LangChain + Ollama for local LLM evaluation |
350
+ | [Kubernetes](examples/kubernetes/) | End-to-end deployment with kagent and OTel Collector |
351
+
352
+ ## Docs
353
+
354
+ | Guide | Description |
355
+ |-------|-------------|
356
+ | [Eval Set Format](docs/eval-set-format.md) | Schema, field reference, and examples for golden eval set JSON files |
357
+ | [Custom Evaluators](docs/custom-evaluators.md) | Write your own scoring logic in Python, JavaScript, or any language |
358
+ | [Live Streaming](docs/streaming.md) | Real-time trace streaming, dev server setup, and session management |
359
+ | [OpenTelemetry Compatibility](docs/otel-compatibility.md) | Supported OTel conventions, message delivery mechanisms, and OTLP receiver |
360
+
361
+ ## Development
362
+
363
+ ```bash
364
+ uv run pytest # run tests
365
+ uv run agentevals serve --dev # backend
366
+ cd ui && npm run dev # frontend (separate terminal)
367
+ ```
368
+
369
+ See [DEVELOPMENT.md](DEVELOPMENT.md) for build tiers, Makefile targets, and Nix setup. To contribute, see [CONTRIBUTING.md](CONTRIBUTING.md).
370
+
371
+ ## FAQ
372
+
373
+ **Do I need a database or any infrastructure to run agentevals?**
374
+
375
+ No. agentevals is a single `pip install` with no database, no message queue, and no external services. The CLI evaluates trace files directly from disk. The web UI and live streaming use in-memory session state.
376
+
377
+ **Does the CLI require a running server?**
378
+
379
+ No. `agentevals run` evaluates trace files entirely offline. The server (`agentevals serve`) is only needed for the web UI, live OTLP streaming, and server-dependent MCP tools like `list_sessions`.
380
+
381
+ **Can I use agentevals in CI/CD?**
382
+
383
+ Yes. Pass trace files and an eval set, set a threshold, and let the exit code gate your deployment. Combine with `--output json` for machine-readable results. No server process needed.
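A CI step along these lines could be a small Python wrapper, sketched below. It uses only the flags shown in the CLI section (`--eval-set`, `-m`, `--output json`). The function names are hypothetical, and the threshold option is left out because the text does not name it; `agentevals run --help` lists the actual flag.

```python
import subprocess


def build_command(traces, eval_set, metric="tool_trajectory_avg_score"):
    """Assemble an `agentevals run` invocation from flags shown in the CLI section."""
    return [
        "agentevals", "run", *traces,
        "--eval-set", eval_set,
        "-m", metric,
        "--output", "json",
    ]


def run_gate(traces, eval_set, metric="tool_trajectory_avg_score"):
    """Run the evaluation and return its exit code so the CI job can fail on it."""
    result = subprocess.run(
        build_command(traces, eval_set, metric),
        capture_output=True,
        text=True,
    )
    # Echo the machine-readable results into the job log.
    print(result.stdout)
    return result.returncode
```

A pipeline script would end with something like `sys.exit(run_gate(["samples/helm.json"], "samples/eval_set_helm.json"))`, so that a non-zero exit code from agentevals fails the job.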
384
+
385
+ **What if I switch agent frameworks?**
386
+
387
+ Because agentevals uses OpenTelemetry as its universal interface, switching frameworks does not require changing your evaluation setup. As long as your new framework emits OTel spans, the same eval sets and metrics work as before.
388
+
389
+ **Can I write evaluators in my own language?**
390
+
391
+ Yes. A custom evaluator is any program that reads JSON from stdin and writes a score to stdout. Python and JavaScript have first-class scaffolding support (`agentevals evaluator init`), but any language works.
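As a rough illustration of the stdin/stdout contract, the sketch below reads a JSON document, computes a score, and writes it back as JSON. The input fields `expected_tools` and `actual_tools` and the `{"score": ...}` output shape are assumptions made for this example. The real schema is defined in the Custom Evaluators guide.

```python
import json
import sys


def score(payload):
    """Fraction of expected tool names that appear among the actual tool calls."""
    expected = payload.get("expected_tools", [])   # hypothetical input field
    actual = set(payload.get("actual_tools", []))  # hypothetical input field
    if not expected:
        return 1.0
    hits = sum(1 for name in expected if name in actual)
    return hits / len(expected)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read one JSON document from stdin and write one JSON result to stdout."""
    payload = json.load(stdin)
    json.dump({"score": score(payload)}, stdout)  # hypothetical output shape
    stdout.write("\n")
```

Calling `main()` at the bottom of the file turns this into a standalone program. The same read, score, write loop carries over to any other language.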
392
+
393
+ **Can I plug agentevals into an existing OTel pipeline?**
394
+
395
+ Yes. The OTLP receiver accepts standard `http/protobuf` and `http/json` trace exports on port 4318 and OTLP/gRPC on port 4317, so it slots into any OpenTelemetry pipeline as just another exporter destination. You can also place an [OTel Collector](https://opentelemetry.io/docs/collector/) in front and have it forward traces to agentevals. The [Kubernetes example](examples/kubernetes/README.md) shows this pattern.
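To make "just another exporter destination" concrete, the sketch below builds a minimal OTLP/JSON export request by hand and posts it to the HTTP receiver. It assumes the receiver serves the standard OTLP/HTTP path `/v1/traces`. The function names and the span contents are illustrative, and a bare span like this one carries none of the attributes a real agent framework would emit.

```python
import json
import secrets
import time
import urllib.request


def build_trace_payload(session_name, span_name="demo-span"):
    """Build a one-span OTLP/JSON export request tagged with a session name."""
    now = time.time_ns()
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "agentevals.session_name",
                 "value": {"stringValue": session_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-example"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 32 hex characters
                    "spanId": secrets.token_hex(8),    # 16 hex characters
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(now - 1_000_000),
                    "endTimeUnixNano": str(now),
                }],
            }],
        }]
    }


def send_trace(payload, endpoint="http://localhost:4318"):
    """POST the payload as http/json; requires `agentevals serve` to be running."""
    request = urllib.request.Request(
        f"{endpoint}/v1/traces",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

In practice an OTel SDK or Collector produces this request for you. The point is only that nothing agentevals-specific is needed beyond the endpoint and the optional resource attributes.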
396
+
397
+ **How does this compare to ADK's evaluations?**
398
+
399
+ Unlike ADK's eval method, which couples agent execution with evaluation, agentevals only handles scoring: it takes pre-recorded traces and compares them against expected behavior using metrics like tool trajectory matching, response quality, and LLM-based judgments.
400
+
401
+ If you are iterating on agents locally, however, you can point them at agentevals and watch rich runtime information in your browser. To try this, install the bundled wheel and explore the Local Development option in the UI.
402
+
403
+ **How does this compare to Bedrock AgentCore's evaluation?**
404
+
405
+ AgentCore's evaluation integration (via `strands-agents-evals`) also couples agent execution with evaluation. It re-invokes the agent for each test case, converts the resulting OTel spans to AWS's ADOT format, and scores them against 4 built-in evaluators (Helpfulness, Accuracy, Harmfulness, Relevance) via a cloud API call. This means you need an AWS account, valid credentials, and network access for every evaluation.
406
+
407
+ agentevals scores pre-recorded traces locally without re-running anything. It works with standard Jaeger JSON and OTLP formats from any framework, supports open-ended metrics (tool trajectory matching, LLM-based judges, custom scorers), and ships with a CLI, web UI, and MCP server. No cloud dependency is required, although all of ADK's GCP-based evals are currently included.
408
+
409
+ **How does this compare to LangSmith?**
410
+
411
+ LangSmith is a cloud platform (self-hosting requires an Enterprise plan) where offline evaluation re-executes your application against curated datasets. Its deepest integration is with LangChain/LangGraph, though it can work with other frameworks. agentevals scores pre-recorded OTel traces without re-execution, requires no cloud account or enterprise license, and uses OpenTelemetry as the universal interface rather than a proprietary SDK.
412
+
413
+ **How does this compare to Langfuse?**
414
+
415
+ Langfuse is a full observability platform (requires Postgres, ClickHouse, Redis, and S3 for self-hosting) that supports both offline experiments (re-execution) and online evaluation of ingested traces. Traces must be ingested into Langfuse first via its SDK or OTel integration before they can be scored. agentevals evaluates raw OTel trace files or live OTLP streams directly with no database or platform infrastructure required.
416
+
417
+ **How does this compare to Opik?**
418
+
419
+ Opik's primary evaluation path re-runs your application code against a dataset, incurring additional LLM costs per eval run. It also supports online evaluation rules that auto-score production traces. While Opik supports OpenTelemetry ingestion alongside its own SDK, its evaluation workflow still centers on re-execution against datasets. agentevals evaluates pre-recorded OTel traces from any framework without re-execution, and runs entirely locally with no cloud dependency.