PyPI - agentevals-cli - Versions diffs - 0.7.0__tar.gz → 0.7.1__tar.gz - Mend

agentevals-cli 0.7.0__tar.gz → 0.7.1__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (235)

{agentevals_cli-0.7.0 → agentevals_cli-0.7.1}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: agentevals-cli
-Version: 0.7.0
+Version: 0.7.1
 Summary: Standalone framework to evaluate agent correctness based on portable OpenTelemetry traces
 License-File: LICENSE
 Requires-Python: >=3.11

{agentevals_cli-0.7.0 → agentevals_cli-0.7.1}/examples/README.md RENAMED

@@ -29,6 +29,7 @@ agentevals accepts OTLP/HTTP on port 4318 (`http/protobuf` and `http/json`) and
 | [zero-code-examples/ollama/](./zero-code-examples/ollama/) | LangChain | Ollama |
 | [zero-code-examples/strands/](./zero-code-examples/strands/) | Strands | OpenAI |
 | [zero-code-examples/adk/](./zero-code-examples/adk/) | Google ADK | Gemini |
+| [zero-code-examples/pydantic-ai/](./zero-code-examples/pydantic-ai/) | Pydantic AI | OpenAI |
 This approach works with any framework that has OTel instrumentation: LangChain, Strands, Google ADK, etc. If your framework already emits OTel spans, you only need to add `OTLPSpanExporter` (and `OTLPLogExporter` if it uses GenAI log-based content delivery).
@@ -103,6 +104,7 @@ Detection checks for `gen_ai.request.model` / `gen_ai.input.messages` (GenAI sem
 | [zero-code-examples/ollama/](./zero-code-examples/ollama/) | LangChain | Ollama | GenAI semconv (logs) | Standard OTLP export |
 | [zero-code-examples/strands/](./zero-code-examples/strands/) | Strands | OpenAI | GenAI semconv (events*) | Standard OTLP export |
 | [zero-code-examples/adk/](./zero-code-examples/adk/) | Google ADK | Gemini | ADK built-in | Standard OTLP export |
+| [zero-code-examples/pydantic-ai/](./zero-code-examples/pydantic-ai/) | Pydantic AI | OpenAI | GenAI semconv (span attrs) | Standard OTLP export |
 | [langchain_agent](./langchain_agent/) | LangChain | OpenAI | GenAI semconv (logs) | SDK WebSocket |
 | [strands_agent](./strands_agent/) | Strands | OpenAI | GenAI semconv (events*) | SDK WebSocket |
 | [dice_agent](./dice_agent/) | Google ADK | Gemini | ADK built-in | SDK WebSocket |
@@ -217,6 +219,7 @@ python examples/zero-code-examples/langchain/run.py
 python examples/zero-code-examples/ollama/run.py
 python examples/zero-code-examples/strands/run.py
 python examples/zero-code-examples/adk/run.py
+python examples/zero-code-examples/pydantic-ai/run.py
 # SDK examples:
 python examples/sdk_example/context_manager_example.py
@@ -232,7 +235,7 @@ python examples/strands_agent/main.py
 Traces stream to the dev server in real-time. Evaluation runs automatically when the session completes.
 See each example's README for prerequisites and detailed instructions:
-- [zero-code-examples/](./zero-code-examples/) (LangChain + Strands, standard OTLP)
+- [zero-code-examples/](./zero-code-examples/) (LangChain, Strands, ADK, OpenAI Agents, Pydantic AI — standard OTLP)
 - [dice_agent/README.md](./dice_agent/README.md) (Google ADK + Gemini)
 - [langchain_agent/README.md](./langchain_agent/README.md) (LangChain + OpenAI, SDK)
 - [strands_agent/](./strands_agent/) (Strands + OpenAI, SDK)

agentevals_cli-0.7.1/examples/zero-code-examples/pydantic-ai/requirements.txt ADDED

@@ -0,0 +1,5 @@
+pydantic-ai>=1.81.0
+opentelemetry-sdk>=1.36.0
+opentelemetry-exporter-otlp-proto-http>=1.36.0
+python-dotenv>=1.0.0

agentevals_cli-0.7.1/examples/zero-code-examples/pydantic-ai/run.py ADDED

@@ -0,0 +1,105 @@
+"""Run a dice-rolling Pydantic AI agent with OTLP export — no agentevals SDK.
+Demonstrates zero-code integration: any OTel-instrumented agent streams
+traces to agentevals by pointing the OTLP exporter at the receiver.
+Pydantic AI has built-in OTel support via Agent.instrument_all(). By default
+it uses version 2 of the GenAI semconv format, storing message content in span
+attributes — only a TracerProvider is needed.
+No separate instrumentation library is needed.
+Prerequisites:
+    1. pip install -r requirements.txt
+    2. agentevals serve --dev
+    3. export OPENAI_API_KEY="your-key-here"
+Usage:
+    python examples/zero-code-examples/pydantic-ai/run.py
+"""
+import os
+import random
+from dotenv import load_dotenv
+from opentelemetry import trace
+from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
+from opentelemetry.sdk.resources import Resource
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import BatchSpanProcessor
+from pydantic_ai import Agent
+load_dotenv(override=True)
+def roll_die(sides: int) -> int:
+    """Roll a die with the given number of sides and return the result."""
+    return random.randint(1, sides)
+def check_prime(number: int) -> bool:
+    """Return True if the number is prime, False otherwise."""
+    if number < 2:
+        return False
+    for i in range(2, int(number**0.5) + 1):
+        if number % i == 0:
+            return False
+    return True
+def main():
+    if not os.getenv("OPENAI_API_KEY"):
+        print("OPENAI_API_KEY not set.")
+        return
+    endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4318")
+    print(f"OTLP endpoint: {endpoint}")
+    os.environ.setdefault(
+        "OTEL_RESOURCE_ATTRIBUTES",
+        "agentevals.eval_set_id=pydantic_ai_eval,agentevals.session_name=pydantic-ai-zero-code",
+    )
+    resource = Resource.create()
+    tracer_provider = TracerProvider(resource=resource)
+    tracer_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(), schedule_delay_millis=1000))
+    trace.set_tracer_provider(tracer_provider)
+    # Enable Pydantic AI's built-in OTel instrumentation. This one call
+    # wires up all agents globally — no framework-specific instrumentor
+    # library (like opentelemetry-instrumentation-openai-v2) is needed.
+    Agent.instrument_all()
+    agent = Agent(
+        "openai:gpt-4o-mini",
+        instructions="You are a helpful assistant. You can roll dice and check if numbers are prime.",
+    )
+    agent.tool_plain(roll_die)
+    agent.tool_plain(check_prime)
+    test_queries = [
+        "Hi! Can you help me?",
+        "Roll a 20-sided die for me",
+        "Is the number you rolled prime?",
+    ]
+    message_history = []
+    try:
+        for i, query in enumerate(test_queries, 1):
+            print(f"\n[{i}/{len(test_queries)}] User: {query}")
+            result = agent.run_sync(query, message_history=message_history)
+            print(f"     Agent: {result.output}")
+            # Pass the full message history forward for multi-turn conversation.
+            message_history = result.all_messages()
+    finally:
+        print()
+        tracer_provider.force_flush()
+        print("All traces flushed to OTLP receiver.")
+if __name__ == "__main__":
+    main()
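
The example above sets `OTEL_RESOURCE_ATTRIBUTES` before calling `Resource.create()`, which is how the eval set ID and session name reach the receiver. As a rough illustration of what that comma-separated string carries, the sketch below splits it into key/value pairs with the standard library only. The helper name `parse_resource_attributes` is hypothetical and is not part of agentevals or the OpenTelemetry SDK; the SDK's own parsing may differ in details such as escaping.

```python
def parse_resource_attributes(raw: str) -> dict[str, str]:
    """Split a comma-separated key=value string into a dict.

    Hypothetical helper for illustration; not the OpenTelemetry SDK parser.
    """
    attrs: dict[str, str] = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue  # skip malformed fragments instead of raising
        key, value = pair.split("=", 1)
        attrs[key.strip()] = value.strip()
    return attrs


# The default value that run.py passes to os.environ.setdefault().
raw = "agentevals.eval_set_id=pydantic_ai_eval,agentevals.session_name=pydantic-ai-zero-code"
attrs = parse_resource_attributes(raw)

print(attrs["agentevals.eval_set_id"])
print(attrs["agentevals.session_name"])
```

Because `run.py` uses `setdefault`, exporting a different `OTEL_RESOURCE_ATTRIBUTES` value before running the script overrides these defaults.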

{agentevals_cli-0.7.0 → agentevals_cli-0.7.1}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "agentevals-cli"
-version = "0.7.0"
+version = "0.7.1"
 description = "Standalone framework to evaluate agent correctness based on portable OpenTelemetry traces"
 readme = "README.md"
 requires-python = ">=3.11"

{agentevals_cli-0.7.0 → agentevals_cli-0.7.1}/src/agentevals/api/app.py RENAMED

@@ -10,8 +10,7 @@ from contextlib import asynccontextmanager
 from pathlib import Path
 from typing import TYPE_CHECKING
-from fastapi import FastAPI, Request
-from fastapi import WebSocket
+from fastapi import FastAPI, Request, WebSocket
 from fastapi.middleware.cors import CORSMiddleware
 from fastapi.responses import StreamingResponse

{agentevals_cli-0.7.0 → agentevals_cli-0.7.1}/src/agentevals/api/models.py RENAMED

@@ -11,6 +11,8 @@ from typing import Any, Generic, TypeVar
 from pydantic import BaseModel, ConfigDict, Field
 from pydantic.alias_generators import to_camel
+from ..config import EvalParams
 T = TypeVar("T")
@@ -134,6 +136,14 @@ class ConvertTracesData(CamelModel):
     traces: list[TraceConversionEntry]
+class EvaluateJsonRequest(CamelModel):
+    """Request body for JSON-based trace evaluation (``POST /evaluate/json``)."""
+    traces: dict = Field(description="OTLP JSON export with resourceSpans structure.")
+    config: EvalParams = Field(default_factory=EvalParams, description="Evaluation parameters.")
+    eval_set: dict | None = Field(default=None, description="Optional ADK EvalSet JSON.")
 # ---------------------------------------------------------------------------
 # SSE evaluation event models
 # ---------------------------------------------------------------------------
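
A request body for the new `EvaluateJsonRequest` model might look roughly like the sketch below. It only builds and serializes the payload with the standard library and sends nothing. Several things here are assumptions rather than facts from the diff: the span contents are illustrative placeholders, the `evalSet` key assumes that `CamelModel` applies a camelCase alias generator (its definition is not shown in this diff), and the byte limit is copied from `_MAX_JSON_BODY_BYTES` in `routes.py`.

```python
import json

# Mirrors _MAX_JSON_BODY_BYTES from routes.py in this release (50 MB).
MAX_JSON_BODY_BYTES = 50 * 1024 * 1024

payload = {
    # The loader requires a top-level "resourceSpans" key.
    # Everything inside it is an illustrative placeholder.
    "traces": {
        "resourceSpans": [
            {
                "resource": {"attributes": []},
                "scopeSpans": [
                    {
                        "spans": [
                            {
                                "traceId": "0123456789abcdef0123456789abcdef",
                                "spanId": "0123456789abcdef",
                                "name": "example-span",
                                "attributes": [],
                            }
                        ]
                    }
                ],
            }
        ]
    },
    # EvalParams fields, written with camelCase aliases.
    "config": {
        "metrics": ["tool_trajectory_avg_score"],
        "threshold": 0.8,
        "maxConcurrentTraces": 4,
    },
    # Optional; the key name assumes CamelModel uses camelCase aliases.
    "evalSet": None,
}

body = json.dumps(payload)
fits_limit = len(body.encode("utf-8")) <= MAX_JSON_BODY_BYTES
print(f"{len(body)} characters, within limit: {fits_limit}")
```

Omitting `config` entirely should also be valid, since the field declares `default_factory=EvalParams`.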

{agentevals_cli-0.7.0 → agentevals_cli-0.7.1}/src/agentevals/api/routes.py RENAMED

@@ -11,7 +11,7 @@ import shutil
 import tempfile
 from typing import Any
-from fastapi import APIRouter, File, Form, HTTPException, UploadFile
+from fastapi import APIRouter, File, Form, HTTPException, Request, UploadFile
 from fastapi.responses import StreamingResponse
 from pydantic.alias_generators import to_camel
@@ -27,13 +27,22 @@ from ..config import (
 )
 from ..converter import convert_traces
 from ..extraction import get_extractor
-from ..runner import RunResult, get_loader, load_eval_set, run_evaluation
+from ..loader.otlp import OtlpJsonLoader
+from ..runner import (
+    RunResult,
+    get_loader,
+    load_eval_set,
+    load_eval_set_from_dict,
+    run_evaluation,
+    run_evaluation_from_traces,
+)
 from ..trace_metrics import extract_performance_metrics, extract_trace_metadata
 from .models import (
     ApiKeyStatus,
     ConfigData,
     ConvertTracesData,
     EvalSetValidation,
+    EvaluateJsonRequest,
     HealthData,
     MetricInfo,
     SSEDoneEvent,
@@ -61,6 +70,8 @@ def _camel_keys(obj: Any) -> Any:
 router = APIRouter()
+_MAX_JSON_BODY_BYTES = 50 * 1024 * 1024  # 50 MB (multipart endpoints allow 10 MB per file)
 _TYPE_TO_MODEL = {
     "builtin": BuiltinMetricDef,
     "code": CodeEvaluatorDef,
@@ -729,3 +740,148 @@ async def evaluate_traces_stream(
             "Connection": "keep-alive",
         },
     )
+def _parse_json_request(request: EvaluateJsonRequest):
+    """Parse traces and eval set from an EvaluateJsonRequest.
+    Returns (traces, eval_set).  Raises HTTPException on invalid input.
+    """
+    try:
+        traces = OtlpJsonLoader().load_from_dict(request.traces)
+    except ValueError as exc:
+        raise HTTPException(status_code=400, detail=str(exc)) from exc
+    if not traces:
+        raise HTTPException(status_code=400, detail="No traces found in OTLP JSON")
+    eval_set = None
+    if request.eval_set:
+        try:
+            eval_set = load_eval_set_from_dict(request.eval_set)
+        except Exception as exc:
+            raise HTTPException(status_code=400, detail=f"Invalid eval set: {exc}") from exc
+    return traces, eval_set
+def _check_json_body_size(raw_request: Request):
+    content_length = int(raw_request.headers.get("content-length", 0))
+    if content_length > _MAX_JSON_BODY_BYTES:
+        raise HTTPException(
+            status_code=413,
+            detail=f"Request body exceeds {_MAX_JSON_BODY_BYTES // (1024 * 1024)}MB limit",
+        )
+def _sse_error(message: str) -> str:
+    return f"data: {SSEErrorEvent(error=message).model_dump_json(by_alias=True)}\n\n"
+@router.post("/evaluate/json", response_model=StandardResponse[RunResult])
+async def evaluate_traces_json(request: EvaluateJsonRequest, raw_request: Request):
+    """Evaluate OTLP JSON traces passed in the request body."""
+    _check_json_body_size(raw_request)
+    traces, eval_set = _parse_json_request(request)
+    try:
+        result = await run_evaluation_from_traces(
+            traces=traces,
+            config=request.config,
+            eval_set=eval_set,
+        )
+        return StandardResponse(data=_camel_keys(result.model_dump(by_alias=True)))
+    except Exception as exc:
+        logger.exception("JSON evaluation failed")
+        raise HTTPException(status_code=500, detail=f"Internal error: {exc!s}") from exc
+@router.post("/evaluate/json/stream")
+async def evaluate_traces_json_stream(request: EvaluateJsonRequest, raw_request: Request):
+    """Evaluate OTLP JSON traces with real-time progress via SSE."""
+    _check_json_body_size(raw_request)
+    async def event_generator():
+        try:
+            try:
+                traces, eval_set = _parse_json_request(request)
+            except HTTPException as exc:
+                yield _sse_error(exc.detail)
+                return
+            for trace in traces:
+                try:
+                    extractor = get_extractor(trace)
+                    perf_metrics = _camel_keys(extract_performance_metrics(trace, extractor))
+                    trace_metadata = _camel_keys(extract_trace_metadata(trace, extractor))
+                    evt = SSEPerformanceMetricsEvent(
+                        trace_id=trace.trace_id,
+                        performance_metrics=perf_metrics,
+                        trace_metadata=trace_metadata,
+                    )
+                    yield f"event: performance_metrics\ndata: {evt.model_dump_json(by_alias=True)}\n\n"
+                except Exception as e:
+                    logger.error(f"Failed to extract early performance metrics: {e}")
+            queue: asyncio.Queue = asyncio.Queue()
+            async def progress_callback(message: str):
+                await queue.put(("progress", message))
+            async def trace_progress_callback(trace_result):
+                await queue.put(("trace_progress", trace_result))
+            async def run_with_progress():
+                result = await run_evaluation_from_traces(
+                    traces=traces,
+                    config=request.config,
+                    eval_set=eval_set,
+                    progress_callback=progress_callback,
+                    trace_progress_callback=trace_progress_callback,
+                )
+                await queue.put(("done", result))
+            eval_task = asyncio.create_task(run_with_progress())
+            try:
+                while True:
+                    msg = await queue.get()
+                    tag, payload = msg
+                    if tag == "done":
+                        evt = SSEDoneEvent(
+                            result=_camel_keys(payload.model_dump(by_alias=True)),
+                        )
+                        yield f"data: {evt.model_dump_json(by_alias=True)}\n\n"
+                        break
+                    elif tag == "trace_progress":
+                        evt = SSETraceProgressEvent(
+                            trace_progress=SSETraceProgress(
+                                trace_id=payload.trace_id,
+                                partial_result=_camel_keys(payload.model_dump(by_alias=True)),
+                            )
+                        )
+                        yield f"data: {evt.model_dump_json(by_alias=True)}\n\n"
+                    elif tag == "progress":
+                        evt = SSEProgressEvent(message=payload)
+                        yield f"data: {evt.model_dump_json(by_alias=True)}\n\n"
+            finally:
+                if not eval_task.done():
+                    eval_task.cancel()
+                    try:
+                        await eval_task
+                    except asyncio.CancelledError:
+                        pass
+        except Exception as exc:
+            logger.exception("JSON evaluation stream failed")
+            yield _sse_error(str(exc))
+    return StreamingResponse(
+        event_generator(),
+        media_type="text/event-stream",
+        headers={
+            "Cache-Control": "no-cache",
+            "Connection": "keep-alive",
+        },
+    )
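
The streaming endpoint above emits two frame shapes: `event: performance_metrics` followed by a `data:` line, and bare `data:` lines for progress, trace progress, completion, and errors. A client could split such a stream along the lines of the sketch below. The function name `parse_sse` and the JSON payloads in `sample` are hypothetical placeholders; the real payload fields come from the SSE event models, which this diff does not show in full.

```python
import json


def parse_sse(stream_text: str) -> list[tuple[str, dict]]:
    """Split SSE text into (event_name, decoded_json) pairs.

    Frames without an explicit "event:" line get the SSE default name "message".
    """
    events = []
    for frame in stream_text.split("\n\n"):
        if not frame.strip():
            continue  # ignore the empty tail after the final blank line
        event_name = "message"
        data_lines = []
        for line in frame.split("\n"):
            if line.startswith("event:"):
                event_name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:
            events.append((event_name, json.loads("\n".join(data_lines))))
    return events


# Placeholder frames shaped like the ones the endpoint yields.
sample = (
    'event: performance_metrics\ndata: {"traceId": "abc"}\n\n'
    'data: {"message": "working"}\n\n'
    'data: {"result": {}}\n\n'
)

events = parse_sse(sample)
for name, payload in events:
    print(name, payload)
```

Note that parse failures are reported inside the stream as an error frame rather than as an HTTP 400, because `_parse_json_request` runs inside the generator after the response has started.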

{agentevals_cli-0.7.0 → agentevals_cli-0.7.1}/src/agentevals/config.py RENAMED

@@ -5,7 +5,8 @@ from __future__ import annotations
 from pathlib import Path
 from typing import Annotated, Any, Literal
-from pydantic import BaseModel, Field, field_validator
+from pydantic import BaseModel, ConfigDict, Field, field_validator
+from pydantic.alias_generators import to_camel
 class BuiltinMetricDef(BaseModel):
@@ -99,13 +100,14 @@ CustomEvaluatorDef = Annotated[
 ]
-class EvalRunConfig(BaseModel):
-    trace_files: list[str] = Field(description="Paths to trace files (Jaeger JSON or OTLP JSON).")
+class EvalParams(BaseModel):
+    """Evaluation parameters independent of how traces are provided.
-    eval_set_file: str | None = Field(
-        default=None,
-        description="Path to a golden eval set JSON file (ADK EvalSet format).",
-    )
+    Used by ``run_evaluation_from_traces`` for programmatic / API-driven
+    evaluation.  ``EvalRunConfig`` inherits from this and adds file I/O fields.
+    """
+    model_config = ConfigDict(alias_generator=to_camel, populate_by_name=True)
     metrics: list[str] = Field(
         default_factory=lambda: ["tool_trajectory_avg_score"],
@@ -117,11 +119,6 @@ class EvalRunConfig(BaseModel):
         description="Custom evaluator definitions.",
     )
-    trace_format: str = Field(
-        default="jaeger-json",
-        description="Format of the trace files (jaeger-json or otlp-json).",
-    )
     judge_model: str | None = Field(
         default=None,
         description="LLM model for judge-based metrics.",
@@ -129,7 +126,9 @@ class EvalRunConfig(BaseModel):
     threshold: float | None = Field(
         default=None,
-        description="Score threshold for pass/fail.",
+        ge=0,
+        le=1,
+        description="Score threshold for pass/fail (0.0 to 1.0).",
     )
     trajectory_match_type: str | None = Field(
@@ -145,17 +144,35 @@ class EvalRunConfig(BaseModel):
             raise ValueError(f"Invalid trajectory_match_type '{v}'. Valid values: {sorted(valid)}")
         return v.upper() if v is not None else v
-    output_format: str = Field(
-        default="table",
-        description="Output format: 'table', 'json', or 'summary'.",
-    )
     max_concurrent_traces: int = Field(
         default=10,
+        ge=1,
         description="Maximum number of traces to evaluate concurrently.",
     )
     max_concurrent_evals: int = Field(
         default=5,
+        ge=1,
         description="Maximum number of concurrent metric evaluations (LLM API calls).",
     )
+class EvalRunConfig(EvalParams):
+    """Full configuration for file-based evaluation runs."""
+    trace_files: list[str] = Field(description="Paths to trace files (Jaeger JSON or OTLP JSON).")
+    eval_set_file: str | None = Field(
+        default=None,
+        description="Path to a golden eval set JSON file (ADK EvalSet format).",
+    )
+    trace_format: str = Field(
+        default="jaeger-json",
+        description="Format of the trace files (jaeger-json or otlp-json).",
+    )
+    output_format: str = Field(
+        default="table",
+        description="Output format: 'table', 'json', or 'summary'.",
+    )
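
`EvalParams` now sets `alias_generator=to_camel` with `populate_by_name=True`, so its fields should be accepted under either their snake_case names or camelCase aliases. The sketch below approximates that name mapping with the standard library for the fields visible in this diff. The local `to_camel` function is a simplified stand-in written for illustration, not pydantic's implementation, and it may differ on edge cases such as digits or consecutive underscores.

```python
def to_camel(snake: str) -> str:
    """Simplified snake_case to camelCase conversion (illustrative stand-in)."""
    head, *tail = snake.split("_")
    return head + "".join(part.capitalize() for part in tail)


# EvalParams field names that appear in this diff.
fields = [
    "metrics",
    "judge_model",
    "threshold",
    "trajectory_match_type",
    "max_concurrent_traces",
    "max_concurrent_evals",
]

aliases = {name: to_camel(name) for name in fields}
for name, alias in aliases.items():
    print(f"{name} -> {alias}")
```

The same release also tightens validation: `threshold` must lie between 0 and 1 inclusive, and both concurrency limits must be at least 1, so values outside those ranges that 0.7.0 accepted are rejected in 0.7.1.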

{agentevals_cli-0.7.0 → agentevals_cli-0.7.1}/src/agentevals/loader/otlp.py RENAMED

@@ -56,6 +56,12 @@ class OtlpJsonLoader(TraceLoader):
         logger.info("Loaded %d trace(s) from %s", len(traces), source)
         return traces
+    def load_from_dict(self, data: dict) -> list[Trace]:
+        """Load traces from an OTLP JSON dict (resourceSpans structure)."""
+        if "resourceSpans" not in data:
+            raise ValueError("Expected OTLP JSON with 'resourceSpans' key")
+        return self._parse_otlp_export(data)
     def _parse_otlp_export(self, data: dict) -> list[Trace]:
         """Parse full OTLP export structure with resourceSpans."""
         all_spans = []
@@ -122,23 +128,40 @@ class OtlpJsonLoader(TraceLoader):
         Some SDKs (e.g. Strands) store message content in span events rather
         than span attributes. This promotes those values so the converter can
         find them via normal attribute lookups.
+        Accepts events in OTLP array format or flat/nested dict format.
         """
         for event in span_data.get("events", []):
-            for attr in event.get("attributes", []):
-                key = attr.get("key", "")
-                if key in self._GENAI_EVENT_KEYS and key not in attributes:
-                    value_obj = attr.get("value", {})
-                    if "stringValue" in value_obj:
-                        attributes[key] = value_obj["stringValue"]
-    def _extract_attributes(self, attrs_list: list[dict]) -> dict:
-        """Convert OTLP attributes array to flat dict.
-        OTLP attributes are [{key, value: {stringValue|intValue|...}}]
-        We flatten to {key: value} for easier use.
+            event_attrs = event.get("attributes", [])
+            if isinstance(event_attrs, dict):
+                flat = self._flatten_nested_dict(event_attrs)
+                for key in self._GENAI_EVENT_KEYS:
+                    if key in flat and key not in attributes:
+                        attributes[key] = flat[key]
+            else:
+                for attr in event_attrs:
+                    key = attr.get("key", "")
+                    if key in self._GENAI_EVENT_KEYS and key not in attributes:
+                        value_obj = attr.get("value", {})
+                        if "stringValue" in value_obj:
+                            attributes[key] = value_obj["stringValue"]
+    def _extract_attributes(self, attrs) -> dict:
+        """Convert attributes to a flat ``{key: value}`` dict.
+        Accepts three formats:
+        1. OTLP array: ``[{key, value: {stringValue|intValue|...}}]``
+        2. Flat dict: ``{"gen_ai.operation.name": "chat"}``
+        3. Nested dict (ClickHouse JSON column): ``{"gen_ai": {"operation": {"name": "chat"}}}``
+        Formats 2 and 3 are auto-detected by checking whether *attrs* is a dict.
+        Nested dicts are recursively flattened to dot-notation keys.
         """
+        if isinstance(attrs, dict):
+            return self._flatten_nested_dict(attrs)
         result = {}
-        for attr in attrs_list:
+        for attr in attrs:
             key = attr.get("key", "")
             value_obj = attr.get("value", {})
@@ -157,6 +180,25 @@ class OtlpJsonLoader(TraceLoader):
         return result
+    @staticmethod
+    def _flatten_nested_dict(d: dict, prefix: str = "") -> dict:
+        """Recursively flatten a nested dict to dot-notation keys.
+        ``{"gen_ai": {"operation": {"name": "chat"}}}``
+        becomes ``{"gen_ai.operation.name": "chat"}``.
+        Already-flat keys (e.g. ``{"service.name": "agent"}``) pass through
+        unchanged.
+        """
+        result = {}
+        for key, value in d.items():
+            full_key = f"{prefix}{key}" if not prefix else f"{prefix}.{key}"
+            if isinstance(value, dict):
+                result.update(OtlpJsonLoader._flatten_nested_dict(value, full_key))
+            else:
+                result[full_key] = value
+        return result
     def _build_traces(self, all_spans: list[Span]) -> list[Trace]:
         """Group spans by trace_id and build parent-child relationships."""
         traces_by_id: dict[str, list[Span]] = {}
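
The flattening behavior added to the loader can be tried in isolation. The sketch below is a standalone function that follows the same recursion as `_flatten_nested_dict` in the diff; the name `flatten_nested_dict` and the sample inputs are chosen here for illustration.

```python
def flatten_nested_dict(d: dict, prefix: str = "") -> dict:
    """Recursively flatten a nested dict to dot-notation keys."""
    result = {}
    for key, value in d.items():
        # Join with a dot only when there is a parent prefix.
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            result.update(flatten_nested_dict(value, full_key))
        else:
            result[full_key] = value
    return result


# Nested form, as a ClickHouse JSON column might return it.
nested = {"gen_ai": {"operation": {"name": "chat"}, "request": {"model": "gpt-4o-mini"}}}
# Already-flat form with dotted keys.
flat = {"service.name": "agent"}

flattened_nested = flatten_nested_dict(nested)
flattened_flat = flatten_nested_dict(flat)

print(flattened_nested)
print(flattened_flat)
```

One consequence of this approach is that any attribute whose value is itself a dict gets expanded into dotted keys, so a dict-valued attribute cannot be preserved as a single value in the dict input formats.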
