deepeval 3.5.8.tar.gz → 3.8.0.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (605)
  1. {deepeval-3.5.8 → deepeval-3.8.0}/PKG-INFO +12 -14
  2. {deepeval-3.5.8 → deepeval-3.8.0}/README.md +9 -8
  3. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/__init__.py +42 -14
  4. deepeval-3.8.0/deepeval/_version.py +1 -0
  5. deepeval-3.8.0/deepeval/anthropic/__init__.py +19 -0
  6. deepeval-3.8.0/deepeval/anthropic/extractors.py +94 -0
  7. deepeval-3.8.0/deepeval/anthropic/patch.py +169 -0
  8. deepeval-3.8.0/deepeval/anthropic/utils.py +225 -0
  9. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/drop/drop.py +45 -16
  10. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/equity_med_qa/equity_med_qa.py +1 -0
  11. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/human_eval/human_eval.py +2 -1
  12. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/ifeval/ifeval.py +2 -2
  13. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/mmlu/mmlu.py +6 -4
  14. deepeval-3.8.0/deepeval/cli/main.py +3109 -0
  15. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/cli/test.py +1 -1
  16. deepeval-3.8.0/deepeval/cli/utils.py +353 -0
  17. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/confident/api.py +10 -1
  18. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/confident/types.py +4 -2
  19. deepeval-3.8.0/deepeval/config/dotenv_handler.py +19 -0
  20. deepeval-3.8.0/deepeval/config/logging.py +33 -0
  21. deepeval-3.8.0/deepeval/config/settings.py +1589 -0
  22. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/config/settings_manager.py +5 -1
  23. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/config/utils.py +14 -1
  24. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/constants.py +9 -1
  25. deepeval-3.8.0/deepeval/contextvars.py +25 -0
  26. deepeval-3.8.0/deepeval/dataset/__init__.py +11 -0
  27. deepeval-3.8.0/deepeval/dataset/api.py +50 -0
  28. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/dataset/dataset.py +207 -54
  29. deepeval-3.8.0/deepeval/dataset/golden.py +197 -0
  30. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/dataset/test_run_tracer.py +4 -6
  31. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/dataset/utils.py +44 -14
  32. deepeval-3.8.0/deepeval/errors.py +24 -0
  33. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/evaluate/compare.py +219 -4
  34. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/evaluate/configs.py +1 -1
  35. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/evaluate/evaluate.py +29 -14
  36. deepeval-3.8.0/deepeval/evaluate/execute.py +3184 -0
  37. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/evaluate/types.py +11 -1
  38. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/evaluate/utils.py +107 -166
  39. deepeval-3.8.0/deepeval/integrations/crewai/__init__.py +9 -0
  40. deepeval-3.8.0/deepeval/integrations/crewai/handler.py +232 -0
  41. deepeval-3.8.0/deepeval/integrations/crewai/subs.py +51 -0
  42. deepeval-3.8.0/deepeval/integrations/crewai/tool.py +71 -0
  43. deepeval-3.8.0/deepeval/integrations/crewai/wrapper.py +127 -0
  44. deepeval-3.8.0/deepeval/integrations/langchain/callback.py +542 -0
  45. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/integrations/langchain/utils.py +31 -8
  46. deepeval-3.8.0/deepeval/integrations/llama_index/__init__.py +6 -0
  47. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/integrations/llama_index/handler.py +77 -24
  48. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/integrations/llama_index/utils.py +24 -0
  49. deepeval-3.8.0/deepeval/integrations/pydantic_ai/__init__.py +5 -0
  50. deepeval-3.8.0/deepeval/integrations/pydantic_ai/agent.py +38 -0
  51. deepeval-3.8.0/deepeval/integrations/pydantic_ai/instrumentator.py +325 -0
  52. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/integrations/pydantic_ai/otel.py +13 -3
  53. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/key_handler.py +133 -52
  54. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/__init__.py +32 -16
  55. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/answer_relevancy/answer_relevancy.py +128 -117
  56. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/answer_relevancy/template.py +26 -7
  57. deepeval-3.8.0/deepeval/metrics/api.py +281 -0
  58. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/arena_g_eval/arena_g_eval.py +103 -97
  59. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/arena_g_eval/template.py +17 -1
  60. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/arena_g_eval/utils.py +5 -5
  61. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/argument_correctness/argument_correctness.py +93 -89
  62. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/argument_correctness/template.py +21 -4
  63. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/base_metric.py +20 -44
  64. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/bias/bias.py +112 -109
  65. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/bias/template.py +17 -5
  66. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/contextual_precision/contextual_precision.py +115 -98
  67. deepeval-3.8.0/deepeval/metrics/contextual_precision/template.py +133 -0
  68. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/contextual_recall/contextual_recall.py +105 -86
  69. deepeval-3.8.0/deepeval/metrics/contextual_recall/template.py +126 -0
  70. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/contextual_relevancy/contextual_relevancy.py +98 -85
  71. deepeval-3.8.0/deepeval/metrics/contextual_relevancy/template.py +106 -0
  72. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/conversation_completeness/conversation_completeness.py +113 -119
  73. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/conversation_completeness/template.py +25 -5
  74. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/conversational_dag/conversational_dag.py +24 -8
  75. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/conversational_dag/nodes.py +78 -127
  76. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/conversational_dag/templates.py +20 -4
  77. deepeval-3.8.0/deepeval/metrics/conversational_g_eval/__init__.py +3 -0
  78. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/conversational_g_eval/conversational_g_eval.py +157 -132
  79. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/conversational_g_eval/template.py +4 -3
  80. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/dag/dag.py +22 -0
  81. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/dag/nodes.py +75 -130
  82. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/dag/schema.py +1 -1
  83. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/dag/templates.py +19 -5
  84. deepeval-3.8.0/deepeval/metrics/exact_match/exact_match.py +102 -0
  85. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/faithfulness/faithfulness.py +158 -150
  86. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/faithfulness/schema.py +1 -1
  87. deepeval-3.8.0/deepeval/metrics/faithfulness/template.py +225 -0
  88. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/g_eval/g_eval.py +161 -86
  89. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/g_eval/template.py +18 -1
  90. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/g_eval/utils.py +73 -7
  91. deepeval-3.8.0/deepeval/metrics/goal_accuracy/__init__.py +1 -0
  92. deepeval-3.8.0/deepeval/metrics/goal_accuracy/goal_accuracy.py +364 -0
  93. deepeval-3.8.0/deepeval/metrics/goal_accuracy/schema.py +17 -0
  94. deepeval-3.8.0/deepeval/metrics/goal_accuracy/template.py +253 -0
  95. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/hallucination/hallucination.py +79 -83
  96. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/hallucination/template.py +17 -4
  97. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/indicator.py +43 -16
  98. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/json_correctness/json_correctness.py +52 -39
  99. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/json_correctness/template.py +10 -0
  100. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/knowledge_retention/knowledge_retention.py +72 -97
  101. deepeval-3.8.0/deepeval/metrics/knowledge_retention/schema.py +21 -0
  102. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/knowledge_retention/template.py +12 -0
  103. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/mcp/mcp_task_completion.py +90 -43
  104. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/mcp/multi_turn_mcp_use_metric.py +122 -81
  105. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/mcp/schema.py +4 -0
  106. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/mcp/template.py +59 -0
  107. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/mcp_use_metric/mcp_use_metric.py +72 -66
  108. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/mcp_use_metric/template.py +12 -0
  109. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/misuse/misuse.py +89 -98
  110. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/misuse/template.py +17 -2
  111. deepeval-3.8.0/deepeval/metrics/multimodal_metrics/__init__.py +5 -0
  112. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/image_coherence/image_coherence.py +62 -53
  113. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/image_editing/image_editing.py +82 -95
  114. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/image_helpfulness/image_helpfulness.py +62 -53
  115. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/image_reference/image_reference.py +62 -53
  116. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/text_to_image/text_to_image.py +111 -109
  117. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/non_advice/non_advice.py +91 -105
  118. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/non_advice/template.py +14 -2
  119. deepeval-3.8.0/deepeval/metrics/pattern_match/pattern_match.py +111 -0
  120. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/pii_leakage/pii_leakage.py +87 -107
  121. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/pii_leakage/template.py +16 -2
  122. deepeval-3.8.0/deepeval/metrics/plan_adherence/__init__.py +1 -0
  123. deepeval-3.8.0/deepeval/metrics/plan_adherence/plan_adherence.py +266 -0
  124. deepeval-3.8.0/deepeval/metrics/plan_adherence/schema.py +11 -0
  125. deepeval-3.8.0/deepeval/metrics/plan_adherence/template.py +181 -0
  126. deepeval-3.8.0/deepeval/metrics/plan_quality/__init__.py +1 -0
  127. deepeval-3.8.0/deepeval/metrics/plan_quality/plan_quality.py +268 -0
  128. deepeval-3.8.0/deepeval/metrics/plan_quality/schema.py +11 -0
  129. deepeval-3.8.0/deepeval/metrics/plan_quality/template.py +110 -0
  130. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/prompt_alignment/prompt_alignment.py +103 -82
  131. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/prompt_alignment/template.py +16 -4
  132. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/ragas.py +3 -3
  133. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/role_adherence/role_adherence.py +60 -71
  134. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/role_adherence/template.py +14 -0
  135. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/role_violation/role_violation.py +87 -108
  136. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/role_violation/template.py +14 -2
  137. deepeval-3.8.0/deepeval/metrics/step_efficiency/__init__.py +1 -0
  138. deepeval-3.8.0/deepeval/metrics/step_efficiency/schema.py +11 -0
  139. deepeval-3.8.0/deepeval/metrics/step_efficiency/step_efficiency.py +224 -0
  140. deepeval-3.8.0/deepeval/metrics/step_efficiency/template.py +267 -0
  141. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/summarization/summarization.py +127 -184
  142. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/summarization/template.py +19 -0
  143. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/task_completion/task_completion.py +80 -75
  144. deepeval-3.8.0/deepeval/metrics/tool_correctness/schema.py +6 -0
  145. deepeval-3.8.0/deepeval/metrics/tool_correctness/template.py +88 -0
  146. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/tool_correctness/tool_correctness.py +240 -27
  147. deepeval-3.8.0/deepeval/metrics/tool_use/__init__.py +1 -0
  148. deepeval-3.8.0/deepeval/metrics/tool_use/schema.py +23 -0
  149. deepeval-3.8.0/deepeval/metrics/tool_use/template.py +234 -0
  150. deepeval-3.8.0/deepeval/metrics/tool_use/tool_use.py +436 -0
  151. deepeval-3.8.0/deepeval/metrics/topic_adherence/__init__.py +1 -0
  152. deepeval-3.8.0/deepeval/metrics/topic_adherence/schema.py +20 -0
  153. deepeval-3.8.0/deepeval/metrics/topic_adherence/template.py +182 -0
  154. deepeval-3.8.0/deepeval/metrics/topic_adherence/topic_adherence.py +342 -0
  155. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/toxicity/template.py +17 -4
  156. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/toxicity/toxicity.py +92 -99
  157. deepeval-3.8.0/deepeval/metrics/turn_contextual_precision/schema.py +21 -0
  158. deepeval-3.8.0/deepeval/metrics/turn_contextual_precision/template.py +194 -0
  159. deepeval-3.8.0/deepeval/metrics/turn_contextual_precision/turn_contextual_precision.py +550 -0
  160. deepeval-3.8.0/deepeval/metrics/turn_contextual_recall/schema.py +21 -0
  161. deepeval-3.8.0/deepeval/metrics/turn_contextual_recall/template.py +185 -0
  162. deepeval-3.8.0/deepeval/metrics/turn_contextual_recall/turn_contextual_recall.py +525 -0
  163. {deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_contextual_relevancy → deepeval-3.8.0/deepeval/metrics/turn_contextual_relevancy}/schema.py +7 -1
  164. deepeval-3.8.0/deepeval/metrics/turn_contextual_relevancy/template.py +168 -0
  165. deepeval-3.8.0/deepeval/metrics/turn_contextual_relevancy/turn_contextual_relevancy.py +532 -0
  166. {deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_faithfulness → deepeval-3.8.0/deepeval/metrics/turn_faithfulness}/schema.py +11 -3
  167. deepeval-3.8.0/deepeval/metrics/turn_faithfulness/template.py +225 -0
  168. deepeval-3.8.0/deepeval/metrics/turn_faithfulness/turn_faithfulness.py +573 -0
  169. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/turn_relevancy/template.py +16 -2
  170. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/turn_relevancy/turn_relevancy.py +68 -69
  171. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/utils.py +175 -121
  172. deepeval-3.8.0/deepeval/model_integrations/types.py +20 -0
  173. deepeval-3.8.0/deepeval/model_integrations/utils.py +116 -0
  174. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/models/__init__.py +4 -10
  175. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/models/base_model.py +52 -34
  176. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/models/embedding_models/__init__.py +7 -0
  177. deepeval-3.8.0/deepeval/models/embedding_models/azure_embedding_model.py +166 -0
  178. deepeval-3.8.0/deepeval/models/embedding_models/local_embedding_model.py +132 -0
  179. deepeval-3.8.0/deepeval/models/embedding_models/ollama_embedding_model.py +113 -0
  180. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/models/embedding_models/openai_embedding_model.py +61 -34
  181. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/models/llms/__init__.py +4 -0
  182. deepeval-3.8.0/deepeval/models/llms/amazon_bedrock_model.py +316 -0
  183. deepeval-3.8.0/deepeval/models/llms/anthropic_model.py +298 -0
  184. deepeval-3.8.0/deepeval/models/llms/azure_model.py +458 -0
  185. deepeval-3.8.0/deepeval/models/llms/constants.py +2055 -0
  186. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/models/llms/deepseek_model.py +114 -52
  187. deepeval-3.8.0/deepeval/models/llms/gemini_model.py +430 -0
  188. deepeval-3.8.0/deepeval/models/llms/grok_model.py +312 -0
  189. deepeval-3.8.0/deepeval/models/llms/kimi_model.py +294 -0
  190. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/models/llms/litellm_model.py +190 -56
  191. deepeval-3.8.0/deepeval/models/llms/local_model.py +242 -0
  192. deepeval-3.8.0/deepeval/models/llms/ollama_model.py +237 -0
  193. deepeval-3.8.0/deepeval/models/llms/openai_model.py +488 -0
  194. deepeval-3.8.0/deepeval/models/llms/openrouter_model.py +398 -0
  195. deepeval-3.8.0/deepeval/models/llms/portkey_model.py +191 -0
  196. deepeval-3.8.0/deepeval/models/llms/utils.py +49 -0
  197. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/models/retry_policy.py +311 -26
  198. deepeval-3.8.0/deepeval/models/utils.py +173 -0
  199. deepeval-3.8.0/deepeval/openai/__init__.py +21 -0
  200. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/openai/extractors.py +82 -47
  201. deepeval-3.8.0/deepeval/openai/patch.py +295 -0
  202. deepeval-3.8.0/deepeval/openai/utils.py +211 -0
  203. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/openai_agents/__init__.py +4 -3
  204. deepeval-3.8.0/deepeval/openai_agents/agent.py +36 -0
  205. deepeval-3.8.0/deepeval/openai_agents/callback_handler.py +151 -0
  206. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/openai_agents/extractors.py +83 -7
  207. deepeval-3.8.0/deepeval/openai_agents/patch.py +309 -0
  208. deepeval-3.8.0/deepeval/openai_agents/runner.py +348 -0
  209. deepeval-3.8.0/deepeval/optimizer/__init__.py +5 -0
  210. deepeval-3.8.0/deepeval/optimizer/algorithms/__init__.py +6 -0
  211. deepeval-3.8.0/deepeval/optimizer/algorithms/base.py +29 -0
  212. deepeval-3.8.0/deepeval/optimizer/algorithms/configs.py +18 -0
  213. deepeval-3.8.0/deepeval/optimizer/algorithms/copro/__init__.py +5 -0
  214. deepeval-3.8.0/deepeval/optimizer/algorithms/copro/copro.py +836 -0
  215. deepeval-3.8.0/deepeval/optimizer/algorithms/gepa/__init__.py +5 -0
  216. deepeval-3.8.0/deepeval/optimizer/algorithms/gepa/gepa.py +737 -0
  217. deepeval-3.8.0/deepeval/optimizer/algorithms/miprov2/__init__.py +17 -0
  218. deepeval-3.8.0/deepeval/optimizer/algorithms/miprov2/bootstrapper.py +435 -0
  219. deepeval-3.8.0/deepeval/optimizer/algorithms/miprov2/miprov2.py +752 -0
  220. deepeval-3.8.0/deepeval/optimizer/algorithms/miprov2/proposer.py +301 -0
  221. deepeval-3.8.0/deepeval/optimizer/algorithms/simba/__init__.py +5 -0
  222. deepeval-3.8.0/deepeval/optimizer/algorithms/simba/simba.py +999 -0
  223. deepeval-3.8.0/deepeval/optimizer/algorithms/simba/types.py +15 -0
  224. deepeval-3.8.0/deepeval/optimizer/configs.py +31 -0
  225. deepeval-3.8.0/deepeval/optimizer/policies.py +227 -0
  226. deepeval-3.8.0/deepeval/optimizer/prompt_optimizer.py +263 -0
  227. deepeval-3.8.0/deepeval/optimizer/rewriter/__init__.py +5 -0
  228. deepeval-3.8.0/deepeval/optimizer/rewriter/rewriter.py +124 -0
  229. deepeval-3.8.0/deepeval/optimizer/rewriter/utils.py +214 -0
  230. deepeval-3.8.0/deepeval/optimizer/scorer/__init__.py +5 -0
  231. deepeval-3.8.0/deepeval/optimizer/scorer/base.py +86 -0
  232. deepeval-3.8.0/deepeval/optimizer/scorer/scorer.py +316 -0
  233. deepeval-3.8.0/deepeval/optimizer/scorer/utils.py +30 -0
  234. deepeval-3.8.0/deepeval/optimizer/types.py +148 -0
  235. deepeval-3.8.0/deepeval/optimizer/utils.py +480 -0
  236. deepeval-3.8.0/deepeval/prompt/__init__.py +21 -0
  237. deepeval-3.8.0/deepeval/prompt/api.py +234 -0
  238. deepeval-3.8.0/deepeval/prompt/prompt.py +837 -0
  239. deepeval-3.8.0/deepeval/prompt/utils.py +221 -0
  240. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/simulator/conversation_simulator.py +74 -20
  241. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/simulator/template.py +17 -2
  242. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/synthesizer/chunking/context_generator.py +217 -152
  243. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/synthesizer/chunking/doc_chunker.py +46 -12
  244. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/synthesizer/config.py +9 -0
  245. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/synthesizer/schema.py +23 -0
  246. deepeval-3.8.0/deepeval/synthesizer/synthesizer.py +2751 -0
  247. deepeval-3.8.0/deepeval/synthesizer/templates/__init__.py +12 -0
  248. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/synthesizer/templates/template.py +554 -1
  249. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/synthesizer/templates/template_extraction.py +32 -0
  250. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/synthesizer/templates/template_prompt.py +262 -0
  251. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/telemetry.py +3 -3
  252. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/test_case/__init__.py +3 -4
  253. deepeval-3.8.0/deepeval/test_case/api.py +112 -0
  254. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/test_case/arena_test_case.py +21 -5
  255. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/test_case/conversational_test_case.py +68 -1
  256. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/test_case/llm_test_case.py +215 -2
  257. deepeval-3.8.0/deepeval/test_case/utils.py +20 -0
  258. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/test_run/__init__.py +3 -1
  259. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/test_run/api.py +22 -16
  260. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/test_run/cache.py +37 -13
  261. deepeval-3.8.0/deepeval/test_run/hyperparameters.py +109 -0
  262. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/test_run/test_run.py +437 -227
  263. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/__init__.py +3 -0
  264. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/api.py +11 -8
  265. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/context.py +4 -0
  266. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/otel/exporter.py +248 -176
  267. deepeval-3.8.0/deepeval/tracing/otel/test_exporter.py +35 -0
  268. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/otel/utils.py +258 -23
  269. deepeval-3.8.0/deepeval/tracing/patchers.py +190 -0
  270. deepeval-3.8.0/deepeval/tracing/trace_context.py +107 -0
  271. deepeval-3.8.0/deepeval/tracing/trace_test_manager.py +19 -0
  272. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/tracing.py +129 -23
  273. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/types.py +29 -11
  274. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/utils.py +68 -84
  275. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/utils.py +357 -11
  276. {deepeval-3.5.8 → deepeval-3.8.0}/pyproject.toml +26 -10
  277. deepeval-3.5.8/deepeval/_version.py +0 -1
  278. deepeval-3.5.8/deepeval/cli/main.py +0 -1629
  279. deepeval-3.5.8/deepeval/cli/utils.py +0 -181
  280. deepeval-3.5.8/deepeval/config/settings.py +0 -671
  281. deepeval-3.5.8/deepeval/dataset/__init__.py +0 -5
  282. deepeval-3.5.8/deepeval/dataset/api.py +0 -28
  283. deepeval-3.5.8/deepeval/dataset/golden.py +0 -60
  284. deepeval-3.5.8/deepeval/errors.py +0 -6
  285. deepeval-3.5.8/deepeval/evaluate/execute.py +0 -2242
  286. deepeval-3.5.8/deepeval/integrations/crewai/__init__.py +0 -4
  287. deepeval-3.5.8/deepeval/integrations/crewai/agent.py +0 -98
  288. deepeval-3.5.8/deepeval/integrations/crewai/handler.py +0 -124
  289. deepeval-3.5.8/deepeval/integrations/crewai/patch.py +0 -41
  290. deepeval-3.5.8/deepeval/integrations/langchain/callback.py +0 -345
  291. deepeval-3.5.8/deepeval/integrations/llama_index/__init__.py +0 -10
  292. deepeval-3.5.8/deepeval/integrations/llama_index/agent/patched.py +0 -68
  293. deepeval-3.5.8/deepeval/integrations/pydantic_ai/__init__.py +0 -5
  294. deepeval-3.5.8/deepeval/integrations/pydantic_ai/agent.py +0 -339
  295. deepeval-3.5.8/deepeval/integrations/pydantic_ai/patcher.py +0 -484
  296. deepeval-3.5.8/deepeval/integrations/pydantic_ai/utils.py +0 -323
  297. deepeval-3.5.8/deepeval/metrics/contextual_precision/template.py +0 -84
  298. deepeval-3.5.8/deepeval/metrics/contextual_recall/template.py +0 -75
  299. deepeval-3.5.8/deepeval/metrics/contextual_relevancy/template.py +0 -77
  300. deepeval-3.5.8/deepeval/metrics/faithfulness/template.py +0 -140
  301. deepeval-3.5.8/deepeval/metrics/knowledge_retention/schema.py +0 -15
  302. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/__init__.py +0 -24
  303. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_answer_relevancy/multimodal_answer_relevancy.py +0 -338
  304. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_answer_relevancy/schema.py +0 -19
  305. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_answer_relevancy/template.py +0 -122
  306. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_contextual_precision/multimodal_contextual_precision.py +0 -288
  307. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_contextual_precision/schema.py +0 -15
  308. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_contextual_precision/template.py +0 -132
  309. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_contextual_recall/multimodal_contextual_recall.py +0 -282
  310. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_contextual_recall/schema.py +0 -15
  311. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_contextual_recall/template.py +0 -112
  312. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_contextual_relevancy/multimodal_contextual_relevancy.py +0 -279
  313. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_contextual_relevancy/template.py +0 -102
  314. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_faithfulness/multimodal_faithfulness.py +0 -353
  315. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_faithfulness/template.py +0 -175
  316. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_g_eval/multimodal_g_eval.py +0 -379
  317. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_g_eval/schema.py +0 -11
  318. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_g_eval/template.py +0 -148
  319. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_g_eval/utils.py +0 -68
  320. deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_tool_correctness/multimodal_tool_correctness.py +0 -285
  321. deepeval-3.5.8/deepeval/models/embedding_models/azure_embedding_model.py +0 -106
  322. deepeval-3.5.8/deepeval/models/embedding_models/local_embedding_model.py +0 -102
  323. deepeval-3.5.8/deepeval/models/embedding_models/ollama_embedding_model.py +0 -80
  324. deepeval-3.5.8/deepeval/models/llms/amazon_bedrock_model.py +0 -186
  325. deepeval-3.5.8/deepeval/models/llms/anthropic_model.py +0 -170
  326. deepeval-3.5.8/deepeval/models/llms/azure_model.py +0 -287
  327. deepeval-3.5.8/deepeval/models/llms/gemini_model.py +0 -232
  328. deepeval-3.5.8/deepeval/models/llms/grok_model.py +0 -237
  329. deepeval-3.5.8/deepeval/models/llms/kimi_model.py +0 -236
  330. deepeval-3.5.8/deepeval/models/llms/local_model.py +0 -130
  331. deepeval-3.5.8/deepeval/models/llms/ollama_model.py +0 -104
  332. deepeval-3.5.8/deepeval/models/llms/openai_model.py +0 -518
  333. deepeval-3.5.8/deepeval/models/llms/utils.py +0 -22
  334. deepeval-3.5.8/deepeval/models/mlllms/__init__.py +0 -3
  335. deepeval-3.5.8/deepeval/models/mlllms/gemini_model.py +0 -284
  336. deepeval-3.5.8/deepeval/models/mlllms/ollama_model.py +0 -144
  337. deepeval-3.5.8/deepeval/models/mlllms/openai_model.py +0 -258
  338. deepeval-3.5.8/deepeval/models/utils.py +0 -31
  339. deepeval-3.5.8/deepeval/openai/__init__.py +0 -37
  340. deepeval-3.5.8/deepeval/openai/patch.py +0 -204
  341. deepeval-3.5.8/deepeval/openai/utils.py +0 -86
  342. deepeval-3.5.8/deepeval/openai_agents/agent.py +0 -194
  343. deepeval-3.5.8/deepeval/openai_agents/callback_handler.py +0 -134
  344. deepeval-3.5.8/deepeval/openai_agents/patch.py +0 -115
  345. deepeval-3.5.8/deepeval/openai_agents/runner.py +0 -335
  346. deepeval-3.5.8/deepeval/prompt/__init__.py +0 -3
  347. deepeval-3.5.8/deepeval/prompt/api.py +0 -70
  348. deepeval-3.5.8/deepeval/prompt/prompt.py +0 -434
  349. deepeval-3.5.8/deepeval/prompt/utils.py +0 -50
  350. deepeval-3.5.8/deepeval/synthesizer/synthesizer.py +0 -1502
  351. deepeval-3.5.8/deepeval/synthesizer/templates/__init__.py +0 -3
  352. deepeval-3.5.8/deepeval/test_case/mllm_test_case.py +0 -147
  353. deepeval-3.5.8/deepeval/test_case/utils.py +0 -24
  354. deepeval-3.5.8/deepeval/test_run/hyperparameters.py +0 -66
  355. deepeval-3.5.8/deepeval/tracing/patchers.py +0 -84
  356. {deepeval-3.5.8 → deepeval-3.8.0}/LICENSE.md +0 -0
  357. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/annotation/__init__.py +0 -0
  358. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/annotation/annotation.py +0 -0
  359. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/annotation/api.py +0 -0
  360. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/__init__.py +0 -0
  361. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/arc/__init__.py +0 -0
  362. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/arc/arc.py +0 -0
  363. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/arc/mode.py +0 -0
  364. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/arc/template.py +0 -0
  365. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/base_benchmark.py +0 -0
  366. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/bbq/__init__.py +0 -0
  367. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/bbq/bbq.py +0 -0
  368. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/bbq/task.py +0 -0
  369. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/bbq/template.py +0 -0
  370. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/__init__.py +0 -0
  371. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/big_bench_hard.py +0 -0
  372. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/__init__.py +0 -0
  373. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/boolean_expressions.txt +0 -0
  374. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/causal_judgement.txt +0 -0
  375. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/date_understanding.txt +0 -0
  376. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/disambiguation_qa.txt +0 -0
  377. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/dyck_languages.txt +0 -0
  378. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/formal_fallacies.txt +0 -0
  379. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/geometric_shapes.txt +0 -0
  380. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/hyperbaton.txt +0 -0
  381. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/logical_deduction_five_objects.txt +0 -0
  382. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/logical_deduction_seven_objects.txt +0 -0
  383. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/logical_deduction_three_objects.txt +0 -0
  384. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/movie_recommendation.txt +0 -0
  385. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/multistep_arithmetic_two.txt +0 -0
  386. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/navigate.txt +0 -0
  387. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/object_counting.txt +0 -0
  388. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/penguins_in_a_table.txt +0 -0
  389. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/reasoning_about_colored_objects.txt +0 -0
  390. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/ruin_names.txt +0 -0
  391. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/salient_translation_error_detection.txt +0 -0
  392. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/snarks.txt +0 -0
  393. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/sports_understanding.txt +0 -0
  394. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/temporal_sequences.txt +0 -0
  395. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/tracking_shuffled_objects_five_objects.txt +0 -0
  396. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/tracking_shuffled_objects_seven_objects.txt +0 -0
  397. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/tracking_shuffled_objects_three_objects.txt +0 -0
  398. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/web_of_lies.txt +0 -0
  399. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/cot_prompts/word_sorting.txt +0 -0
  400. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/__init__.py +0 -0
  401. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/boolean_expressions.txt +0 -0
  402. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/causal_judgement.txt +0 -0
  403. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/date_understanding.txt +0 -0
  404. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/disambiguation_qa.txt +0 -0
  405. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/dyck_languages.txt +0 -0
  406. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/formal_fallacies.txt +0 -0
  407. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/geometric_shapes.txt +0 -0
  408. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/hyperbaton.txt +0 -0
  409. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/logical_deduction_five_objects.txt +0 -0
  410. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/logical_deduction_seven_objects.txt +0 -0
  411. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/logical_deduction_three_objects.txt +0 -0
  412. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/movie_recommendation.txt +0 -0
  413. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/multistep_arithmetic_two.txt +0 -0
  414. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/navigate.txt +0 -0
  415. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/object_counting.txt +0 -0
  416. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/penguins_in_a_table.txt +0 -0
  417. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/reasoning_about_colored_objects.txt +0 -0
  418. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/ruin_names.txt +0 -0
  419. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/salient_translation_error_detection.txt +0 -0
  420. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/snarks.txt +0 -0
  421. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/sports_understanding.txt +0 -0
  422. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/temporal_sequences.txt +0 -0
  423. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/tracking_shuffled_objects_five_objects.txt +0 -0
  424. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/tracking_shuffled_objects_seven_objects.txt +0 -0
  425. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/tracking_shuffled_objects_three_objects.txt +0 -0
  426. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/web_of_lies.txt +0 -0
  427. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/shot_prompts/word_sorting.txt +0 -0
  428. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/task.py +0 -0
  429. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/big_bench_hard/template.py +0 -0
  430. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/bool_q/__init__.py +0 -0
  431. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/bool_q/bool_q.py +0 -0
  432. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/bool_q/template.py +0 -0
  433. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/drop/__init__.py +0 -0
  434. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/drop/task.py +0 -0
  435. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/drop/template.py +0 -0
  436. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/equity_med_qa/__init__.py +0 -0
  437. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/equity_med_qa/task.py +0 -0
  438. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/equity_med_qa/template.py +0 -0
  439. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/gsm8k/__init__.py +0 -0
  440. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/gsm8k/gsm8k.py +0 -0
  441. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/gsm8k/template.py +0 -0
  442. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/hellaswag/__init__.py +0 -0
  443. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/hellaswag/hellaswag.py +0 -0
  444. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/hellaswag/task.py +0 -0
  445. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/hellaswag/template.py +0 -0
  446. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/human_eval/__init__.py +0 -0
  447. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/human_eval/task.py +0 -0
  448. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/human_eval/template.py +0 -0
  449. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/ifeval/__init__.py +0 -0
  450. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/ifeval/template.py +0 -0
  451. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/lambada/__init__.py +0 -0
  452. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/lambada/lambada.py +0 -0
  453. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/lambada/template.py +0 -0
  454. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/logi_qa/__init__.py +0 -0
  455. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/logi_qa/logi_qa.py +0 -0
  456. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/logi_qa/task.py +0 -0
  457. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/logi_qa/template.py +0 -0
  458. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/math_qa/__init__.py +0 -0
  459. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/math_qa/math_qa.py +0 -0
  460. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/math_qa/task.py +0 -0
  461. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/math_qa/template.py +0 -0
  462. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/mmlu/__init__.py +0 -0
  463. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/mmlu/task.py +0 -0
  464. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/mmlu/template.py +0 -0
  465. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/modes/__init__.py +0 -0
  466. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/results.py +0 -0
  467. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/schema.py +0 -0
  468. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/squad/__init__.py +0 -0
  469. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/squad/squad.py +0 -0
  470. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/squad/task.py +0 -0
  471. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/squad/template.py +0 -0
  472. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/tasks/__init__.py +0 -0
  473. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/truthful_qa/__init__.py +0 -0
  474. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/truthful_qa/mode.py +0 -0
  475. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/truthful_qa/task.py +0 -0
  476. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/truthful_qa/template.py +0 -0
  477. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/truthful_qa/truthful_qa.py +0 -0
  478. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/utils.py +0 -0
  479. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/winogrande/__init__.py +0 -0
  480. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/winogrande/template.py +0 -0
  481. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/benchmarks/winogrande/winogrande.py +0 -0
  482. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/cli/__init__.py +0 -0
  483. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/cli/dotenv_handler.py +0 -0
  484. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/cli/server.py +0 -0
  485. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/cli/types.py +0 -0
  486. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/confident/__init__.py +0 -0
  487. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/config/__init__.py +0 -0
  488. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/dataset/types.py +0 -0
  489. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/evaluate/__init__.py +0 -0
  490. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/evaluate/api.py +0 -0
  491. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/integrations/__init__.py +0 -0
  492. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/integrations/hugging_face/__init__.py +0 -0
  493. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/integrations/hugging_face/callback.py +0 -0
  494. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/integrations/hugging_face/rich_manager.py +0 -0
  495. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/integrations/hugging_face/tests/test_callbacks.py +0 -0
  496. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/integrations/hugging_face/utils.py +0 -0
  497. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/integrations/langchain/__init__.py +0 -0
  498. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/integrations/langchain/patch.py +0 -0
  499. deepeval-3.5.8/deepeval/metrics/argument_correctness/__init__.py → deepeval-3.8.0/deepeval/integrations/pydantic_ai/test_instrumentator.py +0 -0
  500. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/answer_relevancy/__init__.py +0 -0
  501. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/answer_relevancy/schema.py +0 -0
  502. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/arena_g_eval/__init__.py +0 -0
  503. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/arena_g_eval/schema.py +0 -0
  504. {deepeval-3.5.8/deepeval/metrics/conversation_completeness → deepeval-3.8.0/deepeval/metrics/argument_correctness}/__init__.py +0 -0
  505. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/argument_correctness/schema.py +0 -0
  506. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/bias/__init__.py +0 -0
  507. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/bias/schema.py +0 -0
  508. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/contextual_precision/__init__.py +0 -0
  509. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/contextual_precision/schema.py +0 -0
  510. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/contextual_recall/__init__.py +0 -0
  511. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/contextual_recall/schema.py +0 -0
  512. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/contextual_relevancy/__init__.py +0 -0
  513. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/contextual_relevancy/schema.py +0 -0
  514. {deepeval-3.5.8/deepeval/metrics/conversational_g_eval → deepeval-3.8.0/deepeval/metrics/conversation_completeness}/__init__.py +0 -0
  515. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/conversation_completeness/schema.py +0 -0
  516. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/conversational_dag/__init__.py +0 -0
  517. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/conversational_g_eval/schema.py +0 -0
  518. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/dag/__init__.py +0 -0
  519. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/dag/graph.py +0 -0
  520. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/dag/utils.py +0 -0
  521. {deepeval-3.5.8/deepeval/metrics/json_correctness → deepeval-3.8.0/deepeval/metrics/exact_match}/__init__.py +0 -0
  522. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/faithfulness/__init__.py +0 -0
  523. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/g_eval/__init__.py +0 -0
  524. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/g_eval/schema.py +0 -0
  525. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/hallucination/__init__.py +0 -0
  526. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/hallucination/schema.py +0 -0
  527. {deepeval-3.5.8/deepeval/metrics/knowledge_retention → deepeval-3.8.0/deepeval/metrics/json_correctness}/__init__.py +0 -0
  528. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/json_correctness/schema.py +0 -0
  529. {deepeval-3.5.8/deepeval/metrics/mcp → deepeval-3.8.0/deepeval/metrics/knowledge_retention}/__init__.py +0 -0
  530. {deepeval-3.5.8/deepeval/metrics/mcp_use_metric → deepeval-3.8.0/deepeval/metrics/mcp}/__init__.py +0 -0
  531. {deepeval-3.5.8/deepeval/metrics/multimodal_metrics/image_coherence → deepeval-3.8.0/deepeval/metrics/mcp_use_metric}/__init__.py +0 -0
  532. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/mcp_use_metric/schema.py +0 -0
  533. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/misuse/__init__.py +0 -0
  534. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/misuse/schema.py +0 -0
  535. {deepeval-3.5.8/deepeval/metrics/multimodal_metrics/image_editing → deepeval-3.8.0/deepeval/metrics/multimodal_metrics/image_coherence}/__init__.py +0 -0
  536. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/image_coherence/schema.py +0 -0
  537. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/image_coherence/template.py +0 -0
  538. {deepeval-3.5.8/deepeval/metrics/multimodal_metrics/image_helpfulness → deepeval-3.8.0/deepeval/metrics/multimodal_metrics/image_editing}/__init__.py +0 -0
  539. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/image_editing/schema.py +0 -0
  540. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/image_editing/template.py +0 -0
  541. {deepeval-3.5.8/deepeval/metrics/multimodal_metrics/image_reference → deepeval-3.8.0/deepeval/metrics/multimodal_metrics/image_helpfulness}/__init__.py +0 -0
  542. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/image_helpfulness/schema.py +0 -0
  543. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/image_helpfulness/template.py +0 -0
  544. {deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_answer_relevancy → deepeval-3.8.0/deepeval/metrics/multimodal_metrics/image_reference}/__init__.py +0 -0
  545. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/image_reference/schema.py +0 -0
  546. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/image_reference/template.py +0 -0
  547. {deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_contextual_precision → deepeval-3.8.0/deepeval/metrics/multimodal_metrics/text_to_image}/__init__.py +0 -0
  548. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/text_to_image/schema.py +0 -0
  549. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/multimodal_metrics/text_to_image/template.py +0 -0
  550. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/non_advice/__init__.py +0 -0
  551. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/non_advice/schema.py +0 -0
  552. {deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_contextual_recall → deepeval-3.8.0/deepeval/metrics/pattern_match}/__init__.py +0 -0
  553. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/pii_leakage/__init__.py +0 -0
  554. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/pii_leakage/schema.py +0 -0
  555. {deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_contextual_relevancy → deepeval-3.8.0/deepeval/metrics/prompt_alignment}/__init__.py +0 -0
  556. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/prompt_alignment/schema.py +0 -0
  557. {deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_faithfulness → deepeval-3.8.0/deepeval/metrics/role_adherence}/__init__.py +0 -0
  558. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/role_adherence/schema.py +0 -0
  559. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/role_violation/__init__.py +0 -0
  560. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/role_violation/schema.py +0 -0
  561. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/summarization/__init__.py +0 -0
  562. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/summarization/schema.py +0 -0
  563. {deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_g_eval → deepeval-3.8.0/deepeval/metrics/task_completion}/__init__.py +0 -0
  564. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/task_completion/schema.py +0 -0
  565. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/task_completion/template.py +0 -0
  566. {deepeval-3.5.8/deepeval/metrics/multimodal_metrics/multimodal_tool_correctness → deepeval-3.8.0/deepeval/metrics/tool_correctness}/__init__.py +0 -0
  567. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/toxicity/__init__.py +0 -0
  568. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/toxicity/schema.py +0 -0
  569. {deepeval-3.5.8/deepeval/metrics/multimodal_metrics/text_to_image → deepeval-3.8.0/deepeval/metrics/turn_contextual_precision}/__init__.py +0 -0
  570. {deepeval-3.5.8/deepeval/metrics/prompt_alignment → deepeval-3.8.0/deepeval/metrics/turn_contextual_recall}/__init__.py +0 -0
  571. {deepeval-3.5.8/deepeval/metrics/role_adherence → deepeval-3.8.0/deepeval/metrics/turn_contextual_relevancy}/__init__.py +0 -0
  572. {deepeval-3.5.8/deepeval/metrics/task_completion → deepeval-3.8.0/deepeval/metrics/turn_faithfulness}/__init__.py +0 -0
  573. {deepeval-3.5.8/deepeval/metrics/tool_correctness → deepeval-3.8.0/deepeval/metrics/turn_relevancy}/__init__.py +0 -0
  574. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/metrics/turn_relevancy/schema.py +0 -0
  575. {deepeval-3.5.8/deepeval/metrics/turn_relevancy → deepeval-3.8.0/deepeval/model_integrations}/__init__.py +0 -0
  576. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/models/_summac_model.py +0 -0
  577. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/models/answer_relevancy_model.py +0 -0
  578. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/models/detoxify_model.py +0 -0
  579. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/models/hallucination_model.py +0 -0
  580. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/models/summac_model.py +0 -0
  581. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/models/unbias_model.py +0 -0
  582. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/plugins/__init__.py +0 -0
  583. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/plugins/plugin.py +0 -0
  584. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/progress_context.py +0 -0
  585. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/py.typed +0 -0
  586. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/red_teaming/README.md +0 -0
  587. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/scorer/__init__.py +0 -0
  588. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/scorer/scorer.py +0 -0
  589. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/simulator/__init__.py +0 -0
  590. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/simulator/schema.py +0 -0
  591. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/singleton.py +0 -0
  592. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/synthesizer/__init__.py +0 -0
  593. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/synthesizer/base_synthesizer.py +0 -0
  594. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/synthesizer/chunking/__init__.py +0 -0
  595. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/synthesizer/types.py +0 -0
  596. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/synthesizer/utils.py +0 -0
  597. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/test_case/mcp.py +0 -0
  598. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/test_run/hooks.py +0 -0
  599. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/offline_evals/__init__.py +0 -0
  600. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/offline_evals/api.py +0 -0
  601. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/offline_evals/span.py +0 -0
  602. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/offline_evals/thread.py +0 -0
  603. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/offline_evals/trace.py +0 -0
  604. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/otel/__init__.py +0 -0
  605. {deepeval-3.5.8 → deepeval-3.8.0}/deepeval/tracing/perf_epoch_bridge.py +0 -0
@@ -1,6 +1,6 @@
1
1
  Metadata-Version: 2.1
2
2
  Name: deepeval
3
- Version: 3.5.8
3
+ Version: 3.8.0
4
4
  Summary: The LLM Evaluation Framework
5
5
  Home-page: https://github.com/confident-ai/deepeval
6
6
  License: Apache-2.0
@@ -13,26 +13,23 @@ Classifier: Programming Language :: Python :: 3.9
13
13
  Classifier: Programming Language :: Python :: 3.10
14
14
  Classifier: Programming Language :: Python :: 3.11
15
15
  Requires-Dist: aiohttp
16
- Requires-Dist: anthropic
17
16
  Requires-Dist: click (>=8.0.0,<8.3.0)
18
- Requires-Dist: google-genai (>=1.9.0,<2.0.0)
19
17
  Requires-Dist: grpcio (>=1.67.1,<2.0.0)
20
18
  Requires-Dist: jinja2
21
19
  Requires-Dist: nest_asyncio
22
- Requires-Dist: ollama
23
20
  Requires-Dist: openai
24
21
  Requires-Dist: opentelemetry-api (>=1.24.0,<2.0.0)
25
22
  Requires-Dist: opentelemetry-exporter-otlp-proto-grpc (>=1.24.0,<2.0.0)
26
23
  Requires-Dist: opentelemetry-sdk (>=1.24.0,<2.0.0)
27
24
  Requires-Dist: portalocker
28
- Requires-Dist: posthog (>=6.3.0,<7.0.0)
25
+ Requires-Dist: posthog (>=5.4.0,<6.0.0)
29
26
  Requires-Dist: pydantic (>=2.11.7,<3.0.0)
30
27
  Requires-Dist: pydantic-settings (>=2.10.1,<3.0.0)
31
28
  Requires-Dist: pyfiglet
32
29
  Requires-Dist: pytest
33
30
  Requires-Dist: pytest-asyncio
34
31
  Requires-Dist: pytest-repeat
35
- Requires-Dist: pytest-rerunfailures (>=12.0,<13.0)
32
+ Requires-Dist: pytest-rerunfailures
36
33
  Requires-Dist: pytest-xdist
37
34
  Requires-Dist: python-dotenv (>=1.1.1,<2.0.0)
38
35
  Requires-Dist: requests (>=2.31.0,<3.0.0)
@@ -103,9 +100,9 @@ Description-Content-Type: text/markdown
103
100
  <a href="https://www.readme-i18n.com/confident-ai/deepeval?lang=zh">中文</a>
104
101
  </p>
105
102
 
106
- **DeepEval** is a simple-to-use, open-source LLM evaluation framework, for evaluating and testing large-language model systems. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, RAGAS, etc., which uses LLMs and various other NLP models that runs **locally on your machine** for evaluation.
103
+ **DeepEval** is a simple-to-use, open-source LLM evaluation framework, for evaluating and testing large-language model systems. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, task completion, answer relevancy, hallucination, etc., which uses LLM-as-a-judge and other NLP models that run **locally on your machine** for evaluation.
107
104
 
108
- Whether your LLM applications are RAG pipelines, chatbots, AI agents, implemented via LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal models, prompts, and architecture to improve your RAG pipeline, agentic workflows, prevent prompt drifting, or even transition from OpenAI to hosting your own Deepseek R1 with confidence.
105
+ Whether your LLM applications are AI agents, RAG pipelines, or chatbots, implemented via LangChain or OpenAI, DeepEval has you covered. With it, you can easily determine the optimal models, prompts, and architecture to improve your RAG pipeline, agentic workflows, prevent prompt drifting, or even transition from OpenAI to hosting your own Deepseek R1 with confidence.
109
106
 
110
107
  > [!IMPORTANT]
111
108
  > Need a place for your DeepEval testing data to live 🏡❤️? [Sign up to the DeepEval platform](https://confident-ai.com?utm_source=GitHub) to compare iterations of your LLM app, generate & share testing reports, and more.
@@ -118,10 +115,10 @@ Whether your LLM applications are RAG pipelines, chatbots, AI agents, implemente
118
115
 
119
116
  # 🔥 Metrics and Features
120
117
 
121
- > 🥳 You can now share DeepEval's test results on the cloud directly on [Confident AI](https://confident-ai.com?utm_source=GitHub)'s infrastructure
118
+ > 🥳 You can now share DeepEval's test results on the cloud, directly on [Confident AI](https://confident-ai.com?utm_source=GitHub)
122
119
 
123
120
  - Supports both end-to-end and component-level LLM evaluation.
124
- - Large variety of ready-to-use LLM evaluation metrics (all with explanations) powered by **ANY** LLM of your choice, statistical methods, or NLP models that runs **locally on your machine**:
121
+ - Large variety of ready-to-use LLM evaluation metrics (all with explanations) powered by **ANY** LLM of your choice, statistical methods, or NLP models that run **locally on your machine**:
125
122
  - G-Eval
126
123
  - DAG ([deep acyclic graph](https://deepeval.com/docs/metrics-dag))
127
124
  - **RAG metrics:**
@@ -161,7 +158,7 @@ Whether your LLM applications are RAG pipelines, chatbots, AI agents, implemente
161
158
  - TruthfulQA
162
159
  - HumanEval
163
160
  - GSM8K
164
- - [100% integrated with Confident AI](https://confident-ai.com?utm_source=GitHub) for the full evaluation lifecycle:
161
+ - [100% integrated with Confident AI](https://confident-ai.com?utm_source=GitHub) for the full evaluation & observability lifecycle:
165
162
  - Curate/annotate evaluation datasets on the cloud
166
163
  - Benchmark LLM app using dataset, and compare with previous iterations to experiment which models/prompts works best
167
164
  - Fine-tune metrics for custom results
@@ -170,7 +167,7 @@ Whether your LLM applications are RAG pipelines, chatbots, AI agents, implemente
170
167
  - Repeat until perfection
171
168
 
172
169
  > [!NOTE]
173
- > Confident AI is the DeepEval platform. Create an account [here.](https://app.confident-ai.com?utm_source=GitHub)
170
+ > DeepEval is available on Confident AI, an LLM evals platform for AI observability and quality. Create an account [here.](https://app.confident-ai.com?utm_source=GitHub)
174
171
 
175
172
  <br />
176
173
 
@@ -359,7 +356,7 @@ for golden in dataset.goldens:
359
356
 
360
357
  @pytest.mark.parametrize(
361
358
  "test_case",
362
- dataset,
359
+ dataset.test_cases,
363
360
  )
364
361
  def test_customer_chatbot(test_case: LLMTestCase):
365
362
  answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
@@ -397,7 +394,7 @@ cp .env.example .env.local
397
394
 
398
395
  # DeepEval With Confident AI
399
396
 
400
- DeepEval's cloud platform, [Confident AI](https://confident-ai.com?utm_source=Github), allows you to:
397
+ DeepEval is available on [Confident AI](https://confident-ai.com?utm_source=Github), an evals & observability platform that allows you to:
401
398
 
402
399
  1. Curate/annotate evaluation datasets on the cloud
403
400
  2. Benchmark LLM app using dataset, and compare with previous iterations to experiment which models/prompts works best
@@ -439,6 +436,7 @@ Using `.env.local` or `.env` is optional. If they are missing, DeepEval uses you
439
436
  ```bash
440
437
  cp .env.example .env.local
441
438
  # then edit .env.local (ignored by git)
439
+ ```
442
440
 
443
441
  <br />
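To make the metric list in the README above concrete, here is a minimal, hedged sketch of scoring a single test case with one of the ready-to-use metrics. The input/output strings are invented for illustration, and an evaluation model (for example an OpenAI key loaded from `.env`) is assumed to be configured.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Illustrative test case; in practice actual_output comes from your LLM app.
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
)
metric = AnswerRelevancyMetric(threshold=0.5)

# Score a single case directly and inspect the verdict...
metric.measure(test_case)
print(metric.score, metric.reason)

# ...or run a full evaluation pass over a list of cases.
evaluate(test_cases=[test_case], metrics=[metric])
```

The same pattern applies to G-Eval, the RAG metrics, and the other metrics listed above.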
444
442
 
@@ -53,9 +53,9 @@
53
53
  <a href="https://www.readme-i18n.com/confident-ai/deepeval?lang=zh">中文</a>
54
54
  </p>
55
55
 
56
- **DeepEval** is a simple-to-use, open-source LLM evaluation framework, for evaluating and testing large-language model systems. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, hallucination, answer relevancy, RAGAS, etc., which uses LLMs and various other NLP models that runs **locally on your machine** for evaluation.
56
+ **DeepEval** is a simple-to-use, open-source LLM evaluation framework for evaluating and testing large language model systems. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as G-Eval, task completion, answer relevancy, hallucination, etc., which use LLM-as-a-judge and other NLP models that run **locally on your machine** for evaluation.
57
57
 
58
- Whether your LLM applications are RAG pipelines, chatbots, AI agents, implemented via LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal models, prompts, and architecture to improve your RAG pipeline, agentic workflows, prevent prompt drifting, or even transition from OpenAI to hosting your own Deepseek R1 with confidence.
58
+ Whether your LLM applications are AI agents, RAG pipelines, or chatbots, implemented via LangChain or OpenAI, DeepEval has you covered. With it, you can easily determine the optimal models, prompts, and architecture to improve your RAG pipelines and agentic workflows, prevent prompt drifting, or even transition from OpenAI to hosting your own Deepseek R1 with confidence.
59
59
 
60
60
  > [!IMPORTANT]
61
61
  > Need a place for your DeepEval testing data to live 🏡❤️? [Sign up to the DeepEval platform](https://confident-ai.com?utm_source=GitHub) to compare iterations of your LLM app, generate & share testing reports, and more.
@@ -68,10 +68,10 @@ Whether your LLM applications are RAG pipelines, chatbots, AI agents, implemente
68
68
 
69
69
  # 🔥 Metrics and Features
70
70
 
71
- > 🥳 You can now share DeepEval's test results on the cloud directly on [Confident AI](https://confident-ai.com?utm_source=GitHub)'s infrastructure
71
+ > 🥳 You can now share DeepEval's test results on the cloud, directly on [Confident AI](https://confident-ai.com?utm_source=GitHub)
72
72
 
73
73
  - Supports both end-to-end and component-level LLM evaluation.
74
- - Large variety of ready-to-use LLM evaluation metrics (all with explanations) powered by **ANY** LLM of your choice, statistical methods, or NLP models that runs **locally on your machine**:
74
+ - Large variety of ready-to-use LLM evaluation metrics (all with explanations) powered by **ANY** LLM of your choice, statistical methods, or NLP models that run **locally on your machine**:
75
75
  - G-Eval
76
76
  - DAG ([deep acyclic graph](https://deepeval.com/docs/metrics-dag))
77
77
  - **RAG metrics:**
@@ -111,7 +111,7 @@ Whether your LLM applications are RAG pipelines, chatbots, AI agents, implemente
111
111
  - TruthfulQA
112
112
  - HumanEval
113
113
  - GSM8K
114
- - [100% integrated with Confident AI](https://confident-ai.com?utm_source=GitHub) for the full evaluation lifecycle:
114
+ - [100% integrated with Confident AI](https://confident-ai.com?utm_source=GitHub) for the full evaluation & observability lifecycle:
115
115
  - Curate/annotate evaluation datasets on the cloud
116
116
  - Benchmark LLM app using dataset, and compare with previous iterations to experiment which models/prompts works best
117
117
  - Fine-tune metrics for custom results
@@ -120,7 +120,7 @@ Whether your LLM applications are RAG pipelines, chatbots, AI agents, implemente
120
120
  - Repeat until perfection
121
121
 
122
122
  > [!NOTE]
123
- > Confident AI is the DeepEval platform. Create an account [here.](https://app.confident-ai.com?utm_source=GitHub)
123
+ > DeepEval is available on Confident AI, an LLM evals platform for AI observability and quality. Create an account [here.](https://app.confident-ai.com?utm_source=GitHub)
124
124
 
125
125
  <br />
126
126
 
@@ -309,7 +309,7 @@ for golden in dataset.goldens:
309
309
 
310
310
  @pytest.mark.parametrize(
311
311
  "test_case",
312
- dataset,
312
+ dataset.test_cases,
313
313
  )
314
314
  def test_customer_chatbot(test_case: LLMTestCase):
315
315
  answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
@@ -347,7 +347,7 @@ cp .env.example .env.local
347
347
 
348
348
  # DeepEval With Confident AI
349
349
 
350
- DeepEval's cloud platform, [Confident AI](https://confident-ai.com?utm_source=Github), allows you to:
350
+ DeepEval is available on [Confident AI](https://confident-ai.com?utm_source=Github), an evals & observability platform that allows you to:
351
351
 
352
352
  1. Curate/annotate evaluation datasets on the cloud
353
353
  2. Benchmark LLM app using dataset, and compare with previous iterations to experiment which models/prompts works best
@@ -389,6 +389,7 @@ Using `.env.local` or `.env` is optional. If they are missing, DeepEval uses you
389
389
  ```bash
390
390
  cp .env.example .env.local
391
391
  # then edit .env.local (ignored by git)
392
+ ```
392
393
 
393
394
  <br />
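The `dataset` → `dataset.test_cases` change in the `@pytest.mark.parametrize` hunks above is easy to miss, so the sketch below spells out the updated pattern end to end. The golden and the hard-coded answer are placeholders; in practice `actual_output` comes from your own LLM app.

```python
import pytest
from deepeval import assert_test
from deepeval.dataset import EvaluationDataset, Golden
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# A one-golden dataset; real goldens are usually curated or pulled elsewhere.
dataset = EvaluationDataset(goldens=[Golden(input="What's your refund policy?")])

for golden in dataset.goldens:
    # Replace the hard-coded string with a call to your actual LLM app.
    dataset.add_test_case(
        LLMTestCase(input=golden.input, actual_output="Refunds within 30 days.")
    )


# Parametrize over dataset.test_cases, not the dataset object itself.
@pytest.mark.parametrize("test_case", dataset.test_cases)
def test_customer_chatbot(test_case: LLMTestCase):
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.5)])
```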
394
395
 
@@ -1,24 +1,56 @@
1
+ from __future__ import annotations
2
+
3
+ import logging
1
4
  import os
2
- import warnings
3
5
  import re
6
+ import warnings
4
7
 
5
- # load environment variables before other imports
8
+ # IMPORTANT: load environment variables before other imports
6
9
  from deepeval.config.settings import autoload_dotenv, get_settings
7
10
 
11
+ logging.getLogger("deepeval").addHandler(logging.NullHandler())
8
12
  autoload_dotenv()
9
13
 
10
- from ._version import __version__
11
- from deepeval.evaluate import evaluate, assert_test
12
- from deepeval.evaluate.compare import compare
13
- from deepeval.test_run import on_test_run_end, log_hyperparameters
14
- from deepeval.utils import login
15
- from deepeval.telemetry import *
14
+
15
+ def _expose_public_api() -> None:
16
+ # All other imports must happen after env is loaded
17
+ # Do not do this at module level or ruff will complain with E402
18
+ global __version__, evaluate, assert_test, compare
19
+ global on_test_run_end, log_hyperparameters, login, telemetry
20
+
21
+ from ._version import __version__ as _version
22
+ from deepeval.evaluate import (
23
+ evaluate as _evaluate,
24
+ assert_test as _assert_test,
25
+ )
26
+ from deepeval.evaluate.compare import compare as _compare
27
+ from deepeval.test_run import (
28
+ on_test_run_end as _on_end,
29
+ log_hyperparameters as _log_hparams,
30
+ )
31
+ from deepeval.utils import login as _login
32
+ import deepeval.telemetry as _telemetry
33
+
34
+ __version__ = _version
35
+ evaluate = _evaluate
36
+ assert_test = _assert_test
37
+ compare = _compare
38
+ on_test_run_end = _on_end
39
+ log_hyperparameters = _log_hparams
40
+ login = _login
41
+ telemetry = _telemetry
42
+
43
+
44
+ _expose_public_api()
16
45
 
17
46
 
18
47
  settings = get_settings()
48
+
19
49
  if not settings.DEEPEVAL_GRPC_LOGGING:
20
- os.environ.setdefault("GRPC_VERBOSITY", "ERROR")
21
- os.environ.setdefault("GRPC_TRACE", "")
50
+ if os.getenv("GRPC_VERBOSITY") is None:
51
+ os.environ["GRPC_VERBOSITY"] = settings.GRPC_VERBOSITY or "ERROR"
52
+ if os.getenv("GRPC_TRACE") is None:
53
+ os.environ["GRPC_TRACE"] = settings.GRPC_TRACE or ""
22
54
 
23
55
 
24
56
  __all__ = [
@@ -70,9 +102,5 @@ def update_warning_opt_in():
70
102
  return os.getenv("DEEPEVAL_UPDATE_WARNING_OPT_IN") == "1"
71
103
 
72
104
 
73
- def is_read_only_env():
74
- return os.getenv("DEEPEVAL_FILE_SYSTEM") == "READ_ONLY"
75
-
76
-
77
105
  if update_warning_opt_in():
78
106
  check_for_update()
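For context on the reworked `deepeval/__init__.py` above: `autoload_dotenv()` runs before anything else, then `_expose_public_api()` rebinds the usual top-level names, so nothing changes for callers. A small sketch of typical usage under that assumption:

```python
import deepeval
from deepeval import assert_test, compare, evaluate, login

# Environment variables were loaded by autoload_dotenv() at import time,
# and _expose_public_api() re-exported the public entry points.
print(deepeval.__version__)  # "3.8.0"
print(all(callable(f) for f in (evaluate, assert_test, compare, login)))
```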
@@ -0,0 +1 @@
1
+ __version__: str = "3.8.0"
@@ -0,0 +1,19 @@
1
+ try:
2
+ import anthropic # noqa: F401
3
+ except ImportError:
4
+ raise ModuleNotFoundError(
5
+ "Please install anthropic to use this feature: 'pip install anthropic'"
6
+ )
7
+
8
+ try:
9
+ from anthropic import Anthropic, AsyncAnthropic # noqa: F401
10
+ except ImportError:
11
+ Anthropic = None # type: ignore
12
+ AsyncAnthropic = None # type: ignore
13
+
14
+ if Anthropic or AsyncAnthropic:
15
+ from deepeval.anthropic.patch import patch_anthropic_classes
16
+ from deepeval.telemetry import capture_tracing_integration
17
+
18
+ with capture_tracing_integration("anthropic"):
19
+ patch_anthropic_classes()
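A hedged usage sketch for the new `deepeval.anthropic` subpackage: importing it patches `Messages.create` (and its async counterpart) so each call is traced as an LLM span. The model name and prompt below are placeholders, and `ANTHROPIC_API_KEY` is assumed to be set in the environment.

```python
import deepeval.anthropic  # side effect: patch_anthropic_classes()
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(  # now traced as an "llm" span
    model="claude-3-5-sonnet-latest",
    max_tokens=256,
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.content[0].text)
```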
@@ -0,0 +1,94 @@
1
+ from anthropic.types.message import Message
2
+ from anthropic.types import ToolUseBlock
3
+ from typing import Any, Dict
4
+
5
+ from deepeval.anthropic.utils import (
6
+ render_messages_anthropic,
7
+ stringify_anthropic_content,
8
+ )
9
+ from deepeval.model_integrations.types import InputParameters, OutputParameters
10
+ from deepeval.test_case.llm_test_case import ToolCall
11
+
12
+
13
+ def safe_extract_input_parameters(kwargs: Dict[str, Any]) -> InputParameters:
14
+ # guarding against errors to be compatible with legacy APIs
15
+ try:
16
+ return extract_messages_api_input_parameters(kwargs)
17
+ except Exception:
18
+ return InputParameters(model="NA")
19
+
20
+
21
+ def extract_messages_api_input_parameters(
22
+ kwargs: Dict[str, Any],
23
+ ) -> InputParameters:
24
+ model = kwargs.get("model")
25
+ tools = kwargs.get("tools")
26
+ messages = kwargs.get("messages")
27
+ tool_descriptions = (
28
+ {tool["name"]: tool["description"] for tool in tools}
29
+ if tools is not None
30
+ else None
31
+ )
32
+
33
+ input_argument = ""
34
+ user_messages = []
35
+ for message in messages:
36
+ role = message["role"]
37
+ if role == "user":
38
+ user_messages.append(message["content"])
39
+ if len(user_messages) > 0:
40
+ input_argument = user_messages[0]
41
+
42
+ return InputParameters(
43
+ model=model,
44
+ input=stringify_anthropic_content(input_argument),
45
+ messages=render_messages_anthropic(messages),
46
+ tools=tools,
47
+ tool_descriptions=tool_descriptions,
48
+ )
49
+
50
+
51
+ def safe_extract_output_parameters(
52
+ message_response: Message,
53
+ input_parameters: InputParameters,
54
+ ) -> OutputParameters:
55
+ # guarding against errors to be compatible with legacy APIs
56
+ try:
57
+ return extract_messages_api_output_parameters(
58
+ message_response, input_parameters
59
+ )
60
+ except Exception:
61
+ return OutputParameters()
62
+
63
+
64
+ def extract_messages_api_output_parameters(
65
+ message_response: Message,
66
+ input_parameters: InputParameters,
67
+ ) -> OutputParameters:
68
+ output = str(message_response.content[0].text)
69
+ prompt_tokens = message_response.usage.input_tokens
70
+ completion_tokens = message_response.usage.output_tokens
71
+
72
+ tools_called = None
73
+ anthropic_tool_calls = [
74
+ block
75
+ for block in message_response.content
76
+ if isinstance(block, ToolUseBlock)
77
+ ]
78
+ if anthropic_tool_calls:
79
+ tools_called = []
80
+ tool_descriptions = input_parameters.tool_descriptions or {}
81
+ for tool_call in anthropic_tool_calls:
82
+ tools_called.append(
83
+ ToolCall(
84
+ name=tool_call.name,
85
+ input_parameters=tool_call.input,
86
+ description=tool_descriptions.get(tool_call.name),
87
+ )
88
+ )
89
+ return OutputParameters(
90
+ output=output,
91
+ prompt_tokens=prompt_tokens,
92
+ completion_tokens=completion_tokens,
93
+ tools_called=tools_called,
94
+ )
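To show what the extractors above pull out of a `Messages.create` call, here is a small sketch with made-up kwargs; the model name and tool schema are illustrative only, and the printed values reflect the fields populated by `extract_messages_api_input_parameters`.

```python
from deepeval.anthropic.extractors import safe_extract_input_parameters

kwargs = {
    "model": "claude-3-5-sonnet-latest",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [
        {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "input_schema": {"type": "object", "properties": {}},
        }
    ],
}

params = safe_extract_input_parameters(kwargs)
print(params.model)              # "claude-3-5-sonnet-latest"
print(params.input)              # first user message, stringified
print(params.tool_descriptions)  # {"get_weather": "Look up current weather for a city."}
```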
@@ -0,0 +1,169 @@
1
+ from typing import Callable
2
+ from functools import wraps
3
+
4
+ from deepeval.anthropic.extractors import (
5
+ safe_extract_input_parameters,
6
+ safe_extract_output_parameters,
7
+ InputParameters,
8
+ )
9
+ from deepeval.model_integrations.utils import _update_all_attributes
10
+ from deepeval.tracing import observe
11
+ from deepeval.tracing.trace_context import current_llm_context
12
+
13
+ _ORIGINAL_METHODS = {}
14
+ _ANTHROPIC_PATCHED = False
15
+
16
+
17
+ def patch_anthropic_classes():
18
+ """
19
+ Monkey patch Anthropic resource classes directly.
20
+ """
21
+ global _ANTHROPIC_PATCHED
22
+
23
+ # Single guard - if already patched, return immediately
24
+ if _ANTHROPIC_PATCHED:
25
+ return
26
+
27
+ try:
28
+ from anthropic.resources.messages import Messages, AsyncMessages
29
+
30
+ # Store original methods before patching
31
+ if hasattr(Messages, "create"):
32
+ _ORIGINAL_METHODS["Messages.create"] = Messages.create
33
+ Messages.create = _create_sync_wrapper(Messages.create)
34
+
35
+ if hasattr(AsyncMessages, "create"):
36
+ _ORIGINAL_METHODS["AsyncMessages.create"] = AsyncMessages.create
37
+ AsyncMessages.create = _create_async_wrapper(AsyncMessages.create)
38
+
39
+ except ImportError:
40
+ pass
41
+
42
+ _ANTHROPIC_PATCHED = True
43
+
44
+
45
+ def _create_sync_wrapper(original_method):
46
+ """
47
+ Create a wrapper for sync methods - called ONCE during patching.
48
+ """
49
+
50
+ @wraps(original_method)
51
+ def method_wrapper(self, *args, **kwargs):
52
+ bound_method = original_method.__get__(self, type(self))
53
+ patched = _patch_sync_anthropic_client_method(
54
+ original_method=bound_method
55
+ )
56
+ return patched(*args, **kwargs)
57
+
58
+ return method_wrapper
59
+
60
+
61
+ def _create_async_wrapper(original_method):
62
+ """
63
+ Create a wrapper for async methods - called ONCE during patching.
64
+ """
65
+
66
+ @wraps(original_method)
67
+ def method_wrapper(self, *args, **kwargs):
68
+ bound_method = original_method.__get__(self, type(self))
69
+ patched = _patch_async_anthropic_client_method(
70
+ original_method=bound_method
71
+ )
72
+ return patched(*args, **kwargs)
73
+
74
+ return method_wrapper
75
+
76
+
77
+ def _patch_sync_anthropic_client_method(original_method: Callable):
78
+ @wraps(original_method)
79
+ def patched_sync_anthropic_method(*args, **kwargs):
80
+ input_parameters: InputParameters = safe_extract_input_parameters(
81
+ kwargs
82
+ )
83
+ llm_context = current_llm_context.get()
84
+
85
+ @observe(
86
+ type="llm",
87
+ model=input_parameters.model,
88
+ metrics=llm_context.metrics,
89
+ metric_collection=llm_context.metric_collection,
90
+ )
91
+ def llm_generation(*args, **kwargs):
92
+ messages_api_response = original_method(*args, **kwargs)
93
+ output_parameters = safe_extract_output_parameters(
94
+ messages_api_response, input_parameters
95
+ )
96
+ _update_all_attributes(
97
+ input_parameters,
98
+ output_parameters,
99
+ llm_context.expected_tools,
100
+ llm_context.expected_output,
101
+ llm_context.context,
102
+ llm_context.retrieval_context,
103
+ )
104
+ return messages_api_response
105
+
106
+ return llm_generation(*args, **kwargs)
107
+
108
+ return patched_sync_anthropic_method
109
+
110
+
111
+ def _patch_async_anthropic_client_method(original_method: Callable):
112
+ @wraps(original_method)
113
+ async def patched_async_anthropic_method(*args, **kwargs):
114
+ input_parameters: InputParameters = safe_extract_input_parameters(
115
+ kwargs
116
+ )
117
+ llm_context = current_llm_context.get()
118
+
119
+ @observe(
120
+ type="llm",
121
+ model=input_parameters.model,
122
+ metrics=llm_context.metrics,
123
+ metric_collection=llm_context.metric_collection,
124
+ )
125
+ async def llm_generation(*args, **kwargs):
126
+ messages_api_response = await original_method(*args, **kwargs)
127
+ output_parameters = safe_extract_output_parameters(
128
+ messages_api_response, input_parameters
129
+ )
130
+ _update_all_attributes(
131
+ input_parameters,
132
+ output_parameters,
133
+ llm_context.expected_tools,
134
+ llm_context.expected_output,
135
+ llm_context.context,
136
+ llm_context.retrieval_context,
137
+ )
138
+ return messages_api_response
139
+
140
+ return await llm_generation(*args, **kwargs)
141
+
142
+ return patched_async_anthropic_method
143
+
144
+
145
+ def unpatch_anthropic_classes():
146
+ """
147
+ Restore Anthropic resource classes to their original state.
148
+ """
149
+ global _ANTHROPIC_PATCHED
150
+
151
+ # If not patched, nothing to do
152
+ if not _ANTHROPIC_PATCHED:
153
+ return
154
+
155
+ try:
156
+ from anthropic.resources.messages import Messages, AsyncMessages
157
+
158
+ # Restore original methods for Messages
159
+ if hasattr(Messages, "create"):
160
+ Messages.create = _ORIGINAL_METHODS["Messages.create"]
161
+
162
+ if hasattr(AsyncMessages, "create"):
163
+ AsyncMessages.create = _ORIGINAL_METHODS["AsyncMessages.create"]
164
+
165
+ except ImportError:
166
+ pass
167
+
168
+ # Reset the patched flag
169
+ _ANTHROPIC_PATCHED = False
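Finally, a hedged sketch of the patch lifecycle defined above: patch, call the Anthropic client inside an `@observe`-decorated component so the traced LLM span nests under it, then unpatch to restore the original methods. The span type, model, and prompt are assumptions for illustration.

```python
from anthropic import Anthropic
from deepeval.anthropic.patch import (
    patch_anthropic_classes,
    unpatch_anthropic_classes,
)
from deepeval.tracing import observe

patch_anthropic_classes()  # idempotent: guarded by _ANTHROPIC_PATCHED


@observe(type="agent")
def answer(question: str) -> str:
    client = Anthropic()
    response = client.messages.create(  # wrapped: recorded as an "llm" span
        model="claude-3-5-sonnet-latest",
        max_tokens=128,
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text


answer("Summarize what monkey patching is in one sentence.")
unpatch_anthropic_classes()  # restore the original Messages.create methods
```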