PyPI - arize-phoenix - Versions diffs - 4.12.1rc1__tar.gz → 4.15.0__tar.gz - Mend

arize-phoenix 4.12.1rc1__tar.gz → 4.15.0__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of arize-phoenix might be problematic.

Files changed (293)

{arize_phoenix-4.12.1rc1 → arize_phoenix-4.15.0}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.3
 Name: arize-phoenix
-Version: 4.12.1rc1
+Version: 4.15.0
 Summary: AI Observability and Evaluation
 Project-URL: Documentation, https://docs.arize.com/phoenix/
 Project-URL: Issues, https://github.com/Arize-ai/phoenix/issues
@@ -48,7 +48,7 @@ Requires-Dist: scipy
 Requires-Dist: sqlalchemy[asyncio]<3,>=2.0.4
 Requires-Dist: sqlean-py>=3.45.1
 Requires-Dist: starlette
-Requires-Dist: strawberry-graphql==0.235.0
+Requires-Dist: strawberry-graphql==0.236.0
 Requires-Dist: tqdm
 Requires-Dist: typing-extensions>=4.5; python_version < '3.12'
 Requires-Dist: typing-extensions>=4.6; python_version >= '3.12'
@@ -65,11 +65,12 @@ Requires-Dist: opentelemetry-sdk; extra == 'container'
 Requires-Dist: opentelemetry-semantic-conventions; extra == 'container'
 Requires-Dist: prometheus-client; extra == 'container'
 Requires-Dist: py-grpc-prometheus; extra == 'container'
-Requires-Dist: strawberry-graphql[opentelemetry]==0.235.0; extra == 'container'
+Requires-Dist: strawberry-graphql[opentelemetry]==0.236.0; extra == 'container'
 Requires-Dist: uvloop; (platform_system != 'Windows') and extra == 'container'
 Provides-Extra: dev
 Requires-Dist: anthropic; extra == 'dev'
 Requires-Dist: arize[autoembeddings,llm-evaluation]; extra == 'dev'
+Requires-Dist: asgi-lifespan; extra == 'dev'
 Requires-Dist: asyncpg; extra == 'dev'
 Requires-Dist: gcsfs; extra == 'dev'
 Requires-Dist: google-cloud-aiplatform>=1.3; extra == 'dev'
@@ -78,6 +79,7 @@ Requires-Dist: jupyter; extra == 'dev'
 Requires-Dist: langchain>=0.0.334; extra == 'dev'
 Requires-Dist: litellm>=1.0.3; extra == 'dev'
 Requires-Dist: llama-index>=0.10.3; extra == 'dev'
+Requires-Dist: mypy==1.11.0; extra == 'dev'
 Requires-Dist: nbqa; extra == 'dev'
 Requires-Dist: pandas-stubs==2.0.3.230814; (python_version < '3.9') and extra == 'dev'
 Requires-Dist: pandas-stubs==2.2.2.240603; (python_version >= '3.9') and extra == 'dev'
@@ -88,9 +90,9 @@ Requires-Dist: psycopg[binary]; extra == 'dev'
 Requires-Dist: pytest-asyncio; extra == 'dev'
 Requires-Dist: pytest-cov; extra == 'dev'
 Requires-Dist: pytest-postgresql; extra == 'dev'
-Requires-Dist: pytest==8.2.2; extra == 'dev'
-Requires-Dist: ruff==0.4.9; extra == 'dev'
-Requires-Dist: strawberry-graphql[debug-server,opentelemetry]==0.235.0; extra == 'dev'
+Requires-Dist: pytest==8.3.1; extra == 'dev'
+Requires-Dist: ruff==0.5.4; extra == 'dev'
+Requires-Dist: strawberry-graphql[debug-server,opentelemetry]==0.236.0; extra == 'dev'
 Requires-Dist: tabulate; extra == 'dev'
 Requires-Dist: types-tabulate; extra == 'dev'
 Provides-Extra: evals
@@ -138,6 +140,8 @@ Phoenix is an open-source AI observability platform designed for experimentation
 -   **_Tracing_** - Trace your LLM application's runtime using OpenTelemetry-based instrumentation.
 -   **_Evaluation_** - Leverage LLMs to benchmark your application's performance using response and retrieval evals.
+-   **_Datasets_** -  Create versioned datasets of examples for experimentation, evaluation, and fine-tuning.
+-   **_Experiments_** -  Track and evaluate changes to prompts, LLMs, and retrieval.
 -   **_Inference Analysis_** - Visualize inferences and embeddings using dimensionality reduction and clustering to identify drift and performance degradation.
 Phoenix is vendor and language agnostic with out-of-the-box support for popular frameworks (🦙LlamaIndex, 🦜⛓LangChain, 🧩DSPy) and LLM providers (OpenAI, Bedrock, and more). For details on auto-instrumentation, check out the [OpenInference](https://github.com/Arize-ai/openinference) project.

{arize_phoenix-4.12.1rc1 → arize_phoenix-4.15.0}/README.md RENAMED

@@ -31,6 +31,8 @@ Phoenix is an open-source AI observability platform designed for experimentation
 -   **_Tracing_** - Trace your LLM application's runtime using OpenTelemetry-based instrumentation.
 -   **_Evaluation_** - Leverage LLMs to benchmark your application's performance using response and retrieval evals.
+-   **_Datasets_** -  Create versioned datasets of examples for experimentation, evaluation, and fine-tuning.
+-   **_Experiments_** -  Track and evaluate changes to prompts, LLMs, and retrieval.
 -   **_Inference Analysis_** - Visualize inferences and embeddings using dimensionality reduction and clustering to identify drift and performance degradation.
 Phoenix is vendor and language agnostic with out-of-the-box support for popular frameworks (🦙LlamaIndex, 🦜⛓LangChain, 🧩DSPy) and LLM providers (OpenAI, Bedrock, and more). For details on auto-instrumentation, check out the [OpenInference](https://github.com/Arize-ai/openinference) project.

{arize_phoenix-4.12.1rc1 → arize_phoenix-4.15.0}/pyproject.toml RENAMED

@@ -31,7 +31,7 @@ dependencies = [
   "starlette",
   "uvicorn",
   "psutil",
-  "strawberry-graphql==0.235.0",  # need to pin version because we're monkey-patching
+  "strawberry-graphql==0.236.0",  # need to pin version because we're monkey-patching
   "pyarrow",
   "typing-extensions>=4.5; python_version<'3.12'",
   # A minimum version of typing-extensions==4.6.0 is needed to avoid this issue on Python 3.12: https://github.com/Azure/azure-sdk-for-python/issues/33442#issuecomment-1847886784
@@ -70,19 +70,20 @@ dev = [
   "hatch",
   "jupyter",
   "nbqa",
-  "ruff==0.4.9",
+  "ruff==0.5.4",
+  "mypy==1.11.0",
   "pandas>=1.0",
   "tabulate",  # used by DataFrame.to_markdown()
   "types-tabulate",
   "pandas-stubs==2.2.2.240603; python_version>='3.9'",
   "pandas-stubs==2.0.3.230814; python_version<'3.9'",
-  "pytest==8.2.2",
+  "pytest==8.3.1",
   "pytest-asyncio",
   "pytest-cov",
   "pytest-postgresql",
   "asyncpg",
   "psycopg[binary]",
-  "strawberry-graphql[debug-server,opentelemetry]==0.235.0",  # need to pin version because we're monkey-patching
+  "strawberry-graphql[debug-server,opentelemetry]==0.236.0",  # need to pin version because we're monkey-patching
   "pre-commit",
   "arize[AutoEmbeddings, LLM_Evaluation]",
   "llama-index>=0.10.3",
@@ -91,6 +92,7 @@ dev = [
   "google-cloud-aiplatform>=1.3",
   "anthropic",
   "prometheus_client",
+  "asgi-lifespan",
 ]
 evals = []
 experimental = []
@@ -114,7 +116,7 @@ container = [
   "opentelemetry-instrumentation-sqlalchemy",
   "opentelemetry-instrumentation-grpc",
   "py-grpc-prometheus",
-  "strawberry-graphql[opentelemetry]==0.235.0",  # need to pin version because we're monkey-patching
+  "strawberry-graphql[opentelemetry]==0.236.0",  # need to pin version because we're monkey-patching
   "uvloop; platform_system != 'Windows'",
 ]
@@ -147,7 +149,7 @@ dependencies = [
   "numpy",
   "pandas==2.2.2; python_version>='3.9'",
   "pandas==1.4.0; python_version<'3.9'",
-  "pytest==8.2.2",
+  "pytest==8.3.1",
   "pytest-asyncio",
   "pytest-cov",
   "pytest-postgresql",
@@ -168,11 +170,12 @@ dependencies = [
   "respx", # For OpenAI testing
   "nest-asyncio", # for executor testing
   "astunparse; python_version<'3.9'",  # `ast.unparse(...)` is only available starting with Python 3.9
+  "asgi-lifespan",
 ]
 [tool.hatch.envs.type]
 dependencies = [
-  "mypy==1.10.0",
+  "mypy==1.11.0",
   "tenacity",
   "pandas>=1.0",
   "pandas-stubs==2.0.3.230814",
@@ -194,7 +197,7 @@ dependencies = [
   "opentelemetry-instrumentation-sqlalchemy",
   "opentelemetry-instrumentation-grpc",
   "py-grpc-prometheus",
-  "strawberry-graphql[opentelemetry]==0.235.0",  # need to pin version because we're monkey-patching
+  "strawberry-graphql[opentelemetry]==0.236.0",  # need to pin version because we're monkey-patching
   "requests",  # this is needed to type-check third-party packages
   "pydantic==1.10.17; python_version=='3.8'",  # lower minor versions of pydantic break strawberry mypy plugin
   "pydantic==1.10.17; python_version=='3.9'",  # lower minor versions of pydantic break strawberry mypy plugin
@@ -207,7 +210,7 @@ python = ["3.8", "3.9", "3.12"]
 [tool.hatch.envs.style]
 detached = true
 dependencies = [
-  "ruff==0.4.9",
+  "ruff==0.5.4",
 ]
 [[tool.hatch.envs.style.matrix]]
@@ -289,11 +292,11 @@ dependencies = [
 [tool.hatch.envs.publish.scripts]
 testpypi = [
-  #"check-wheel-contents dist/",
+  "check-wheel-contents dist/",
   "twine upload  --verbose --repository testpypi dist/*",
 ]
 pypi = [
-  #"check-wheel-contents dist/",
+  "check-wheel-contents dist/",
   "twine upload --verbose dist/*",
 ]
@@ -304,7 +307,7 @@ check = [
 [tool.hatch.envs.gql]
 dependencies = [
-  "strawberry-graphql[cli]==0.235.0",  # need to pin version because we're monkey-patching
+  "strawberry-graphql[cli]==0.236.0",  # need to pin version because we're monkey-patching
   "requests",
 ]

{arize_phoenix-4.12.1rc1 → arize_phoenix-4.15.0}/src/phoenix/db/bulk_inserter.py RENAMED

@@ -7,7 +7,6 @@ from itertools import islice
 from time import perf_counter
 from typing import (
     Any,
-    AsyncContextManager,
     Awaitable,
     Callable,
     Iterable,
@@ -19,7 +18,6 @@ from typing import (
 )
 from cachetools import LRUCache
-from sqlalchemy.ext.asyncio import AsyncSession
 from typing_extensions import TypeAlias
 import phoenix.trace.v1 as pb
@@ -31,6 +29,7 @@ from phoenix.db.insertion.evaluation import (
 from phoenix.db.insertion.helpers import DataManipulation, DataManipulationEvent
 from phoenix.db.insertion.span import SpanInsertionEvent, insert_span
 from phoenix.server.api.dataloaders import CacheForDataLoaders
+from phoenix.server.types import DbSessionFactory
 from phoenix.trace.schemas import Span
 logger = logging.getLogger(__name__)
@@ -46,7 +45,7 @@ class TransactionResult:
 class BulkInserter:
     def __init__(
         self,
-        db: Callable[[], AsyncContextManager[AsyncSession]],
+        db: DbSessionFactory,
         *,
         cache_for_dataloaders: Optional[CacheForDataLoaders] = None,
         initial_batch_of_operations: Iterable[DataManipulation] = (),
@@ -105,8 +104,10 @@ class BulkInserter:
         )
     async def __aexit__(self, *args: Any) -> None:
-        self._operations = None
         self._running = False
+        if self._task:
+            self._task.cancel()
+            self._task = None
     def _enqueue_operation(self, operation: DataManipulation) -> None:
         cast("Queue[DataManipulation]", self._operations).put_nowait(operation)
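
The `__aexit__` hunk above now cancels the background task on exit. The sketch below shows that shutdown pattern in isolation. `BackgroundWorker`, its `_loop` method, and the `ticks` counter are hypothetical stand-ins written for illustration; they are not part of `BulkInserter`.

```python
import asyncio
from typing import Any, Optional


class BackgroundWorker:
    """Hypothetical stand-in for a class that owns a background task."""

    def __init__(self) -> None:
        self._running = False
        self._task: Optional[asyncio.Task] = None
        self.ticks = 0

    async def __aenter__(self) -> "BackgroundWorker":
        self._running = True
        self._task = asyncio.create_task(self._loop())
        return self

    async def __aexit__(self, *args: Any) -> None:
        # Same shape as the new __aexit__ in the hunk: clear the flag, then
        # cancel the task so it cannot linger inside a long wait.
        self._running = False
        if self._task:
            self._task.cancel()
            self._task = None

    async def _loop(self) -> None:
        while self._running:
            self.ticks += 1
            await asyncio.sleep(3600)  # only cancellation interrupts this wait


async def main() -> bool:
    async with BackgroundWorker() as worker:
        task = worker._task
        await asyncio.sleep(0)  # yield once so the loop gets a chance to start
    assert task is not None
    try:
        await task
    except asyncio.CancelledError:
        pass
    return task.cancelled()


cancelled = asyncio.run(main())
```

Clearing `_running` alone would only stop the loop at its next iteration; cancelling the task also interrupts a wait that is already in progress.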

{arize_phoenix-4.12.1rc1 → arize_phoenix-4.15.0}/src/phoenix/db/engines.py RENAMED

@@ -8,7 +8,7 @@ from typing import Any
 import aiosqlite
 import numpy as np
 import sqlean
-from sqlalchemy import URL, event, make_url
+from sqlalchemy import URL, StaticPool, event, make_url
 from sqlalchemy.ext.asyncio import AsyncEngine, create_async_engine
 from typing_extensions import assert_never
@@ -105,6 +105,7 @@ def aio_sqlite_engine(
         echo=echo,
         json_serializer=_dumps,
         async_creator=async_creator,
+        poolclass=StaticPool,
     )
     event.listen(engine.sync_engine, "connect", set_sqlite_pragma)
     if not migrate:
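
The hunk above passes `poolclass=StaticPool` to the aiosqlite engine. SQLAlchemy's `StaticPool` keeps exactly one connection and hands it to every caller. The diff does not state why this was added; one commonly cited reason for pairing it with SQLite, assumed here for illustration, is that every new connection to `:memory:` opens a separate, empty database. The sketch uses only the standard-library `sqlite3` module, and the `spans` table is a made-up example.

```python
import sqlite3

# Two independent connections to ":memory:" see two different databases.
first = sqlite3.connect(":memory:")
first.execute("CREATE TABLE spans (id INTEGER PRIMARY KEY)")
first.execute("INSERT INTO spans DEFAULT VALUES")

second = sqlite3.connect(":memory:")
tables_in_second = second.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()

# Reusing the one connection (what a single-connection pool does) keeps the data.
rows_in_first = first.execute("SELECT COUNT(*) FROM spans").fetchone()[0]

first.close()
second.close()
```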

{arize_phoenix-4.12.1rc1 → arize_phoenix-4.15.0}/src/phoenix/experiments/evaluators/base.py RENAMED

@@ -90,11 +90,15 @@ class Evaluator(ABC):
             if super_cls in (LLMEvaluator, Evaluator):
                 break
             if evaluate := super_cls.__dict__.get(Evaluator.evaluate.__name__):
+                if isinstance(evaluate, classmethod):
+                    evaluate = evaluate.__func__
                 assert callable(evaluate), "`evaluate()` method should be callable"
                 # need to remove the first param, i.e. `self`
                 _validate_sig(functools.partial(evaluate, None), "evaluate")
                 return
             if async_evaluate := super_cls.__dict__.get(Evaluator.async_evaluate.__name__):
+                if isinstance(async_evaluate, classmethod):
+                    async_evaluate = async_evaluate.__func__
                 assert callable(async_evaluate), "`async_evaluate()` method should be callable"
                 # need to remove the first param, i.e. `self`
                 _validate_sig(functools.partial(async_evaluate, None), "async_evaluate")
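
The two `isinstance(..., classmethod)` checks added above matter because a class's `__dict__` stores the raw `classmethod` descriptor, which is not callable, so the `assert callable(...)` that follows would fail without unwrapping. A minimal sketch of that behavior, where `Demo` is a hypothetical class and not one from the package:

```python
import functools
import inspect
from typing import Any


class Demo:
    @classmethod
    def evaluate(cls, *, output: Any = None, **_: Any) -> bool:
        return output is not None


raw = Demo.__dict__["evaluate"]
raw_is_classmethod = isinstance(raw, classmethod)
raw_is_callable = callable(raw)  # the descriptor object itself cannot be called

func = raw.__func__  # the plain function wrapped by the descriptor
# Bind a placeholder for the first parameter (`cls` here, `self` otherwise)
# so that only the caller-facing parameters remain for inspection.
bound = functools.partial(func, None)
param_names = list(inspect.signature(bound).parameters)
```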

{arize_phoenix-4.12.1rc1 → arize_phoenix-4.15.0}/src/phoenix/experiments/evaluators/code_evaluators.py RENAMED

@@ -9,6 +9,19 @@ from phoenix.experiments.types import EvaluationResult, TaskOutput
 class JSONParsable(CodeEvaluator):
+    """
+    An evaluator that checks if the output of an experiment run is a JSON-parsable string.
+    Example:
+        .. code-block:: python
+            from phoenix.experiments import run_experiment
+            from phoenix.experiments.evaluators import JSONParsable
+            run_experiment(dataset, task, evaluators=[JSONParsable])
+    """
+    @classmethod
     def evaluate(self, *, output: Optional[TaskOutput] = None, **_: Any) -> EvaluationResult:
         assert isinstance(output, str), "Experiment run output must be a string"
         try:
@@ -22,6 +35,22 @@ class JSONParsable(CodeEvaluator):
 class ContainsKeyword(CodeEvaluator):
+    """
+    An evaluator that checks if a keyword is present in the output of an experiment run.
+    Args:
+        keyword (str): The keyword to search for in the output.
+        name (str, optional): An optional name for the evaluator. Defaults to "Contains(<keyword>)".
+    Example:
+        .. code-block:: python
+            from phoenix.experiments import run_experiment
+            from phoenix.experiments.evaluators import ContainsKeyword
+            run_experiment(dataset, task, evaluators=[ContainsKeyword("foo")])
+    """
     def __init__(self, keyword: str, name: Optional[str] = None) -> None:
         self.keyword = keyword
         self._name = name or f"Contains({repr(keyword)})"
@@ -39,6 +68,23 @@ class ContainsKeyword(CodeEvaluator):
 class ContainsAnyKeyword(CodeEvaluator):
+    """
+    An evaluator that checks if any of the keywords are present in the output of an experiment run.
+    Args:
+        keywords (List[str]): The keywords to search for in the output.
+        name (str, optional): An optional name for the evaluator. Defaults to
+            "ContainsAny(<keywords>)".
+    Example:
+        .. code-block:: python
+            from phoenix.experiments import run_experiment
+            from phoenix.experiments.evaluators import ContainsAnyKeyword
+            run_experiment(dataset, task, evaluators=[ContainsAnyKeyword(["foo", "bar"])])
+    """
     def __init__(self, keywords: List[str], name: Optional[str] = None) -> None:
         self.keywords = keywords
         self._name = name or f"ContainsAny({keywords})"
@@ -57,6 +103,23 @@ class ContainsAnyKeyword(CodeEvaluator):
 class ContainsAllKeywords(CodeEvaluator):
+    """
+    An evaluator that checks if all of the keywords are present in the output of an experiment run.
+    Args:
+        keywords (List[str]): The keywords to search for in the output.
+        name (str, optional): An optional name for the evaluator. Defaults to
+            "ContainsAll(<keywords>)".
+    Example:
+        .. code-block:: python
+            from phoenix.experiments import run_experiment
+            from phoenix.experiments.evaluators import ContainsAllKeywords
+            run_experiment(dataset, task, evaluators=[ContainsAllKeywords(["foo", "bar"])])
+    """
     def __init__(self, keywords: List[str], name: Optional[str] = None) -> None:
         self.keywords = keywords
         self._name = name or f"ContainsAll({keywords})"
@@ -77,6 +140,23 @@ class ContainsAllKeywords(CodeEvaluator):
 class MatchesRegex(CodeEvaluator):
+    r"""
+    An experiment evaluator that checks if the output of an experiment run matches a regex pattern.
+    Args:
+        pattern (Union[str, re.Pattern[str]]): The regex pattern to match the output against.
+        name (str, optional): An optional name for the evaluator. Defaults to "matches_({pattern})".
+    Example:
+        .. code-block:: python
+            from phoenix.experiments import run_experiment
+            from phoenix.experiments.evaluators import MatchesRegex
+            phone_number_evaluator = MatchesRegex(r"\d{3}-\d{3}-\d{4}", name="valid-phone-number")
+            run_experiment(dataset, task, evaluators=[phone_number_evaluator])
+    """
     def __init__(self, pattern: Union[str, re.Pattern[str]], name: Optional[str] = None) -> None:
         if isinstance(pattern, str):
             pattern = re.compile(pattern)
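
The docstrings added in this file describe what each code evaluator checks, but the `evaluate` bodies are mostly outside the hunks. The following is a rough sketch of the checks as the docstrings describe them, not the package's implementation. The function names are hypothetical, and two details are assumptions: substring matching is case-sensitive, and the regex is applied with `search` rather than a full match.

```python
import json
import re
from typing import List


def is_json_parsable(output: str) -> bool:
    """True if the string parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


def contains_any(output: str, keywords: List[str]) -> bool:
    """True if at least one keyword occurs in the output."""
    return any(keyword in output for keyword in keywords)


def contains_all(output: str, keywords: List[str]) -> bool:
    """True if every keyword occurs in the output."""
    return all(keyword in output for keyword in keywords)


# Same pattern as the MatchesRegex docstring example.
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")


def matches_phone_number(output: str) -> bool:
    """True if the pattern is found anywhere in the output."""
    return PHONE_PATTERN.search(output) is not None
```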

{arize_phoenix-4.12.1rc1 → arize_phoenix-4.15.0}/src/phoenix/experiments/evaluators/llm_evaluators.py RENAMED

@@ -18,6 +18,31 @@ from phoenix.experiments.types import (
 class LLMCriteriaEvaluator(LLMEvaluator):
+    """
+    An experiment evaluator that uses an LLM to evaluate whether the text meets a custom criteria.
+    This evaluator uses the chain-of-thought technique to perform a binary evaluation of text based
+    on a custom criteria and description. When used as an experiment evaluator,
+    `LLMCriteriaEvaluator` will return a score of 1.0 if the text meets the criteria and a score of
+    0.0 if not. The explanation produced by the chain-of-thought technique will be included in the
+    experiment evaluation as well.
+    Example criteria and descriptions:
+        - "thoughtfulness" - "shows careful consideration and fair judgement"
+        - "clarity" - "is easy to understand and follow"
+        - "professionalism" - "is respectful and appropriate for a formal setting"
+    Args:
+        model: The LLM model wrapper to use for evaluation. Compatible models can be imported from
+            the `phoenix.evals` module.
+        criteria: The criteria to evaluate the text against, the criteria should be able to be used
+            as a noun in a sentence.
+        description (str): A description of the criteria, used to clarify instructions to the LLM.
+            The description should complete this sentence: "{criteria} means the text
+            {description}".
+        name (str): The name of the evaluator
+    """
     _base_template = (
         "Determine if the following text is {criteria}. {description}"
         "First, explain step-by-step why you think the text is or is not {criteria}. Then provide "
@@ -117,6 +142,14 @@ ConcisenessEvaluator = criteria_evaluator_factory(
     description="is just a few sentences and easy to follow",
     default_name="Conciseness",
 )
+"""
+An experiment evaluator that uses an LLM to evaluate whether the text is concise.
+Args:
+    model: The LLM model wrapper to use for evaluation. Compatible models can be imported from
+        the `phoenix.evals` module.
+    name (str, optional): The name of the evaluator, defaults to "Conciseness".
+"""
 HelpfulnessEvaluator = criteria_evaluator_factory(
@@ -125,6 +158,14 @@ HelpfulnessEvaluator = criteria_evaluator_factory(
     description="provides useful information",
     default_name="Helpfulness",
 )
+"""
+An experiment evaluator that uses an LLM to evaluate whether the text is helpful.
+Args:
+    model: The LLM model wrapper to use for evaluation. Compatible models can be imported from
+        the `phoenix.evals` module.
+    name (str, optional): The name of the evaluator, defaults to "Helpfulness".
+"""
 CoherenceEvaluator = criteria_evaluator_factory(
@@ -133,6 +174,14 @@ CoherenceEvaluator = criteria_evaluator_factory(
     description="is coherent, well-structured, and logically sound",
     default_name="Coherence",
 )
+"""
+An experiment evaluator that uses an LLM to evaluate whether the text is coherent.
+Args:
+    model: The LLM model wrapper to use for evaluation. Compatible models can be imported from
+        the `phoenix.evals` module.
+    name (str, optional): The name of the evaluator, defaults to "Coherence".
+"""
 def _parse_label_from_explanation(raw_string: str) -> str:
@@ -149,6 +198,33 @@ def _parse_label_from_explanation(raw_string: str) -> str:
 class RelevanceEvaluator(LLMEvaluator):
+    """
+    An experiment evaluator that uses an LLM to evaluate whether a response is relevant to a query.
+    This evaluator uses the chain-of-thought technique to perform a binary evaluation of whether
+    the output "response" of an experiment is relevant to its input "query". When used as an
+    experiment evaluator, `RelevanceEvaluator` will return a score of 1.0 if the response is
+    relevant to the query and a score of 0.0 if not. The explanation produced by the
+    chain-of-thought technique will be included in the experiment evaluation as well.
+    Optionally, you can provide custom functions to extract the query and response from the input
+    and output of the experiment task. By default, the evaluator will use the dataset example as
+    the input and the output of the experiment task as the response.
+    Args:
+        model: The LLM model wrapper to use for evaluation. Compatible models can be imported from
+            the `phoenix.evals` module.
+        get_query (callable, optional): A function that extracts the query from the input of the
+            experiment task. The function should take the input and metadata of the dataset example
+            and return a string. By default, the function will return the string representation of
+            the input.
+        get_response (callable, optional): A function that extracts the response from the output of
+            the experiment task. The function should take the output and metadata of the experiment
+            task and return a string. By default, the function will return the string representation
+            of the output.
+        name (str, optional): The name of the evaluator. Defaults to "Relevance".
+    """
     template = (
         "Determine if the following response is relevant to the query. In this context, "
         "'relevance' means that the response directly addresses the core question or topic of the "
@@ -174,7 +250,7 @@ class RelevanceEvaluator(LLMEvaluator):
         model: LLMBaseModel,
         get_query: Optional[Callable[[ExampleInput, ExampleMetadata], str]] = None,
         get_response: Optional[Callable[[Optional[TaskOutput], ExampleMetadata], str]] = None,
-        name: str = "RelevanceEvaluator",
+        name: str = "Relevance",
     ):
         self.model = model
         self._name = name
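
The `LLMCriteriaEvaluator` docstring above says a description should complete the sentence "{criteria} means the text {description}", and that the evaluation is binary (1.0 or 0.0). The sketch below checks the docstring's three example pairs against that sentence and shows the binary scoring. `criteria_sentence` and `binary_score` are hypothetical helpers written for illustration; no LLM call is made and this is not the evaluator's code.

```python
# Criteria and descriptions copied from the docstring examples.
EXAMPLES = {
    "thoughtfulness": "shows careful consideration and fair judgement",
    "clarity": "is easy to understand and follow",
    "professionalism": "is respectful and appropriate for a formal setting",
}


def criteria_sentence(criteria: str, description: str) -> str:
    """Fill the sentence that the description is expected to complete."""
    return f"{criteria} means the text {description}"


def binary_score(meets_criteria: bool) -> float:
    """Map a yes/no judgement to the 1.0 / 0.0 score the docstring describes."""
    return 1.0 if meets_criteria else 0.0


sentences = [criteria_sentence(c, d) for c, d in EXAMPLES.items()]
```

Reading each generated sentence aloud is a quick way to check that a new criteria and description pair fits the expected grammar.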

{arize_phoenix-4.12.1rc1 → arize_phoenix-4.15.0}/src/phoenix/experiments/evaluators/utils.py RENAMED

@@ -1,6 +1,5 @@
 import functools
 import inspect
-from itertools import chain, islice, repeat
 from typing import TYPE_CHECKING, Any, Callable, Optional, Union
 from phoenix.experiments.types import (
@@ -75,6 +74,72 @@ def create_evaluator(
     name: Optional[str] = None,
     scorer: Optional[Callable[[Any], EvaluationResult]] = None,
 ) -> Callable[[Callable[..., Any]], "Evaluator"]:
+    """
+    A decorator that configures a sync or async function to be used as an experiment evaluator.
+    If the `evaluator` is a function of one argument then that argument will be
+    bound to the `output` of an experiment task. Alternatively, the `evaluator` can be a function
+    of any combination of specific argument names that will be bound to special values:
+        `input`: The input field of the dataset example
+        `output`: The output of an experiment task
+        `expected`: The expected or reference output of the dataset example
+        `reference`: An alias for `expected`
+        `metadata`: Metadata associated with the dataset example
+    Args:
+        kind (str | AnnotatorKind): Broadly indicates how the evaluator scores an experiment run.
+            Valid kinds are: "CODE", "LLM". Defaults to "CODE".
+        name (str, optional): The name of the evaluator. If not provided, the name of the function
+            will be used.
+        scorer (callable, optional): An optional function that converts the output of the wrapped
+            function into an `EvaluationResult`. This allows configuring the evaluation
+            payload by setting a label, score and explanation. By default, numeric outputs will
+            be recorded as scores, boolean outputs will be recorded as scores and labels, and
+            string outputs will be recorded as labels. If the output is a 2-tuple, the first item
+            will be recorded as the score and the second item will recorded as the explanation.
+    Examples:
+        Configuring an evaluator that returns a boolean
+        .. code-block:: python
+            @create_evaluator(kind="CODE", name="exact-match)
+            def match(output: str, expected: str) -> bool:
+                return output == expected
+        Configuring an evaluator that returns a label
+        .. code-block:: python
+            client = openai.Client()
+            @create_evaluator(kind="LLM")
+            def label(output: str) -> str:
+                res = client.chat.completions.create(
+                    model = "gpt-4",
+                    messages = [
+                        {
+                            "role": "user",
+                            "content": (
+                                "in one word, characterize the sentiment of the following customer "
+                                f"request: {output}"
+                            )
+                        },
+                    ],
+                )
+                label = res.choices[0].message.content
+                return label
+        Configuring an evaluator that returns a score and explanation
+        .. code-block:: python
+            from textdistance import levenshtein
+            @create_evaluator(kind="CODE", name="levenshtein-distance")
+            def ld(output: str, expected: str) -> Tuple[float, str]:
+                return (
+                    levenshtein(output, expected),
+                    f"Levenshtein distance between {output} and {expected}"
+                )
+    """
     if scorer is None:
         scorer = _default_eval_scorer
@@ -163,24 +228,8 @@ def _default_eval_scorer(result: Any) -> EvaluationResult:
         return EvaluationResult(score=float(result))
     if isinstance(result, str):
         return EvaluationResult(label=result)
-    if isinstance(result, (tuple, list)) and 0 < len(result) <= 3:
-        # Possible interpretations are:
-        # - 3-tuple: (Score, Label, Explanation)
-        # - 2-tuple: (Score, Explanation) or (Label, Explanation)
-        # - 1-tuple: (Score, ) or (Label, )
-        # Note that (Score, Label) conflicts with (Score, Explanation) and we
-        # pick the latter because it's probably more prevalent. To get
-        # (Score, Label), use a 3-tuple instead, i.e. (Score, Label, None).
-        a, b, c = islice(chain(result, repeat(None)), 3)
-        score, label, explanation = None, a, b
-        if hasattr(a, "__float__"):
-            try:
-                score = float(a)
-            except ValueError:
-                pass
-            else:
-                label, explanation = (None, b) if len(result) < 3 else (b, c)
-        return EvaluationResult(score=score, label=label, explanation=explanation)
-    if result is None:
-        return EvaluationResult(score=0)
+    if isinstance(result, (tuple, list)) and len(result) == 2:
+        # If the result is a 2-tuple, the first item will be recorded as the score
+        # and the second item will recorded as the explanation.
+        return EvaluationResult(score=float(result[0]), explanation=str(result[1]))
     raise ValueError(f"Unsupported evaluation result type: {type(result)}")
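
The last hunk narrows what `_default_eval_scorer` accepts: only 2-item tuples or lists remain, and the `None` branch is gone, so 1-tuples, 3-tuples, and `None` now reach the `ValueError`. The sketch below mirrors that behavior with a hypothetical `Result` dataclass standing in for `EvaluationResult`. The boolean and numeric branches sit outside the hunk, so they are assumptions reconstructed from the `create_evaluator` docstring.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Result:
    """Hypothetical stand-in for EvaluationResult."""

    score: Optional[float] = None
    label: Optional[str] = None
    explanation: Optional[str] = None


def score_result(result: Any) -> Result:
    # Assumed branch: bool is checked first because bool is a subclass of int.
    if isinstance(result, bool):
        return Result(score=float(result), label=str(result))
    # Assumed branch: plain numbers become scores.
    if isinstance(result, (int, float)):
        return Result(score=float(result))
    if isinstance(result, str):
        return Result(label=result)
    # As in the hunk: exactly two items, read as (score, explanation).
    if isinstance(result, (tuple, list)) and len(result) == 2:
        return Result(score=float(result[0]), explanation=str(result[1]))
    raise ValueError(f"Unsupported evaluation result type: {type(result)}")


def is_rejected(value: Any) -> bool:
    """Helper for the examples: True if the scorer raises ValueError."""
    try:
        score_result(value)
    except ValueError:
        return True
    return False
```

Under this reading, an evaluator that previously returned `(score, label, explanation)` or `None` would need a custom `scorer` after the upgrade.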
