arize-phoenix 0.0.50rc1__tar.gz → 1.1.1__tar.gz
This diff shows the changes between publicly released package versions as they appear in their respective registries; it is provided for informational purposes only.
Potentially problematic release. This version of arize-phoenix might be problematic.
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/PKG-INFO +13 -7
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/README.md +7 -2
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/pyproject.toml +21 -18
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/__init__.py +1 -1
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/__init__.py +11 -10
- arize_phoenix-1.1.1/src/phoenix/experimental/evals/evaluators.py +139 -0
- arize_phoenix-1.1.1/src/phoenix/experimental/evals/functions/__init__.py +4 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/functions/classify.py +125 -76
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/functions/generate.py +32 -9
- arize_phoenix-1.1.1/src/phoenix/experimental/evals/models/__init__.py +6 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/models/base.py +10 -8
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/models/openai.py +144 -77
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/models/vertexai.py +1 -1
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/retrievals.py +6 -3
- arize_phoenix-1.1.1/src/phoenix/experimental/evals/templates/__init__.py +38 -0
- arize_phoenix-1.1.1/src/phoenix/experimental/evals/templates/default_templates.py +343 -0
- arize_phoenix-1.1.1/src/phoenix/experimental/evals/templates/template.py +177 -0
- arize_phoenix-1.1.1/src/phoenix/server/static/index.js +6845 -0
- arize_phoenix-1.1.1/src/phoenix/trace/evaluation_conventions.py +26 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/langchain/instrumentor.py +1 -1
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/langchain/tracer.py +39 -32
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/llama_index/callback.py +160 -50
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/openai/instrumentor.py +49 -40
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/semantic_conventions.py +2 -35
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/utils.py +13 -1
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/v1/__init__.py +14 -8
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/utilities/logging.py +3 -1
- arize_phoenix-0.0.50rc1/src/phoenix/experimental/evals/functions/__init__.py +0 -4
- arize_phoenix-0.0.50rc1/src/phoenix/experimental/evals/models/__init__.py +0 -5
- arize_phoenix-0.0.50rc1/src/phoenix/experimental/evals/templates/__init__.py +0 -26
- arize_phoenix-0.0.50rc1/src/phoenix/experimental/evals/templates/default_templates.py +0 -128
- arize_phoenix-0.0.50rc1/src/phoenix/experimental/evals/templates/template.py +0 -138
- arize_phoenix-0.0.50rc1/src/phoenix/server/static/index.js +0 -6829
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/.gitignore +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/IP_NOTICE +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/LICENSE +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/config.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/core/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/core/embedding_dimension.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/core/model.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/core/model_schema.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/core/model_schema_adapter.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/core/traces.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/datasets/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/datasets/dataset.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/datasets/errors.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/datasets/fixtures.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/datasets/schema.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/datasets/validation.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/datetime_utils.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/functions/processing.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/models/bedrock.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/utils/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/utils/downloads.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/utils/threads.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/utils/types.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/utils.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/metrics/README.md +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/metrics/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/metrics/binning.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/metrics/metrics.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/metrics/mixins.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/metrics/timeseries.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/metrics/wrappers.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/pointcloud/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/pointcloud/clustering.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/pointcloud/pointcloud.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/pointcloud/projectors.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/pointcloud/umap_parameters.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/py.typed +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/context.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/helpers.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/input_types/ClusterInput.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/input_types/Coordinates.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/input_types/DataQualityMetricInput.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/input_types/DimensionFilter.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/input_types/DimensionInput.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/input_types/Granularity.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/input_types/PerformanceMetricInput.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/input_types/SpanSort.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/input_types/TimeRange.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/input_types/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/interceptor.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/schema.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/Cluster.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/DataQualityMetric.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/Dataset.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/DatasetInfo.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/DatasetRole.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/DatasetValues.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/Dimension.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/DimensionDataType.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/DimensionShape.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/DimensionType.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/DimensionWithValue.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/EmbeddingDimension.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/EmbeddingMetadata.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/Event.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/EventMetadata.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/ExportEventsMutation.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/ExportedFile.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/Functionality.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/MimeType.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/Model.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/NumericRange.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/PerformanceMetric.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/PromptResponse.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/Retrieval.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/ScalarDriftMetricEnum.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/Segments.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/SortDir.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/Span.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/TimeSeries.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/UMAPPoints.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/ValidationResult.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/VectorDriftMetricEnum.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/node.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/api/types/pagination.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/app.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/main.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/span_handler.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/static/apple-touch-icon-114x114.png +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/static/apple-touch-icon-120x120.png +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/static/apple-touch-icon-144x144.png +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/static/apple-touch-icon-152x152.png +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/static/apple-touch-icon-180x180.png +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/static/apple-touch-icon-72x72.png +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/static/apple-touch-icon-76x76.png +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/static/apple-touch-icon.png +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/static/favicon.ico +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/static/index.css +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/static/modernizr.js +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/templates/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/templates/index.html +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/server/thread_server.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/services.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/session/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/session/session.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/exporter.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/filter.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/fixtures.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/langchain/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/llama_index/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/llama_index/debug_callback.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/openai/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/schemas.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/span_json_decoder.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/span_json_encoder.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/trace_dataset.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/tracer.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/v1/trace_pb2.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/trace/v1/trace_pb2.pyi +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/utilities/__init__.py +0 -0
- {arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/utilities/error_handling.py +0 -0
{arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/PKG-INFO

@@ -1,6 +1,6 @@
 Metadata-Version: 2.1
 Name: arize-phoenix
-Version:
+Version: 1.1.1
 Summary: ML Observability in your notebook
 Project-URL: Documentation, https://docs.arize.com/phoenix/
 Project-URL: Issues, https://github.com/Arize-ai/phoenix/issues
@@ -35,22 +35,23 @@ Requires-Dist: uvicorn
 Requires-Dist: wrapt
 Provides-Extra: dev
 Requires-Dist: arize[autoembeddings,llm-evaluation]; extra == 'dev'
-Requires-Dist: black[jupyter]; extra == 'dev'
 Requires-Dist: gcsfs; extra == 'dev'
 Requires-Dist: hatch; extra == 'dev'
 Requires-Dist: jupyter; extra == 'dev'
-Requires-Dist: langchain>=0.0.
-Requires-Dist: llama-index>=0.
+Requires-Dist: langchain>=0.0.334; extra == 'dev'
+Requires-Dist: llama-index>=0.9.0; extra == 'dev'
 Requires-Dist: nbqa; extra == 'dev'
 Requires-Dist: pandas-stubs<=2.0.2.230605; extra == 'dev'
 Requires-Dist: pre-commit; extra == 'dev'
 Requires-Dist: pytest; extra == 'dev'
 Requires-Dist: pytest-cov; extra == 'dev'
 Requires-Dist: pytest-lazy-fixture; extra == 'dev'
-Requires-Dist: ruff==0.
+Requires-Dist: ruff==0.1.5; extra == 'dev'
 Requires-Dist: strawberry-graphql[debug-server]==0.208.2; extra == 'dev'
 Provides-Extra: experimental
 Requires-Dist: tenacity; extra == 'experimental'
+Provides-Extra: llama-index
+Requires-Dist: llama-index~=0.9.0; extra == 'llama-index'
 Description-Content-Type: text/markdown
 
 <p align="center">
@@ -102,6 +103,7 @@ Phoenix provides MLOps and LLMOps insights at lightning speed with zero-config o
 - [Exportable Clusters](#exportable-clusters)
 - [Retrieval-Augmented Generation Analysis](#retrieval-augmented-generation-analysis)
 - [Structured Data Analysis](#structured-data-analysis)
+- [Breaking Changes](#breaking-changes)
 - [Community](#community)
 - [Thanks](#thanks)
 - [Copyright, Patent, and License](#copyright-patent-and-license)
@@ -267,7 +269,7 @@ pip install arize-phoenix[experimental] ipython matplotlib openai pycm scikit-le
 
 ```python
 from phoenix.experimental.evals import (
-
+    RAG_RELEVANCY_PROMPT_TEMPLATE,
     RAG_RELEVANCY_PROMPT_RAILS_MAP,
     OpenAIModel,
     download_benchmark_dataset,
@@ -292,7 +294,7 @@ model = OpenAIModel(
     temperature=0.0,
 )
 rails =list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
-df["eval_relevance"] = llm_classify(df, model,
+df[["eval_relevance"]] = llm_classify(df, model, RAG_RELEVANCY_PROMPT_TEMPLATE, rails)
 #Golden dataset has True/False map to -> "irrelevant" / "relevant"
 #we can then scikit compare to output of template - same format
 y_true = df["relevant"].map({True: "relevant", False: "irrelevant"})
@@ -419,6 +421,10 @@ train_ds = px.Dataset(dataframe=train_df, schema=schema, name="training")
 session = px.launch_app(primary=prod_ds, reference=train_ds)
 ```
 
+## Breaking Changes
+
+- **v1.0.0** - Phoenix now exclusively supports the `openai>=1.0.0` sdk. If you are using an older version of the OpenAI SDK, you can continue to use `arize-phoenix==0.1.1`. However, we recommend upgrading to the latest version of the OpenAI SDK as it contains many improvements. If you are using Phoenix with LlamaIndex and and LangChain, you will have to upgrade to the versions of these packages that support the OpenAI `1.0.0` SDK as well (`llama-index>=0.8.64`, `langchain>=0.0.334`)
+
 ## Community
 
 Join our community to connect with thousands of machine learning practitioners and ML observability enthusiasts.
{arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/README.md

@@ -47,6 +47,7 @@ Phoenix provides MLOps and LLMOps insights at lightning speed with zero-config o
 - [Exportable Clusters](#exportable-clusters)
 - [Retrieval-Augmented Generation Analysis](#retrieval-augmented-generation-analysis)
 - [Structured Data Analysis](#structured-data-analysis)
+- [Breaking Changes](#breaking-changes)
 - [Community](#community)
 - [Thanks](#thanks)
 - [Copyright, Patent, and License](#copyright-patent-and-license)
@@ -212,7 +213,7 @@ pip install arize-phoenix[experimental] ipython matplotlib openai pycm scikit-le
 
 ```python
 from phoenix.experimental.evals import (
-
+    RAG_RELEVANCY_PROMPT_TEMPLATE,
     RAG_RELEVANCY_PROMPT_RAILS_MAP,
     OpenAIModel,
     download_benchmark_dataset,
@@ -237,7 +238,7 @@ model = OpenAIModel(
     temperature=0.0,
 )
 rails =list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
-df["eval_relevance"] = llm_classify(df, model,
+df[["eval_relevance"]] = llm_classify(df, model, RAG_RELEVANCY_PROMPT_TEMPLATE, rails)
 #Golden dataset has True/False map to -> "irrelevant" / "relevant"
 #we can then scikit compare to output of template - same format
 y_true = df["relevant"].map({True: "relevant", False: "irrelevant"})
@@ -364,6 +365,10 @@ train_ds = px.Dataset(dataframe=train_df, schema=schema, name="training")
 session = px.launch_app(primary=prod_ds, reference=train_ds)
 ```
 
+## Breaking Changes
+
+- **v1.0.0** - Phoenix now exclusively supports the `openai>=1.0.0` sdk. If you are using an older version of the OpenAI SDK, you can continue to use `arize-phoenix==0.1.1`. However, we recommend upgrading to the latest version of the OpenAI SDK as it contains many improvements. If you are using Phoenix with LlamaIndex and and LangChain, you will have to upgrade to the versions of these packages that support the OpenAI `1.0.0` SDK as well (`llama-index>=0.8.64`, `langchain>=0.0.334`)
+
 ## Community
 
 Join our community to connect with thousands of machine learning practitioners and ML observability enthusiasts.
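The two README hunks above change both the import surface (`RAG_RELEVANCY_PROMPT_TEMPLATE` replaces the old string constant) and the assignment target (`llm_classify` now returns a DataFrame, hence `df[["eval_relevance"]]`). Below is a minimal sketch of the updated call; the sample DataFrame columns and the `model_name` argument are illustrative assumptions, not taken from the diff.

```python
import pandas as pd

from phoenix.experimental.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    OpenAIModel,
    llm_classify,
)

# Hypothetical two-column frame; the real column names must match the
# variables used by RAG_RELEVANCY_PROMPT_TEMPLATE (not shown in the hunks).
# In the README the data comes from download_benchmark_dataset instead.
df = pd.DataFrame(
    {
        "input": ["What is Phoenix?"],
        "reference": ["Phoenix is an open-source ML observability library."],
    }
)

# Requires OPENAI_API_KEY; the model name is an illustrative choice.
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())

# llm_classify returns a DataFrame, so assign through a column list.
df[["eval_relevance"]] = llm_classify(df, model, RAG_RELEVANCY_PROMPT_TEMPLATE, rails)
```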
{arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/pyproject.toml

@@ -43,12 +43,11 @@ dynamic = ["version"]
 
 [project.optional-dependencies]
 dev = [
-  "black[jupyter]",
   "gcsfs",
   "hatch",
   "jupyter",
   "nbqa",
-  "ruff==0.
+  "ruff==0.1.5",
   "pandas-stubs<=2.0.2.230605", # version 2.0.3.230814 is causing a dependency conflict.
   "pytest",
   "pytest-cov",
@@ -56,12 +55,15 @@ dev = [
   "strawberry-graphql[debug-server]==0.208.2",
   "pre-commit",
   "arize[AutoEmbeddings, LLM_Evaluation]",
-  "llama-index>=0.
-  "langchain>=0.0.
+  "llama-index>=0.9.0",
+  "langchain>=0.0.334",
 ]
 experimental = [
   "tenacity",
 ]
+llama-index = [
+  "llama-index~=0.9.0",
+]
 
 [project.urls]
 Documentation = "https://docs.arize.com/phoenix/"
@@ -92,9 +94,9 @@ dependencies = [
   "pytest-cov",
   "pytest-lazy-fixture",
   "arize",
-  "langchain>=0.0.
-  "llama-index>=0.
-  "openai",
+  "langchain>=0.0.334",
+  "llama-index>=0.9.0",
+  "openai>=1.0.0",
   "tenacity",
   "nltk==3.8.1",
   "sentence-transformers==2.2.2",
@@ -104,25 +106,26 @@ dependencies = [
   "responses",
   "tiktoken",
   "typing-extensions<4.6.0", # for Colab
+  "httpx", # For OpenAI testing
+  "respx", # For OpenAI testing
 ]
 
 [tool.hatch.envs.type]
 dependencies = [
   "mypy==1.5.1",
-  "llama-index>=0.
+  "llama-index>=0.9.0",
   "pandas-stubs<=2.0.2.230605", # version 2.0.3.230814 is causing a dependency conflict.
   "types-psutil",
   "types-tqdm",
   "types-requests",
   "types-protobuf",
+  "openai>=1.0.0",
 ]
 
 [tool.hatch.envs.style]
 detached = true
 dependencies = [
-  "
-  "black[jupyter]~=23.3.0",
-  "ruff~=0.0.290",
+  "ruff~=0.1.5",
 ]
 
 [tool.hatch.envs.notebooks]
@@ -178,11 +181,11 @@ check = [
 
 [tool.hatch.envs.style.scripts]
 check = [
-  "black --check --diff --color .",
   "ruff .",
+  "ruff format --check --diff .",
 ]
 fix = [
-  "
+  "ruff format .",
   "ruff --fix .",
 ]
 
@@ -207,10 +210,6 @@ pypi = [
   "twine upload --verbose dist/*",
 ]
 
-[tool.black]
-line-length = 100
-exclude = '_pb2\.pyi?$'
-
 [tool.hatch.envs.docs.scripts]
 check = [
   "interrogate -vv src/",
@@ -278,11 +277,15 @@ module = [
 ignore_missing_imports = true
 
 [tool.ruff]
-exclude = [".git", "__pycache__", "docs/source/conf.py", "*_pb2.py*"]
+exclude = [".git", "__pycache__", "docs/source/conf.py", "*_pb2.py*", "*.pyi"]
+extend-include = ["*.ipynb"]
 ignore-init-module-imports = true
 line-length = 100
 select = ["E", "F", "W", "I"]
 target-version = "py38"
 
+[tool.ruff.lint.per-file-ignores]
+"*.ipynb" = ["E402", "E501"]
+
 [tool.ruff.isort]
 force-single-line = false
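Beyond the version bumps, two packaging changes stand out in this file: a new `llama-index` optional dependency group, which can presumably be pulled in with `pip install 'arize-phoenix[llama-index]'`, and the replacement of Black with Ruff's formatter, since the `[tool.black]` table is deleted and the style scripts now run `ruff format` alongside `ruff`.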
{arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/__init__.py

@@ -5,7 +5,7 @@ from .session.session import Session, active_session, close_app, launch_app
 from .trace.fixtures import load_example_traces
 from .trace.trace_dataset import TraceDataset
 
-__version__ = "
+__version__ = "1.1.1"
 
 # module level doc-string
 __doc__ = """
{arize_phoenix-0.0.50rc1 → arize_phoenix-1.1.1}/src/phoenix/experimental/evals/__init__.py

@@ -1,16 +1,17 @@
-from .functions import llm_classify,
+from .functions import llm_classify, llm_generate, run_relevance_eval
 from .models import OpenAIModel, VertexAIModel
 from .retrievals import compute_precisions_at_k
 from .templates import (
     CODE_READABILITY_PROMPT_RAILS_MAP,
-
+    CODE_READABILITY_PROMPT_TEMPLATE,
     HALLUCINATION_PROMPT_RAILS_MAP,
-
+    HALLUCINATION_PROMPT_TEMPLATE,
     NOT_PARSABLE,
     RAG_RELEVANCY_PROMPT_RAILS_MAP,
-
+    RAG_RELEVANCY_PROMPT_TEMPLATE,
     TOXICITY_PROMPT_RAILS_MAP,
-
+    TOXICITY_PROMPT_TEMPLATE,
+    ClassificationTemplate,
     PromptTemplate,
 )
 from .utils.downloads import download_benchmark_dataset
@@ -19,19 +20,19 @@ __all__ = [
     "compute_precisions_at_k",
     "download_benchmark_dataset",
     "llm_classify",
-    "llm_eval_binary",
     "llm_generate",
     "OpenAIModel",
     "VertexAIModel",
     "PromptTemplate",
+    "ClassificationTemplate",
     "CODE_READABILITY_PROMPT_RAILS_MAP",
-    "
+    "CODE_READABILITY_PROMPT_TEMPLATE",
     "HALLUCINATION_PROMPT_RAILS_MAP",
-    "
+    "HALLUCINATION_PROMPT_TEMPLATE",
     "RAG_RELEVANCY_PROMPT_RAILS_MAP",
-    "
-    "TOXICITY_PROMPT_TEMPLATE_STR",
+    "RAG_RELEVANCY_PROMPT_TEMPLATE",
     "TOXICITY_PROMPT_RAILS_MAP",
+    "TOXICITY_PROMPT_TEMPLATE",
     "NOT_PARSABLE",
     "run_relevance_eval",
 ]
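This hunk is the import-level view of the evals API rename: `llm_eval_binary` disappears in favor of `llm_classify`, the `*_TEMPLATE_STR` string constants become `*_PROMPT_TEMPLATE` objects, and `ClassificationTemplate` is newly exported. A before/after import sketch using only names visible in this hunk (the commented "before" line is reconstructed from the removed `__all__` entries):

```python
# Before (0.0.50rc1):
# from phoenix.experimental.evals import llm_eval_binary, TOXICITY_PROMPT_TEMPLATE_STR

# After (1.1.1):
from phoenix.experimental.evals import (
    ClassificationTemplate,
    TOXICITY_PROMPT_RAILS_MAP,
    TOXICITY_PROMPT_TEMPLATE,
    llm_classify,
)
```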
arize_phoenix-1.1.1/src/phoenix/experimental/evals/evaluators.py (new file)

@@ -0,0 +1,139 @@
+from typing import List, Optional
+
+from phoenix.experimental.evals import PromptTemplate
+from phoenix.experimental.evals.models import BaseEvalModel
+
+
+class MapReducer:
+    """
+    Evaluates data that is too large to fit into a single context window using a
+    map-reduce strategy. The data must first be divided into "chunks" that
+    individually fit into an LLM's context window. Each chunk of data is
+    individually evaluated (the "map" step), producing intermediate outputs that
+    are combined into a single result (the "reduce" step).
+
+    This is the simplest strategy for evaluating long-context data.
+    """
+
+    def __init__(
+        self,
+        model: BaseEvalModel,
+        map_prompt_template: PromptTemplate,
+        reduce_prompt_template: PromptTemplate,
+    ) -> None:
+        """Initializes an instance.
+
+        Args:
+            model (BaseEvalModel): The LLM model to use for evaluation.
+
+            map_prompt_template (PromptTemplate): The template that is mapped
+            over each chunk to produce intermediate outputs. Must contain the
+            {chunk} placeholder.
+
+            reduce_prompt_template (PromptTemplate): The template that combines
+            the intermediate outputs into a single result. Must contain the
+            {mapped} placeholder, which will be formatted as a list of the
+            intermediate outputs produced by the map step.
+        """
+        self._model = model
+        self._map_prompt_template = map_prompt_template
+        self._reduce_prompt_template = reduce_prompt_template
+
+    def evaluate(self, chunks: List[str]) -> str:
+        """Evaluates a list of two or more chunks.
+
+        Args:
+            chunks (List[str]): A list of chunks to be evaluated. Each chunk is
+            inserted into the map_prompt_template and must therefore fit within
+            the LLM's context window and still leave room for the rest of the
+            prompt.
+
+        Returns:
+            str: The output of the map-reduce process.
+        """
+        if len(chunks) < 2:
+            raise ValueError(
+                "The map-reduce strategy is not needed to evaluate data "
+                "that fits within a single context window. "
+                "Consider using llm_classify instead."
+            )
+        model = self._model
+        mapped_records = []
+        for chunk in chunks:
+            map_prompt = self._map_prompt_template.format({"chunk": chunk})
+            intermediate_output = model(map_prompt)
+            mapped_records.append(intermediate_output)
+        reduce_prompt = self._reduce_prompt_template.format({"mapped": repr(mapped_records)})
+        return model(reduce_prompt)
+
+
+class Refiner:
+    """
+    Evaluates data that is too large to fit into a single context window using a
+    refine strategy. The data must first be divided into "chunks" that
+    individually fit into an LLM's context window. An initial "accumulator" is
+    generated from the first chunk of data. The accumulator is subsequently
+    refined by iteratively updating and incorporating new information from each
+    subsequent chunk. An optional synthesis step can be used to synthesize the
+    final accumulator into a desired format.
+    """
+
+    def __init__(
+        self,
+        model: BaseEvalModel,
+        initial_prompt_template: PromptTemplate,
+        refine_prompt_template: PromptTemplate,
+        synthesize_prompt_template: Optional[PromptTemplate] = None,
+    ) -> None:
+        """Initializes an instance.
+
+        Args:
+            model (BaseEvalModel): The LLM model to use for evaluation.
+
+            initial_prompt_template (PromptTemplate): The template for the
+            initial invocation of the model that will generate the initial
+            accumulator. Should contain the {chunk} placeholder.
+
+            refine_prompt_template (PromptTemplate): The template for refining
+            the accumulator across all subsequence chunks. Must contain the
+            {chunk} and {accumulator} placeholders.
+
+            synthesize_prompt_template (Optional[PromptTemplate], optional): An
+            optional template to synthesize the final version of the
+            accumulator. Must contain the {accumulator} placeholder.
+        """
+        self._model = model
+        self._initial_prompt_template = initial_prompt_template
+        self._refine_prompt_template = refine_prompt_template
+        self._synthesize_prompt_template = synthesize_prompt_template
+
+    def evaluate(self, chunks: List[str]) -> str:
+        """Evaluates a list of two or more chunks.
+
+        Args:
+            chunks (List[str]): A list of chunks to be evaluated. Each chunk is
+            inserted into the initial_prompt_template and refine_prompt_template
+            and must therefore fit within the LLM's context window and still
+            leave room for the rest of the prompt.
+
+        Returns:
+            str: The output of the refine process.
+        """
+        if len(chunks) < 2:
+            raise ValueError(
+                "The refine strategy is not needed to evaluate data "
+                "that fits within a single context window. "
+                "Consider using llm_classify instead."
+            )
+        model = self._model
+        initial_prompt = self._initial_prompt_template.format({"chunk": chunks[0]})
+        accumulator = model(initial_prompt)
+        for chunk in chunks[1:]:
+            refine_prompt = self._refine_prompt_template.format(
+                {"accumulator": accumulator, "chunk": chunk}
+            )
+            accumulator = model(refine_prompt)
+        if not self._synthesize_prompt_template:
+            return accumulator
+        reduce_prompt = self._synthesize_prompt_template.format({"accumulator": accumulator})
+        return model(reduce_prompt)