judgeval 0.0.22__tar.gz → 0.0.24__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- judgeval-0.0.24/PKG-INFO +156 -0
- judgeval-0.0.24/README.md +119 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/data_datasets.mdx +7 -24
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/data_examples.mdx +7 -53
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/custom_scorers.mdx +3 -3
- judgeval-0.0.24/docs/evaluation/scorers/groundedness.mdx +65 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/introduction.mdx +10 -23
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/getting_started.mdx +4 -8
- judgeval-0.0.24/docs/integration/langgraph.mdx +53 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/judgment/introduction.mdx +4 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/monitoring/tracing.mdx +1 -1
- {judgeval-0.0.22 → judgeval-0.0.24}/pyproject.toml +1 -1
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/tracer.py +48 -252
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/__init__.py +1 -2
- judgeval-0.0.24/src/judgeval/integrations/langgraph.py +316 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorer.py +2 -2
- judgeval-0.0.22/PKG-INFO +0 -40
- judgeval-0.0.22/README.md +0 -3
- judgeval-0.0.22/docs/integration/langgraph.mdx +0 -28
- judgeval-0.0.22/src/demo/cookbooks/JNPR_Mist/test.py +0 -21
- judgeval-0.0.22/src/demo/cookbooks/linkd/text2sql.py +0 -14
- judgeval-0.0.22/src/demo/custom_example_demo/qodo_example.py +0 -39
- judgeval-0.0.22/src/demo/custom_example_demo/test.py +0 -16
- judgeval-0.0.22/src/judgeval/data/custom_example.py +0 -98
- judgeval-0.0.22/src/judgeval/data/datasets/utils.py +0 -0
- judgeval-0.0.22/src/judgeval/data/ground_truth.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/.github/workflows/ci.yaml +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/.gitignore +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/LICENSE.md +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/Pipfile +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/Pipfile.lock +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/README.md +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/api_reference/judgment_client.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/api_reference/trace.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/development.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/code.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/images.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/markdown.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/navigation.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/reusable-snippets.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/settings.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/introduction.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/judges.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/answer_correctness.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/answer_relevancy.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/classifier_scorer.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/comparison.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/contextual_precision.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/contextual_recall.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/contextual_relevancy.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/execution_order.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/faithfulness.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/hallucination.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/json_correctness.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/summarization.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/unit_testing.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/favicon.svg +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/basic_trace_example.png +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/checks-passed.png +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/create_aggressive_scorer.png +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/create_scorer.png +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/evaluation_diagram.png +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/hero-dark.svg +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/hero-light.svg +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/online_eval_fault.png +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/trace_ss.png +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/introduction.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/logo/dark.svg +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/logo/light.svg +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/mint.json +1 -1
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/monitoring/introduction.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/monitoring/production_insights.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/create_dataset.ipynb +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/create_scorer.ipynb +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/demo.ipynb +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/prompt_scorer.ipynb +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/quickstart.ipynb +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/quickstart.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/snippets/snippet-intro.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/pytest.ini +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/clients.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/exceptions.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/logger.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/utils.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/constants.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/api_example.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/datasets/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/datasets/dataset.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/datasets/eval_dataset_client.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/example.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/result.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/scorer_data.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/evaluation_run.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/base_judge.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/litellm_judge.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/mixture_of_judges.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/together_judge.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/utils.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judgment_client.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/rules.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/run_evaluation.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/api_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/base_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/exceptions.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_correctness.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_relevancy.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/comparison.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_precision.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_recall.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_relevancy.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/execution_order.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/faithfulness.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/groundedness.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/hallucination.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/instruction_adherence.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/json_correctness.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/summarization.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/classifiers/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/classifiers/text2sql/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/classifiers/text2sql/text2sql_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/answer_correctness_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/answer_relevancy_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/comparison/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/comparison/comparison_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/comparison/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/contextual_precision_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/contextual_recall_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/contextual_relevancy_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/execution_order/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/execution_order/execution_order.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/faithfulness_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/hallucination_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/instruction_adherence/instruction_adherence.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/instruction_adherence/prompt.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/json_correctness/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/json_correctness/json_correctness_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/summarization_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/prompt_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/score.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/utils.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/tracer/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/utils/alerts.py +0 -0
judgeval-0.0.24/PKG-INFO
ADDED
@@ -0,0 +1,156 @@
+Metadata-Version: 2.4
+Name: judgeval
+Version: 0.0.24
+Summary: Judgeval Package
+Project-URL: Homepage, https://github.com/JudgmentLabs/judgeval
+Project-URL: Issues, https://github.com/JudgmentLabs/judgeval/issues
+Author-email: Andrew Li <andrew@judgmentlabs.ai>, Alex Shan <alex@judgmentlabs.ai>, Joseph Camyre <joseph@judgmentlabs.ai>
+License-Expression: Apache-2.0
+License-File: LICENSE.md
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Requires-Python: >=3.11
+Requires-Dist: anthropic
+Requires-Dist: fastapi
+Requires-Dist: langchain
+Requires-Dist: langchain-anthropic
+Requires-Dist: langchain-core
+Requires-Dist: langchain-huggingface
+Requires-Dist: langchain-openai
+Requires-Dist: litellm
+Requires-Dist: nest-asyncio
+Requires-Dist: openai
+Requires-Dist: openpyxl
+Requires-Dist: pandas
+Requires-Dist: pika
+Requires-Dist: python-dotenv==1.0.1
+Requires-Dist: requests
+Requires-Dist: supabase
+Requires-Dist: together
+Requires-Dist: uvicorn
+Provides-Extra: dev
+Requires-Dist: pytest-asyncio>=0.25.0; extra == 'dev'
+Requires-Dist: pytest-mock>=3.14.0; extra == 'dev'
+Requires-Dist: pytest>=8.3.4; extra == 'dev'
+Requires-Dist: tavily-python; extra == 'dev'
+Description-Content-Type: text/markdown
+
+# Judgeval SDK
+
+Judgeval is an open-source framework for building evaluation pipelines for multi-step agent workflows, supporting both real-time and experimental evaluation setups. To learn more about Judgment or sign up for free, visit our [website](https://www.judgmentlabs.ai/) or check out our [developer docs](https://judgment.mintlify.app/getting_started).
+
+## Features
+
+- **Development and Production Evaluation Layer**: Offers a robust evaluation layer for multi-step agent applications, including unit-testing and performance monitoring.
+- **Plug-and-Evaluate**: Integrate LLM systems with 10+ research-backed metrics, including:
+  - Hallucination detection
+  - RAG retriever quality
+  - And more
+- **Custom Evaluation Pipelines**: Construct powerful custom evaluation pipelines tailored for your LLM systems.
+- **Monitoring in Production**: Utilize state-of-the-art real-time evaluation foundation models to monitor LLM systems effectively.
+
+## Installation
+
+```bash
+pip install judgeval
+```
+
+## Quickstart: Evaluations
+
+You can evaluate your workflow execution data to measure quality metrics such as hallucination.
+
+Create a file named `evaluate.py` with the following code:
+
+```python
+from judgeval import JudgmentClient
+from judgeval.data import Example
+from judgeval.scorers import FaithfulnessScorer
+
+client = JudgmentClient()
+
+example = Example(
+    input="What if these shoes don't fit?",
+    actual_output="We offer a 30-day full refund at no extra cost.",
+    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
+)
+
+scorer = FaithfulnessScorer(threshold=0.5)
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[scorer],
+    model="gpt-4o",
+)
+print(results)
+```
+Click [here](https://judgment.mintlify.app/getting_started#create-your-first-experiment) for a more detailed explanation
+
+## Quickstart: Traces
+
+Track your workflow execution for full observability with just a few lines of code.
+
+Create a file named `traces.py` with the following code:
+
+```python
+from judgeval.common.tracer import Tracer, wrap
+from openai import OpenAI
+
+client = wrap(OpenAI())
+judgment = Tracer(project_name="my_project")
+
+@judgment.observe(span_type="tool")
+def my_tool():
+    return "Hello world!"
+
+@judgment.observe(span_type="function")
+def main():
+    task_input = my_tool()
+    res = client.chat.completions.create(
+        model="gpt-4o",
+        messages=[{"role": "user", "content": f"{task_input}"}]
+    )
+    return res.choices[0].message.content
+```
+Click [here](https://judgment.mintlify.app/getting_started#create-your-first-trace) for a more detailed explanation
+
+## Quickstart: Online Evaluations
+
+Apply performance monitoring to measure the quality of your systems in production, not just on historical data.
+
+Using the same traces.py file we created earlier:
+
+```python
+from judgeval.common.tracer import Tracer, wrap
+from judgeval.scorers import AnswerRelevancyScorer
+from openai import OpenAI
+
+client = wrap(OpenAI())
+judgment = Tracer(project_name="my_project")
+
+@judgment.observe(span_type="tool")
+def my_tool():
+    return "Hello world!"
+
+@judgment.observe(span_type="function")
+def main():
+    task_input = my_tool()
+    res = client.chat.completions.create(
+        model="gpt-4o",
+        messages=[{"role": "user", "content": f"{task_input}"}]
+    ).choices[0].message.content
+
+    judgment.get_current_trace().async_evaluate(
+        scorers=[AnswerRelevancyScorer(threshold=0.5)],
+        input=task_input,
+        actual_output=res,
+        model="gpt-4o"
+    )
+
+    return res
+```
+Click [here](https://judgment.mintlify.app/getting_started#create-your-first-online-evaluation) for a more detailed explanation
+
+## Documentation and Demos
+
+For more detailed documentation, please check out our [docs](https://judgment.mintlify.app/getting_started) and some of our [demo videos](https://www.youtube.com/@AlexShan-j3o) for reference!
+
+##
judgeval-0.0.24/README.md
ADDED
@@ -0,0 +1,119 @@
+# Judgeval SDK
+
+Judgeval is an open-source framework for building evaluation pipelines for multi-step agent workflows, supporting both real-time and experimental evaluation setups. To learn more about Judgment or sign up for free, visit our [website](https://www.judgmentlabs.ai/) or check out our [developer docs](https://judgment.mintlify.app/getting_started).
+
+## Features
+
+- **Development and Production Evaluation Layer**: Offers a robust evaluation layer for multi-step agent applications, including unit-testing and performance monitoring.
+- **Plug-and-Evaluate**: Integrate LLM systems with 10+ research-backed metrics, including:
+  - Hallucination detection
+  - RAG retriever quality
+  - And more
+- **Custom Evaluation Pipelines**: Construct powerful custom evaluation pipelines tailored for your LLM systems.
+- **Monitoring in Production**: Utilize state-of-the-art real-time evaluation foundation models to monitor LLM systems effectively.
+
+## Installation
+
+```bash
+pip install judgeval
+```
+
+## Quickstart: Evaluations
+
+You can evaluate your workflow execution data to measure quality metrics such as hallucination.
+
+Create a file named `evaluate.py` with the following code:
+
+```python
+from judgeval import JudgmentClient
+from judgeval.data import Example
+from judgeval.scorers import FaithfulnessScorer
+
+client = JudgmentClient()
+
+example = Example(
+    input="What if these shoes don't fit?",
+    actual_output="We offer a 30-day full refund at no extra cost.",
+    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
+)
+
+scorer = FaithfulnessScorer(threshold=0.5)
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[scorer],
+    model="gpt-4o",
+)
+print(results)
+```
+Click [here](https://judgment.mintlify.app/getting_started#create-your-first-experiment) for a more detailed explanation
+
+## Quickstart: Traces
+
+Track your workflow execution for full observability with just a few lines of code.
+
+Create a file named `traces.py` with the following code:
+
+```python
+from judgeval.common.tracer import Tracer, wrap
+from openai import OpenAI
+
+client = wrap(OpenAI())
+judgment = Tracer(project_name="my_project")
+
+@judgment.observe(span_type="tool")
+def my_tool():
+    return "Hello world!"
+
+@judgment.observe(span_type="function")
+def main():
+    task_input = my_tool()
+    res = client.chat.completions.create(
+        model="gpt-4o",
+        messages=[{"role": "user", "content": f"{task_input}"}]
+    )
+    return res.choices[0].message.content
+```
+Click [here](https://judgment.mintlify.app/getting_started#create-your-first-trace) for a more detailed explanation
+
+## Quickstart: Online Evaluations
+
+Apply performance monitoring to measure the quality of your systems in production, not just on historical data.
+
+Using the same traces.py file we created earlier:
+
+```python
+from judgeval.common.tracer import Tracer, wrap
+from judgeval.scorers import AnswerRelevancyScorer
+from openai import OpenAI
+
+client = wrap(OpenAI())
+judgment = Tracer(project_name="my_project")
+
+@judgment.observe(span_type="tool")
+def my_tool():
+    return "Hello world!"
+
+@judgment.observe(span_type="function")
+def main():
+    task_input = my_tool()
+    res = client.chat.completions.create(
+        model="gpt-4o",
+        messages=[{"role": "user", "content": f"{task_input}"}]
+    ).choices[0].message.content
+
+    judgment.get_current_trace().async_evaluate(
+        scorers=[AnswerRelevancyScorer(threshold=0.5)],
+        input=task_input,
+        actual_output=res,
+        model="gpt-4o"
+    )
+
+    return res
+```
+Click [here](https://judgment.mintlify.app/getting_started#create-your-first-online-evaluation) for a more detailed explanation
+
+## Documentation and Demos
+
+For more detailed documentation, please check out our [docs](https://judgment.mintlify.app/getting_started) and some of our [demo videos](https://www.youtube.com/@AlexShan-j3o) for reference!
+
+##
{judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/data_datasets.mdx
CHANGED
@@ -3,19 +3,14 @@ title: Datasets
 ---
 ## Overview
 In most scenarios, you will have multiple `Example`s that you want to evaluate together.
-In `judgeval`, an evaluation dataset (`EvalDataset`) is a collection of `Example`s
+In `judgeval`, an evaluation dataset (`EvalDataset`) is a collection of `Example`s that you can scale evaluations across.
 
-<Note>
-A `GroundTruthExample` is a specific type of `Example` that do not require the `actual_output` field.
-
-This is useful for creating datasets that can be **dynamically updated at evaluation time** by running your workflow on the GroundTruthExamples to create Examples.
-</Note>
 ## Creating a Dataset
 
-Creating an `EvalDataset` is as simple as supplying a list of `Example`s
+Creating an `EvalDataset` is as simple as supplying a list of `Example`s.
 
 ```python create_dataset.py
-from judgeval.data import Example
+from judgeval.data import Example
 from judgeval.data.datasets import EvalDataset
 
 examples = [
@@ -23,25 +18,19 @@ examples = [
     Example(input="...", actual_output="..."),
     ...
 ]
-
-    GroundTruthExample(input="..."),
-    GroundTruthExample(input="..."),
-    ...
-]
+
 
 dataset = EvalDataset(
-    examples=examples
-    ground_truth_examples=ground_truth_examples
+    examples=examples
 )
 ```
 
-You can also add `Example`s
+You can also add `Example`s to an existing `EvalDataset` using the `add_example` method.
 
 ```python add_to_dataset.py
 ...
 
 dataset.add_example(Example(...))
-dataset.add_ground_truth(GroundTruthExample(...))
 ```
 
 ## Saving/Loading Datasets
@@ -81,12 +70,6 @@ You can save/load an `EvalDataset` with a JSON file. Your JSON file should have
             "actual_output": "..."
         },
         ...
-    ],
-    "ground_truths": [
-        {
-            "input": "..."
-        },
-        ...
     ]
 }
 ```
@@ -154,7 +137,7 @@ examples:
 
 ## Evaluate On Your Dataset
 
-You can use the `JudgmentClient` to evaluate the `Example`s
+You can use the `JudgmentClient` to evaluate the `Example`s in your dataset using scorers.
 
 ```python evaluate_dataset.py
 ...
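For context on the `evaluate_dataset.py` snippet elided above (`...`), here is a minimal sketch of how the pieces shown in this file might fit together. The only assumption beyond what the diffed docs show is that the dataset's `Example`s are reachable through a `dataset.examples` attribute; the exact dataset-evaluation API is not part of this diff.

```python
# Hypothetical sketch; `dataset.examples` is an assumption, everything else
# (EvalDataset, add_example, run_evaluation) appears in the docs diffed above.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.data.datasets import EvalDataset
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

dataset = EvalDataset(
    examples=[
        Example(
            input="What if these shoes don't fit?",
            actual_output="We offer a 30-day full refund at no extra cost.",
            retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
        ),
    ]
)
# Add more examples to an existing dataset.
dataset.add_example(Example(input="Do you ship internationally?", actual_output="Yes, to most countries."))

results = client.run_evaluation(
    examples=dataset.examples,  # assumption: the collected Examples are exposed here
    scorers=[FaithfulnessScorer(threshold=0.5)],
    model="gpt-4o",
)
print(results)
```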
{judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/data_examples.mdx
CHANGED
@@ -4,14 +4,12 @@ title: Examples
 
 ## Overview
 An `Example` is a basic unit of data in `judgeval` that allows you to run evaluation scorers on your LLM system.
-An `Example` is composed of
-- `input`
-- `actual_output`
-- [Optional]
-- [Optional]
-- [Optional]
-- [Optional] `tools_called`
-- [Optional] `expected_tools`
+An `Example` can be composed of a mixture of the following fields:
+- `input` [Optional]
+- `actual_output` [Optional]
+- `expected_output` [Optional]
+- `retrieval_context` [Optional]
+- `context` [Optional]
 
 **Here's a sample of creating an `Example`:**
 
@@ -24,8 +22,6 @@ example = Example(
     expected_output="Bill Gates and Paul Allen founded Microsoft in New Mexico in 1975.",
     retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
     context=["Bill Gates and Paul Allen are the founders of Microsoft."],
-    tools_called=["Google Search"],
-    expected_tools=["Google Search", "Perplexity"],
 )
 ```
 
@@ -39,7 +35,7 @@ Other fields are optional and depend on the type of evaluation. If you want to d
 
 ## Example Fields
 
-Here, we cover the
+Here, we cover the possible fields that make up an `Example`.
 
 ### Input
 The `input` field represents a sample interaction between a user and your LLM system. The input should represent the direct input to your prompt template(s), and **SHOULD NOT CONTAIN** your prompt template itself.
@@ -137,48 +133,6 @@ example = Example(
 )
 ```
 
-<Note>
-`context` is the ideal retrieval result for a specific `input`, whereas `retrieval_context` is the actual retrieval result at runtime. While they are similar, they are not always interchangeable.
-</Note>
-### Tools Called
-
-The `tools_called` field is `Optional[List[str]]` and represents the tools that were called by the LLM system. This is particularly useful for evaluating whether agents are properly using tools available to them.
-
-```python tools_called.py
-# Sample app implementation
-import medical_chatbot
-
-question = "Is sparkling water healthy?"
-example = Example(
-    input=question,
-    actual_output=medical_chatbot.chat(question),
-    expected_output="Sparkling water is neither healthy nor unhealthy.",
-    context=["Sparkling water is a type of water that is carbonated."],
-    retrieval_context=["Sparkling water is carbonated and has no calories."],
-    tools_called=["Perplexity", "GoogleSearch"]
-)
-```
-
-### Expected Tools
-
-The `expected_tools` field is `Optional[List[str]]` and represents the tools that are expected to be called by the LLM system. This is particularly useful for evaluating whether agents are properly using tools available to them.
-
-```python expected_tools.py
-# Sample app implementation
-import medical_chatbot
-
-question = "Is sparkling water healthy?"
-example = Example(
-    input=question,
-    actual_output=medical_chatbot.chat(question),
-    expected_output="Sparkling water is neither healthy nor unhealthy.",
-    context=["Sparkling water is a type of water that is carbonated."],
-    retrieval_context=["Sparkling water is carbonated and has no calories."],
-    tools_called=["Perplexity", "GoogleSearch"],
-    expected_tools=["Perplexity", "DBQuery"]
-)
-```
-
 ## Conclusion
 
 Congratulations! 🎉
{judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/custom_scorers.mdx
CHANGED
@@ -116,9 +116,9 @@ class SampleScorer(JudgevalScorer):
 ```
 
 
-### 4. Implement the `
+### 4. Implement the `_success_check()` method
 
-When executing an evaluation run, `judgeval` will check if your scorer has passed the `
+When executing an evaluation run, `judgeval` will check if your scorer has passed the `_success_check()` method.
 
 You can implement this method in any way you want, but **it should return a `bool`.** Here's a perfectly valid implementation:
 
@@ -126,7 +126,7 @@ You can implement this method in any way you want, but **it should return a `boo
 class SampleScorer(JudgevalScorer):
     ...
 
-    def
+    def _success_check(self):
         if self.error is not None:
             return False
         return self.score >= self.threshold # or you can do self.success if set
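For reference alongside the hunks above, here is a minimal end-to-end sketch of a custom scorer that wires together the three methods named in this file (`score_example()`, `a_score_example()`, `_success_check()`). The base-class details are assumptions: the diff only confirms that `JudgevalScorer` subclasses use `self.score`, `self.threshold`, `self.error`, and `self.success`, and that the class lives in `src/judgeval/scorers/judgeval_scorer.py`.

```python
# Hypothetical sketch of a custom scorer; the import path and the idea that the
# base class stores `threshold` and initializes `error` are assumptions.
from judgeval.data import Example
from judgeval.scorers import JudgevalScorer


class LengthRatioScorer(JudgevalScorer):
    """Scores how closely the actual output's length matches the expected output's length."""

    def score_example(self, example: Example) -> float:
        try:
            expected_len = len(example.expected_output or "")
            actual_len = len(example.actual_output or "")
            # Ratio in [0, 1]; 1.0 means the two lengths match exactly.
            self.score = min(expected_len, actual_len) / max(expected_len, actual_len, 1)
            self.success = self.score >= self.threshold
        except Exception as e:
            self.error = str(e)
            self.success = False
        return self.score

    async def a_score_example(self, example: Example) -> float:
        # Same logic as the sync version, which the docs above say is acceptable.
        return self.score_example(example)

    def _success_check(self) -> bool:
        if self.error is not None:
            return False
        return self.score >= self.threshold
```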
judgeval-0.0.24/docs/evaluation/scorers/groundedness.mdx
ADDED
@@ -0,0 +1,65 @@
+---
+title: Groundedness
+description: ""
+---
+
+The `Groundedness` scorer is a default LLM judge scorer that measures whether the `actual_output` is aligned with both the task instructions in `input` and the knowledge base in `retrieval_context`.
+In practice, this scorer helps determine if your RAG pipeline's generator is producing hallucinations or misinterpreting task instructions.
+
+**For optimal Groundedness scoring, check out our leading evaluation foundation model research here! TODO add link here.**
+
+<Note>
+The `Groundedness` scorer is a binary metric (1 or 0) that evaluates both instruction adherence and factual accuracy.
+
+Unlike the `Faithfulness` scorer which measures the degree of contradiction with retrieval context, `Groundedness` provides a pass/fail assessment based on both the task instructions and knowledge base.
+</Note>
+
+## Required Fields
+
+To run the `Groundedness` scorer, you must include the following fields in your `Example`:
+- `input`
+- `actual_output`
+- `retrieval_context`
+
+## Scorer Breakdown
+
+`Groundedness` scores are binary (1 or 0) and determined by checking:
+1. Whether the `actual_output` correctly interprets the task instructions in `input`
+2. Whether the `actual_output` contains any contradictions with the knowledge base in `retrieval_context`
+
+A response is considered grounded (score = 1) only if it:
+- Correctly follows the task instructions
+- Does not contradict any information in the knowledge base
+- Does not introduce hallucinated facts not supported by the retrieval context
+
+If there are any contradictions or misinterpretations, the scorer will fail (score = 0).
+
+## Sample Implementation
+
+```python groundedness.py
+from judgeval import JudgmentClient
+from judgeval.data import Example
+from judgeval.scorers import GroundednessScorer
+
+client = JudgmentClient()
+example = Example(
+    input="You are a helpful assistant for a clothing store. Make sure to follow the company's policies surrounding returns.",
+    actual_output="We offer a 30-day return policy for all items, including socks!",
+    retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."]
+)
+scorer = GroundednessScorer()
+
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[scorer],
+    model="gpt-4o",
+)
+print(results)
+```
+
+<Note>
+The `Groundedness` scorer uses an LLM judge, so you'll receive a reason for the score in the `reason` field of the results.
+This allows you to double-check the accuracy of the evaluation and understand how the score was calculated.
+</Note>
+
+
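Since the note in this new page contrasts `Groundedness` with `Faithfulness`, a short sketch of running both scorers over the same example may be a useful companion. It only combines APIs already shown in this diff, with the assumption that `run_evaluation` accepts multiple scorers in one call.

```python
# Runs the binary Groundedness scorer alongside the threshold-based Faithfulness
# scorer on the same example; passing two scorers at once is an assumption.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer, GroundednessScorer

client = JudgmentClient()

example = Example(
    input="You are a helpful assistant for a clothing store. Make sure to follow the company's policies surrounding returns.",
    actual_output="We offer a 30-day return policy for all items, including socks!",
    retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."],
)

results = client.run_evaluation(
    examples=[example],
    scorers=[GroundednessScorer(), FaithfulnessScorer(threshold=0.5)],  # pass/fail vs. thresholded degree
    model="gpt-4o",
)
print(results)  # LLM-judge scorers also return a `reason` explaining the score
```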
{judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/introduction.mdx
CHANGED
@@ -12,11 +12,14 @@ Scorers act as measurement tools for evaluating LLM systems based on specific cr
 - [Contextual Precision](/evaluation/scorers/contextual_precision)
 - [Contextual Recall](/evaluation/scorers/contextual_recall)
 - [Contextual Relevancy](/evaluation/scorers/contextual_relevancy)
+- [Execution Order](/evaluation/scorers/execution_order)
 - [Faithfulness](/evaluation/scorers/faithfulness)
 - [Hallucination](/evaluation/scorers/hallucination)
-- [Summarization](/evaluation/scorers/summarization)
-- [Execution Order](/evaluation/scorers/execution_order)
 - [JSON Correctness](/evaluation/scorers/json_correctness)
+- [Summarization](/evaluation/scorers/summarization)
+
+We also understand that you may need to evaluate your LLM system with metrics that are not covered by our default scorers.
+To support this, we provide a flexible framework for creating these scorers:
 - [Custom Scorers](/evaluation/scorers/custom_scorers)
 - [Classifier Scorers](/evaluation/scorers/classifier_scorer)
 
@@ -24,13 +27,9 @@ Scorers act as measurement tools for evaluating LLM systems based on specific cr
 We're always adding new scorers to `judgeval`. If you have a suggestion, please [let us know](mailto:contact@judgmentlabs.ai)!
 </Tip>
 
-Scorers execute on `Example`s
+Scorers execute on `Example`s and `EvalDataset`s, producing a **numerical score**.
 This enables you to **use evaluations as unit tests** by setting a `threshold` to determine whether an evaluation was successful or not.
 
-<Note>
-Built-in scorers will succeed if the score is greater than or equal to the `threshold`.
-</Note>
-
 ## Categories of Scorers
 `judgeval` supports three categories of scorers.
 - **Default Scorers**: built-in scorers that are ready to use
@@ -57,17 +56,17 @@ If you find that none of the default scorers meet your evaluation needs, setting
 You can create a custom scorer by inheriting from the `JudgevalScorer` class and implementing three methods:
 - `score_example()`: produces a score for a single `Example`.
 - `a_score_example()`: async version of `score_example()`. You may use the same implementation logic as `score_example()`.
-- `
+- `_success_check()`: determines whether an evaluation was successful.
 
 Custom scorers can be as simple or complex as you want, and **do not need to use LLMs**.
-For sample implementations, check out the
+For sample implementations, check out the [Custom Scorers](/evaluation/scorers/custom_scorers) documentation page.
 
 
 ### Classifier Scorers
 
 Classifier scorers are a special type of custom scorer that can evaluate your LLM system using natural language criteria.
 
-
+They can either be defined using our judgeval SDK or using the Judgment Platform directly. For more information, check out the [Classifier Scorers](/evaluation/scorers/classifier_scorer) documentation page.
 
 ## Running Scorers
 
@@ -80,22 +79,10 @@ client = JudgmentClient()
 results = client.run_evaluation(
     examples=[example],
     scorers=[scorer],
-    model="gpt-4o
+    model="gpt-4o",
 )
 ```
 
-If you want to execute a `JudgevalScorer` without running it through the `JudgmentClient`, you can score locally.
-Simply use the `score_example()` or `a_score_example()` method directly:
-
-```python direct_scoring.py
-...
-
-example = Example(input="...", actual_output="...")
-
-scorer = JudgevalScorer() # Your scorer here
-score = scorer.score_example(example)
-```
-
 <Tip>
 To learn about how a certain default scorer works, check out its documentation page for a deep dive into how scores are calculated and what fields are required.
 </Tip>
{judgeval-0.0.22 → judgeval-0.0.24}/docs/getting_started.mdx
CHANGED
@@ -62,7 +62,7 @@ Congratulations! Your evaluation should have passed. Let's break down what happe
 - The variable `input` mimics a user input and `actual_output` is a placeholder for what your LLM system returns based on the input.
 - The variable `retrieval_context` represents the retrieved context from your RAG knowledge base.
 - `FaithfulnessScorer(threshold=0.5)` is a scorer that checks if the output is hallucinated relative to the retrieved context.
-  - <Note>
+  - <Note>The threshold is used in the context of [unit testing](/evaluation/unit_testing).</Note>
 - We chose `gpt-4o` as our judge model to measure faithfulness. Judgment Labs offers ANY judge model for your evaluation needs.
 Consider trying out our state-of-the-art Osiris judge models for your next evaluation!
 
@@ -142,7 +142,7 @@ def main():
         messages=[{"role": "user", "content": f"{task_input}"}]
     ).choices[0].message.content
 
-    judgment.
+    judgment.async_evaluate(
         scorers=[AnswerRelevancyScorer(threshold=0.5)],
         input=task_input,
         actual_output=res,
@@ -280,14 +280,10 @@ results = client.run_evaluation(
 # Create Your First Dataset
 In most cases, you will not be running evaluations on a single example; instead, you will be scoring your LLM system on a dataset.
 Judgeval allows you to create datasets, save them, and run evaluations on them.
-An `EvalDataset` is a collection of `Example`s
-
-<Note>
-A `GroundTruthExample` is an `Example` that has no `actual_output` field since it will be generated at test time.
-</Note>
+An `EvalDataset` is a collection of `Example`s.
 
 ```python create_dataset.py
-from judgeval.data import Example,
+from judgeval.data import Example, EvalDataset
 
 example1 = Example(input="...", actual_output="...")
 example2 = Example(input="...", actual_output="...")