judgeval 0.0.3__tar.gz → 0.0.4__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {judgeval-0.0.3 → judgeval-0.0.4}/.github/workflows/ci.yaml +1 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/.gitignore +1 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/PKG-INFO +1 -1
- {judgeval-0.0.3 → judgeval-0.0.4}/Pipfile +3 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/introduction.mdx +18 -20
- judgeval-0.0.4/docs/evaluation/scorers/answer_correctness.mdx +56 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/scorers/answer_relevancy.mdx +1 -1
- judgeval-0.0.4/docs/evaluation/scorers/classifier_scorer.mdx +90 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/scorers/contextual_precision.mdx +1 -1
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/scorers/contextual_recall.mdx +1 -1
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/scorers/contextual_relevancy.mdx +1 -1
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/scorers/faithfulness.mdx +3 -4
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/scorers/hallucination.mdx +3 -4
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/scorers/json_correctness.mdx +3 -4
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/scorers/summarization.mdx +3 -4
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/scorers/tool_correctness.mdx +3 -4
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/getting_started.mdx +31 -46
- judgeval-0.0.4/docs/images/trace_screenshot.png +0 -0
- judgeval-0.0.4/docs/judgment/introduction.mdx +7 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/mint.json +9 -4
- judgeval-0.0.4/docs/monitoring/tracing.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/pyproject.toml +1 -1
- judgeval-0.0.4/src/demo/cookbooks/langchain_basic_rag/basic_agentic_rag.ipynb +781 -0
- judgeval-0.0.4/src/demo/cookbooks/langchain_basic_rag/tesla_q3.pdf +0 -0
- judgeval-0.0.4/src/demo/cookbooks/langchain_sales/example_product_price_id_mapping.json +1 -0
- judgeval-0.0.4/src/demo/cookbooks/langchain_sales/sales_agent_with_context.ipynb +1375 -0
- judgeval-0.0.4/src/demo/cookbooks/langchain_sales/sample_product_catalog.txt +20 -0
- judgeval-0.0.4/src/demo/cookbooks/openai_travel_agent/agent.py +208 -0
- judgeval-0.0.4/src/demo/cookbooks/openai_travel_agent/populate_db.py +73 -0
- judgeval-0.0.4/src/judgeval/__init__.py +12 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/common/tracer.py +57 -31
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/constants.py +1 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/data/__init__.py +2 -1
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/data/scorer_data.py +2 -2
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/evaluation_run.py +16 -15
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/judges/__init__.py +2 -2
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/judges/base_judge.py +1 -1
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/judges/litellm_judge.py +2 -2
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/judges/mixture_of_judges.py +2 -2
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/judges/together_judge.py +2 -2
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/judges/utils.py +4 -4
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/judgment_client.py +67 -15
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/run_evaluation.py +79 -14
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/scorers/__init__.py +8 -4
- judgeval-0.0.4/src/judgeval/scorers/api_scorer.py +64 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/scorers/base_scorer.py +3 -2
- judgeval-0.0.4/src/judgeval/scorers/exceptions.py +11 -0
- judgeval-0.0.3/src/judgeval/scorers/custom_scorer.py → judgeval-0.0.4/src/judgeval/scorers/judgeval_scorer.py +9 -5
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/__init__.py +144 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/api_scorers/__init__.py +23 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_correctness.py +19 -0
- {judgeval-0.0.3/src/judgeval/scorers/judgeval_scorers → judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/api_scorers}/answer_relevancy.py +2 -2
- {judgeval-0.0.3/src/judgeval/scorers/judgeval_scorers → judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/api_scorers}/contextual_precision.py +2 -2
- {judgeval-0.0.3/src/judgeval/scorers/judgeval_scorers → judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/api_scorers}/contextual_recall.py +2 -2
- {judgeval-0.0.3/src/judgeval/scorers/judgeval_scorers → judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/api_scorers}/contextual_relevancy.py +2 -2
- {judgeval-0.0.3/src/judgeval/scorers/judgeval_scorers → judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/api_scorers}/faithfulness.py +2 -2
- {judgeval-0.0.3/src/judgeval/scorers/judgeval_scorers → judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/api_scorers}/hallucination.py +2 -2
- {judgeval-0.0.3/src/judgeval/scorers/judgeval_scorers → judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/api_scorers}/json_correctness.py +7 -7
- {judgeval-0.0.3/src/judgeval/scorers/judgeval_scorers → judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/api_scorers}/summarization.py +2 -2
- {judgeval-0.0.3/src/judgeval/scorers/judgeval_scorers → judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/api_scorers}/tool_correctness.py +2 -2
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/__init__.py +24 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/__init__.py +4 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/answer_correctness_scorer.py +272 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/prompts.py +169 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/__init__.py +4 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/answer_relevancy_scorer.py +292 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/prompts.py +174 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/__init__.py +3 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/contextual_precision_scorer.py +259 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/prompts.py +106 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/__init__.py +3 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/contextual_recall_scorer.py +249 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/prompts.py +142 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/__init__.py +3 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/contextual_relevancy_scorer.py +240 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/prompts.py +121 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/__init__.py +3 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/faithfulness_scorer.py +318 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/prompts.py +265 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/__init__.py +3 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/hallucination_scorer.py +258 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/prompts.py +104 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/json_correctness/json_correctness_scorer.py +127 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/__init__.py +3 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/prompts.py +247 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/summarization_scorer.py +541 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/tool_correctness/__init__.py +3 -0
- judgeval-0.0.4/src/judgeval/scorers/judgeval_scorers/local_implementations/tool_correctness/tool_correctness_scorer.py +151 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/scorers/prompt_scorer.py +4 -4
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/scorers/score.py +14 -14
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/scorers/utils.py +40 -6
- judgeval-0.0.3/src/judgeval/__init__.py +0 -83
- judgeval-0.0.3/src/judgeval/scorers/judgeval_scorers/__init__.py +0 -21
- {judgeval-0.0.3 → judgeval-0.0.4}/LICENSE.md +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/README.md +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/README.md +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/development.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/essentials/code.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/essentials/images.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/essentials/markdown.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/essentials/navigation.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/essentials/reusable-snippets.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/essentials/settings.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/data_datasets.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/data_examples.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/judges.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/scorers/custom_scorers.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/evaluation/scorers/introduction.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/favicon.svg +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/images/checks-passed.png +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/images/create_aggressive_scorer.png +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/images/create_scorer.png +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/images/evaluation_diagram.png +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/images/hero-dark.svg +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/images/hero-light.svg +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/introduction.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/logo/dark.svg +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/logo/light.svg +0 -0
- {judgeval-0.0.3/docs/judgment → judgeval-0.0.4/docs/monitoring}/introduction.mdx +0 -0
- /judgeval-0.0.3/docs/evaluation/scorers/classifier_scorer.mdx → /judgeval-0.0.4/docs/monitoring/production_insights.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/notebooks/create_dataset.ipynb +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/notebooks/create_scorer.ipynb +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/notebooks/demo.ipynb +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/notebooks/prompt_scorer.ipynb +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/notebooks/quickstart.ipynb +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/quickstart.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/docs/snippets/snippet-intro.mdx +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/pytest.ini +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/clients.py +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/common/__init__.py +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/common/exceptions.py +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/common/logger.py +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/common/utils.py +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/data/api_example.py +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/data/datasets/__init__.py +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/data/datasets/dataset.py +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/data/datasets/ground_truth.py +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/data/datasets/utils.py +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/data/example.py +0 -0
- {judgeval-0.0.3 → judgeval-0.0.4}/src/judgeval/data/result.py +0 -0
@@ -8,25 +8,6 @@ Evaluation is the process of **scoring** an LLM system's outputs with metrics; a
 - An evaluation dataset
 - Metrics we are interested in tracking
 
-The ideal fit of evaluation into an application workflow looks like this:
-
-
-
-## Metrics
-
-`judgeval` comes with a set of 10+ built-in evaluation metrics. These metrics are accessible through `judgeval`'s `Scorer` interface.
-Every `Scorer` has a `threshold` parameter that you can use in the context of unit testing your app.
-
-```python scorer.py
-from judgeval.scorers import FaithfulnessScorer
-
-scorer = FaithfulnessScorer(threshold=1.0)
-```
-You can use scorers to evaluate your LLM system's outputs by using `Example`s.
-
-<Tip>
-We're always working on adding new scorers, so if you have a metric you'd like to add, please [let us know!](mailto:contact@judgmentlabs.ai)
-</Tip>
 
 ## Examples
 
@@ -54,7 +35,7 @@ Creating an Example allows you to evaluate using
 `judgeval`'s default scorers:
 
 ```python example.py
-from judgeval
+from judgeval import JudgmentClient
 from judgeval.scorers import FaithfulnessScorer
 
 client = JudgmentClient()
@@ -102,6 +83,23 @@ results = client.evaluate_dataset(
 )
 ```
 
+## Metrics
+
+`judgeval` comes with a set of 10+ built-in evaluation metrics. These metrics are accessible through `judgeval`'s `Scorer` interface.
+Every `Scorer` has a `threshold` parameter that you can use in the context of unit testing your app.
+
+```python scorer.py
+from judgeval.scorers import FaithfulnessScorer
+
+scorer = FaithfulnessScorer(threshold=1.0)
+```
+You can use scorers to evaluate your LLM system's outputs by using `Example`s.
+
+<Tip>
+We're always working on adding new scorers, so if you have a metric you'd like to add, please [let us know!](mailto:contact@judgmentlabs.ai)
+</Tip>
+
+
 **Congratulations!** 🎉
 
 You've learned the basics of building and running evaluations with `judgeval`.
@@ -0,0 +1,56 @@
+---
+title: Answer Correctness
+description: ""
+---
+
+The answer correctness scorer is a default LLM judge scorer that measures how correct/consistent the LLM system's `actual_output` is to the `expected_output`.
+In practice, this scorer helps determine whether your LLM application produces **answers that are consistent with golden/ground truth answers**.
+
+
+## Required Fields
+
+To run the answer correctness scorer, you must include the following fields in your `Example`:
+- `input`
+- `actual_output`
+- `expected_output`
+
+## Scorer Breakdown
+
+`AnswerCorrectness` scores are calculated by extracting statements made in the `expected_output` and classifying how many are consistent/correct with respect to the `actual_output`.
+
+The score is calculated as:
+
+$$
+\text{correctness score} = \frac{\text{correct statements}}{\text{total statements}}
+$$
+
+## Sample Implementation
+
+```python answer_correctness.py
+from judgeval import JudgmentClient
+from judgeval.data import Example
+from judgeval.scorers import AnswerCorrectnessScorer
+
+client = JudgmentClient()
+example = Example(
+    input="What's your return policy for a pair of socks?",
+    # Replace this with your LLM system's output
+    actual_output="We offer a 30-day return policy for all items, including socks!",
+    # Replace this with your golden/ground truth answer
+    expected_output="Socks can be returned within 30 days of purchase.",
+)
+# supply your own threshold
+scorer = AnswerCorrectnessScorer(threshold=0.8)
+
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[scorer],
+    model="gpt-4o",
+)
+print(results)
+```
+
+<Note>
+The `AnswerCorrectness` scorer uses an LLM judge, so you'll receive a reason for the score in the `reason` field of the results.
+This allows you to double-check the accuracy of the evaluation and understand how the score was calculated.
+</Note>
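
Editor's note: to make the correctness formula above concrete, here is a minimal illustrative sketch in plain Python. It is not part of the package and does not reproduce the scorer's internal implementation; the statements and their labels are placeholders.

```python
# Illustrative only: applies correct / total and the threshold check described above.
statements = [
    ("Socks can be returned within 30 days of purchase.", True),  # judged consistent with actual_output
]

correct = sum(1 for _, is_correct in statements if is_correct)
score = correct / len(statements)  # correctness score = correct statements / total statements
threshold = 0.8                    # same threshold passed to AnswerCorrectnessScorer

print(f"score={score:.2f}, passes={score >= threshold}")
```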
@@ -0,0 +1,90 @@
+---
+title: Classifier Scorers
+description: ""
+---
+
+A `ClassifierScorer` is a powerful tool for evaluating your LLM system using natural language criteria.
+Classifier scorers are great for prototyping new evaluation criteria on a small set of examples before using them to benchmark your workflows at scale.
+
+## Creating a Classifier Scorer
+
+### `judgeval` SDK
+
+You can create a `ClassifierScorer` by providing a natural language description of your evaluation task/criteria and a set of choices that an LLM judge can choose from when evaluating an example.
+Here's an example of creating a `ClassifierScorer` that determines if a response is friendly or not:
+
+```python friendliness_scorer.py
+
+from judgeval.scorers import ClassifierScorer
+
+friendliness_scorer = ClassifierScorer(
+    name="Friendliness Scorer",
+    threshold=1.0,
+    conversation=[
+        {
+            "role": "system",
+            "content": "Is the response positive (Y/N)? The response is: {{actual_output}}."
+        }
+    ],
+    options={"Y": 1, "N": 0}
+)
+```
+
+<Tip>
+You can put variables from [`Example`s](/evaluation/data_examples) into your `conversation` by using the mustache `{{variable_name}}` syntax.
+</Tip>
+
+### `Judgment` Platform
+
+1. Navigate to the `Scorers` tab in the Judgment platform. You'll find this via the sidebar on the left.
+2. Click the `Create Scorer` button in the top right corner.
+
+
+
+3. Here, you can create a custom scorer by using criteria in natural language, supplying custom arguments from the [`Example`](evaluation/data_examples) class.
+Then, you supply a set of **choices** the scorer can select from when evaluating an example. Finally, you can test your scorer on samples in our playground.
+
+4. Once you're finished, you can save the scorer and use it in your evaluation runs just like any other scorer in `judgeval`.
+
+#### Example
+
+Here's an example of building a similar `ClassifierScorer` that checks if the LLM's tone is too aggressive.
+
+
+
+
+## Using a Classifier Scorer
+
+Classifier scorers can be used in the same way as any other scorer in `judgeval`.
+They can also be run in conjunction with other scorers in a single evaluation run!
+
+```python run_classifier_scorer.py
+...
+
+results = client.run_evaluation(
+    examples=[example1],
+    scorers=[friendliness_scorer],
+    model="gpt-4o"
+)
+```
+
+### Saving Classifier Scorers
+
+Whether you create a `ClassifierScorer` via the `judgeval` SDK or the Judgment platform, you can save it to the `Judgment` platform for reuse in future evaluations.
+- If you create a `ClassifierScorer` via the `judgeval` SDK, you can save it by calling `client.push_classifier_scorer()`.
+- Similarly, you can load a `ClassifierScorer` by calling `client.fetch_classifier_scorer()`.
+- Each `ClassifierScorer` has a **unique slug** that you can use to identify it.
+
+```python
+from judgeval import JudgmentClient
+
+client = JudgmentClient()
+
+# Saving a ClassifierScorer from SDK to platform
+friendliness_slug = client.push_classifier_scorer(friendliness_scorer)
+
+# Loading a ClassifierScorer from platform to SDK
+classifier_scorer = client.fetch_classifier_scorer("classifier-scorer-slug")
+```
+
+TODO add image of slugs on the platform
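
Editor's note: as a rough end-to-end sketch of the save/load workflow documented in the hunk above, a fetched `ClassifierScorer` can be passed to `run_evaluation` like any other scorer. The slug and example fields below are placeholders, not values from the package.

```python
from judgeval import JudgmentClient
from judgeval.data import Example

client = JudgmentClient()

# Hypothetical slug for a ClassifierScorer previously pushed to the platform
friendliness_scorer = client.fetch_classifier_scorer("friendliness-scorer-slug")

example = Example(
    input="Can I return these socks?",
    actual_output="Of course! We'd be happy to help with a return within 30 days.",
)

results = client.run_evaluation(
    examples=[example],
    scorers=[friendliness_scorer],
    model="gpt-4o",
)
print(results)
```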
@@ -42,7 +42,7 @@ Our contextual precision scorer is based on Stanford NLP's [ARES](https://arxiv.
 ## Sample Implementation
 
 ```python contextual_precision.py
-from judgeval
+from judgeval import JudgmentClient
 from judgeval.data import Example
 from judgeval.scorers import ContextualPrecisionScorer
 
@@ -41,7 +41,7 @@ Our contextual recall scorer is based on Stanford NLP's [ARES](https://arxiv.org
 ## Sample Implementation
 
 ```python contextual_recall.py
-from judgeval
+from judgeval import JudgmentClient
 from judgeval.data import Example
 from judgeval.scorers import ContextualRecallScorer
 
@@ -31,7 +31,7 @@ Our contextual relevancy scorer is based on Stanford NLP's [ARES](https://arxiv.
 ## Sample Implementation
 
 ```python contextual_relevancy.py
-from judgeval
+from judgeval import JudgmentClient
 from judgeval.data import Example
 from judgeval.scorers import ContextualRelevancyScorer
 
@@ -37,10 +37,9 @@ $$
 ## Sample Implementation
 
 ```python faithfulness.py
-from judgeval
+from judgeval import JudgmentClient
 from judgeval.data import Example
-from judgeval.scorers import
-from judgeval.constants import APIScorer
+from judgeval.scorers import FaithfulnessScorer
 
 client = JudgmentClient()
 example = Example(
@@ -51,7 +50,7 @@ example = Example(
     retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."]
 )
 # supply your own threshold
-scorer =
+scorer = FaithfulnessScorer(threshold=0.8)
 
 results = client.run_evaluation(
     examples=[example],
@@ -30,10 +30,9 @@ $$
 ## Sample Implementation
 
 ```python hallucination.py
-from judgeval
+from judgeval import JudgmentClient
 from judgeval.data import Example
-from judgeval.scorers import
-from judgeval.constants import APIScorer
+from judgeval.scorers import HallucinationScorer
 
 client = JudgmentClient()
 example = Example(
@@ -44,7 +43,7 @@ example = Example(
     context=["**RETURN POLICY** all products returnable with no cost for 30-days after purchase (receipt required)."]
 )
 # supply your own threshold
-scorer =
+scorer = HallucinationScorer(threshold=0.8)
 
 results = client.run_evaluation(
     examples=[example],
@@ -35,17 +35,16 @@ $$
 ## Sample Implementation
 
 ```python json_correctness.py
-from judgeval
+from judgeval import JudgmentClient
 from judgeval.data import Example
-from judgeval.scorers import
-from judgeval.constants import APIScorer
+from judgeval.scorers import JSONCorrectnessScorer
 client = JudgmentClient()
 example = Example(
     input="Create a JSON object with the keys 'field1' (str) and 'field2' (int). Fill them with random values.",
     # Replace this with your LLM system's output
     actual_output="{'field1': 'value1', 'field2': 1}",
 )
-scorer =
+scorer = JSONCorrectnessScorer(threshold=0.8)
 results = client.run_evaluation(
     examples=[example],
     scorers=[scorer],
@@ -40,10 +40,9 @@ $$
 ## Sample Implementation
 
 ```python summarization.py
-from judgeval
+from judgeval import JudgmentClient
 from judgeval.data import Example
-from judgeval.scorers import
-from judgeval.constants import APIScorer
+from judgeval.scorers import SummarizationScorer
 
 client = JudgmentClient()
 example = Example(
@@ -52,7 +51,7 @@ example = Example(
     actual_output="...",
 )
 # supply your own threshold
-scorer =
+scorer = SummarizationScorer(threshold=0.8)
 
 results = client.run_evaluation(
     examples=[example],
@@ -27,10 +27,9 @@ TODO add more docs here regarding tool ordering, exact match, or even correct to
 ## Sample Implementation
 
 ```python tool_correctness.py
-from judgeval
+from judgeval import JudgmentClient
 from judgeval.data import Example
-from judgeval.scorers import
-from judgeval.constants import APIScorer
+from judgeval.scorers import ToolCorrectnessScorer
 
 client = JudgmentClient()
 example = Example(
@@ -40,7 +39,7 @@ example = Example(
     expected_output=["DBQuery", "GoogleSearch"],
 )
 # supply your own threshold
-scorer =
+scorer = ToolCorrectnessScorer(threshold=0.8)
 
 results = client.run_evaluation(
     examples=[example],
@@ -19,19 +19,23 @@ access our state-of-the-art judge models, and manage your evaluations/datasets o
 Once you have a key, you can set the environment variable `JUDGMENT_API_KEY` to your key.
 This allows the `JudgmentClient` to authenticate your requests to the Judgment API.
 
+```
+export JUDGMENT_API_KEY="your_key_here"
+```
+
 To receive a key, please email us at `contact@judgmentlabs.ai`.
 
 
 <Note>
 Running evaluations on Judgment Labs' infrastructure is recommended for
 large-scale evaluations. [Contact us](mailto:contact@judgmentlabs.ai) if you're dealing with
-sensitive data that has to reside in your private VPCs
+sensitive data that has to reside in your private VPCs.
 </Note>
 
 # Create your first evaluation
 
 ```python sample_eval.py
-from judgeval
+from judgeval import JudgmentClient
 from judgeval.data import Example
 from judgeval.scorers import FaithfulnessScorer
 
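
Editor's note: the docs above state that exporting `JUDGMENT_API_KEY` is what lets `JudgmentClient` authenticate. A minimal sketch of that setup step, assuming the client reads the variable from the environment (the key handling shown here is not part of the package):

```python
import os

from judgeval import JudgmentClient

# Fail fast with a clear message if the key was not exported in the shell.
if "JUDGMENT_API_KEY" not in os.environ:
    raise RuntimeError("Set JUDGMENT_API_KEY before creating a JudgmentClient.")

client = JudgmentClient()  # assumed to pick up JUDGMENT_API_KEY for authentication
```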
@@ -58,16 +62,16 @@ Congratulations! Your evaluation should have passed. Let's break down what happe
 - The variable `retrieval_context` represents the retrieved context from your knowledge base and `FaithfulnessScorer(threshold=0.5)`
 is a scorer that checks if the output is hallucinated relative to the retrieved context.
 - Scorers give values between 0 - 1 and we set the threshold for this scorer to 0.5 in the context of a unit test. If you are interested in measuring rather than testing, you can ignore this threshold and reference the `score` field alone.
-- We chose `gpt-4o` as our judge model for faithfulness. Judgment Labs offers ANY judge model for your evaluation needs.
+- We chose `gpt-4o` as our judge model for faithfulness. Judgment Labs offers ANY judge model for your evaluation needs. Consider trying out our state-of-the-art judge models for your next evaluation!
 
 # Create Your First Scorer
-`judgeval` offers three kinds of LLM scorers for your evaluation needs: ready-made,
+`judgeval` offers three kinds of LLM scorers for your evaluation needs: ready-made, classifier scorers, and custom scorers.
 
 ## Ready-made Scorers
 Judgment Labs provides default implementations of 10+ research-backed metrics covering evaluation needs ranging from hallucination detection to RAG retrieval quality. To create a ready-made scorer, just import it directly from `judgeval.scorers`:
 
 ```python scorer_example.py
-from judgeval
+from judgeval import JudgmentClient
 from judgeval.data import Example
 from judgeval.scorers import FaithfulnessScorer
 
@@ -91,15 +95,29 @@ print(results)
 For a complete list of ready-made scorers, see the [scorers docs](/evaluation/scorers).
 </Note>
 
-##
+## Classifier Scorers
 `judgeval` allows you to create custom scorers using natural language. These can range from simple judges to powerful evaluators for your LLM systems.
 
+```python classifier_scorer.py
+from judgeval.scorers import ClassifierScorer
+
+classifier_scorer = ClassifierScorer(
+    name="Tone Scorer",
+    threshold=0.9,
+    conversation=[
+        {
+            "role": "system",
+            "content": "Is the response positive (Y/N)? The response is: {{actual_output}}."
+        }
+    ],
+    options={"Y": 1, "N": 0}
+)
 ```
-
-
+
+To learn more about `ClassifierScorer`s, click [here](/evaluation/scorers/classifier_scorer).
 
 ## Custom Scorers
-If you find that none of the ready-made scorers or
+If you find that none of the ready-made scorers or classifier scorers fit your needs, you can easily create your own custom scorer.
 These can be as simple or complex as you need them to be and **_do not_** have to use an LLM judge model.
 Here's an example of computing BLEU scores:
 
@@ -148,7 +166,7 @@ If you're interested in measuring multiple metrics at once, you can group scorer
 regardless of the type of scorer.
 
 ```python multiple_scorers.py
-from judgeval
+from judgeval import JudgmentClient
 from judgeval.scorers import FaithfulnessScorer, SummarizationScorer
 
 client = JudgmentClient()
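
Editor's note: the `multiple_scorers.py` snippet is cut off at the hunk boundary above. A plausible completion sketch (not the file's actual continuation; the example fields are placeholders) showing several scorers grouped into one evaluation run, as the surrounding prose describes:

```python
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer, SummarizationScorer

client = JudgmentClient()

# Placeholder example; real field values depend on your application.
example = Example(
    input="Summarize our return policy.",
    actual_output="All items can be returned within 30 days for a full refund.",
    retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."],
)

# Both scorers run against the same example in a single evaluation run.
results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5), SummarizationScorer(threshold=0.5)],
    model="gpt-4o",
)
print(results)
```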
@@ -221,41 +239,6 @@ Work in progress!
 
 Work in progress!
 
-## Creating ClassifierScorers
-
-ClassifierScorers are **powerful** evaluators that can be created in minutes via Judgment's platform or SDK
-using **natural language criteria**.
-
-<Tip>
-For more information on what a ClassifierScorer is, click [here](/evaluation/scorers/classifier_scorer).
-</Tip>
-
-**Here's how to create a ClassifierScorer:**
-
-1. Navigate to the `Scorers` tab in the Judgment platform. You'll find this on via the sidebar on the left.
-2. Click the `Create Scorer` button in the top right corner.
-
-
-
-3. Here, you can create a custom scorer by using a criteria in natural language, supplying custom arguments from the [`Example`](evaluation/data_examples) class.
-Then, you supply a set of **choices** the scorer can select from when evaluating an example. Finally, you can test your scorer on samples in our playground.
-
-4. Once you're finished, you can save the scorer and use it in your evaluation runs just like any other scorer in `judgeval`.
-
-### Example
-
-Here's an example of building a `ClassifierScorer` that checks if the LLM's tone is too aggressive.
-This might be useful when building a customer support chatbot.
-
-
-
-<Tip>
-A great use of ClassifierScorers is to prototype an evaluation criteria on a small set of examples before
-using it to benchmark your workflow.
-
-To learn more about `ClassifierScorer`s, click [here](/evaluation/scorers/classifier_scorer).
-</Tip>
-
 ## Optimizing Your LLM System
 
 Evaluation is a **prerequisite** for optimizing your LLM systems. Measuring the quality of your LLM workflows
@@ -284,7 +267,9 @@ Beyond experimenting and measuring historical performance, `judgeval` supports m
 Using our `tracing` module, you can **track your LLM system outputs from end to end**, allowing you to visualize the flow of your LLM system.
 Additionally, you can **enable evaluations to run in real-time** using Judgment's state-of-the-art judge models.
 
-
+<div style={{display: 'flex', justifyContent: 'center'}}>
+
+</div>
 
 There are many benefits of monitoring your LLM systems in production with `judgeval`, including:
 - Detecting hallucinations and other quality issues **before they reach your customers**
Binary file
@@ -25,10 +25,6 @@
       "url": "https://github.com/judgmentlabs"
     }
   ],
-  "topbarCtaButton": {
-    "name": "Dashboard",
-    "url": "https://dashboard.mintlify.com"
-  },
   "tabs": [
     {
       "name": "Tutorials",
@@ -60,6 +56,7 @@
       "group": "Scorers",
       "pages": [
         "evaluation/scorers/introduction",
+        "evaluation/scorers/answer_correctness",
         "evaluation/scorers/answer_relevancy",
         "evaluation/scorers/contextual_precision",
         "evaluation/scorers/contextual_recall",
@@ -76,6 +73,14 @@
       "evaluation/judges"
       ]
     },
+    {
+      "group": "Monitoring",
+      "pages": [
+        "monitoring/introduction",
+        "monitoring/tracing",
+        "monitoring/production_insights"
+      ]
+    },
     {
       "group": "Judgment Platform",
       "pages": [
File without changes
|