judgeval 0.0.22__tar.gz → 0.0.24__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- judgeval-0.0.24/PKG-INFO +156 -0
- judgeval-0.0.24/README.md +119 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/data_datasets.mdx +7 -24
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/data_examples.mdx +7 -53
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/custom_scorers.mdx +3 -3
- judgeval-0.0.24/docs/evaluation/scorers/groundedness.mdx +65 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/introduction.mdx +10 -23
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/getting_started.mdx +4 -8
- judgeval-0.0.24/docs/integration/langgraph.mdx +53 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/judgment/introduction.mdx +4 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/monitoring/tracing.mdx +1 -1
- {judgeval-0.0.22 → judgeval-0.0.24}/pyproject.toml +1 -1
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/tracer.py +48 -252
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/__init__.py +1 -2
- judgeval-0.0.24/src/judgeval/integrations/langgraph.py +316 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorer.py +2 -2
- judgeval-0.0.22/PKG-INFO +0 -40
- judgeval-0.0.22/README.md +0 -3
- judgeval-0.0.22/docs/integration/langgraph.mdx +0 -28
- judgeval-0.0.22/src/demo/cookbooks/JNPR_Mist/test.py +0 -21
- judgeval-0.0.22/src/demo/cookbooks/linkd/text2sql.py +0 -14
- judgeval-0.0.22/src/demo/custom_example_demo/qodo_example.py +0 -39
- judgeval-0.0.22/src/demo/custom_example_demo/test.py +0 -16
- judgeval-0.0.22/src/judgeval/data/custom_example.py +0 -98
- judgeval-0.0.22/src/judgeval/data/datasets/utils.py +0 -0
- judgeval-0.0.22/src/judgeval/data/ground_truth.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/.github/workflows/ci.yaml +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/.gitignore +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/LICENSE.md +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/Pipfile +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/Pipfile.lock +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/README.md +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/api_reference/judgment_client.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/api_reference/trace.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/development.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/code.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/images.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/markdown.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/navigation.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/reusable-snippets.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/settings.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/introduction.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/judges.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/answer_correctness.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/answer_relevancy.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/classifier_scorer.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/comparison.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/contextual_precision.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/contextual_recall.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/contextual_relevancy.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/execution_order.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/faithfulness.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/hallucination.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/json_correctness.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/summarization.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/unit_testing.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/favicon.svg +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/basic_trace_example.png +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/checks-passed.png +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/create_aggressive_scorer.png +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/create_scorer.png +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/evaluation_diagram.png +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/hero-dark.svg +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/hero-light.svg +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/online_eval_fault.png +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/trace_ss.png +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/introduction.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/logo/dark.svg +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/logo/light.svg +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/mint.json +1 -1
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/monitoring/introduction.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/monitoring/production_insights.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/create_dataset.ipynb +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/create_scorer.ipynb +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/demo.ipynb +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/prompt_scorer.ipynb +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/quickstart.ipynb +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/quickstart.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/docs/snippets/snippet-intro.mdx +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/pytest.ini +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/clients.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/exceptions.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/logger.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/utils.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/constants.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/api_example.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/datasets/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/datasets/dataset.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/datasets/eval_dataset_client.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/example.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/result.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/scorer_data.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/evaluation_run.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/base_judge.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/litellm_judge.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/mixture_of_judges.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/together_judge.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/utils.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judgment_client.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/rules.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/run_evaluation.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/api_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/base_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/exceptions.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_correctness.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_relevancy.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/comparison.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_precision.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_recall.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_relevancy.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/execution_order.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/faithfulness.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/groundedness.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/hallucination.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/instruction_adherence.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/json_correctness.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/summarization.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/classifiers/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/classifiers/text2sql/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/classifiers/text2sql/text2sql_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/answer_correctness_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/answer_relevancy_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/comparison/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/comparison/comparison_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/comparison/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/contextual_precision_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/contextual_recall_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/contextual_relevancy_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/execution_order/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/execution_order/execution_order.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/faithfulness_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/hallucination_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/instruction_adherence/instruction_adherence.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/instruction_adherence/prompt.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/json_correctness/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/json_correctness/json_correctness_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/prompts.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/summarization_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/prompt_scorer.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/score.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/utils.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/tracer/__init__.py +0 -0
- {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/utils/alerts.py +0 -0
judgeval-0.0.24/PKG-INFO
ADDED
@@ -0,0 +1,156 @@
+Metadata-Version: 2.4
+Name: judgeval
+Version: 0.0.24
+Summary: Judgeval Package
+Project-URL: Homepage, https://github.com/JudgmentLabs/judgeval
+Project-URL: Issues, https://github.com/JudgmentLabs/judgeval/issues
+Author-email: Andrew Li <andrew@judgmentlabs.ai>, Alex Shan <alex@judgmentlabs.ai>, Joseph Camyre <joseph@judgmentlabs.ai>
+License-Expression: Apache-2.0
+License-File: LICENSE.md
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Requires-Python: >=3.11
+Requires-Dist: anthropic
+Requires-Dist: fastapi
+Requires-Dist: langchain
+Requires-Dist: langchain-anthropic
+Requires-Dist: langchain-core
+Requires-Dist: langchain-huggingface
+Requires-Dist: langchain-openai
+Requires-Dist: litellm
+Requires-Dist: nest-asyncio
+Requires-Dist: openai
+Requires-Dist: openpyxl
+Requires-Dist: pandas
+Requires-Dist: pika
+Requires-Dist: python-dotenv==1.0.1
+Requires-Dist: requests
+Requires-Dist: supabase
+Requires-Dist: together
+Requires-Dist: uvicorn
+Provides-Extra: dev
+Requires-Dist: pytest-asyncio>=0.25.0; extra == 'dev'
+Requires-Dist: pytest-mock>=3.14.0; extra == 'dev'
+Requires-Dist: pytest>=8.3.4; extra == 'dev'
+Requires-Dist: tavily-python; extra == 'dev'
+Description-Content-Type: text/markdown
+
+# Judgeval SDK
+
+Judgeval is an open-source framework for building evaluation pipelines for multi-step agent workflows, supporting both real-time and experimental evaluation setups. To learn more about Judgment or sign up for free, visit our [website](https://www.judgmentlabs.ai/) or check out our [developer docs](https://judgment.mintlify.app/getting_started).
+
+## Features
+
+- **Development and Production Evaluation Layer**: Offers a robust evaluation layer for multi-step agent applications, including unit-testing and performance monitoring.
+- **Plug-and-Evaluate**: Integrate LLM systems with 10+ research-backed metrics, including:
+  - Hallucination detection
+  - RAG retriever quality
+  - And more
+- **Custom Evaluation Pipelines**: Construct powerful custom evaluation pipelines tailored for your LLM systems.
+- **Monitoring in Production**: Utilize state-of-the-art real-time evaluation foundation models to monitor LLM systems effectively.
+
+## Installation
+
+```bash
+pip install judgeval
+```
+
+## Quickstart: Evaluations
+
+You can evaluate your workflow execution data to measure quality metrics such as hallucination.
+
+Create a file named `evaluate.py` with the following code:
+
+```python
+from judgeval import JudgmentClient
+from judgeval.data import Example
+from judgeval.scorers import FaithfulnessScorer
+
+client = JudgmentClient()
+
+example = Example(
+    input="What if these shoes don't fit?",
+    actual_output="We offer a 30-day full refund at no extra cost.",
+    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
+)
+
+scorer = FaithfulnessScorer(threshold=0.5)
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[scorer],
+    model="gpt-4o",
+)
+print(results)
+```
+Click [here](https://judgment.mintlify.app/getting_started#create-your-first-experiment) for a more detailed explanation
+
+## Quickstart: Traces
+
+Track your workflow execution for full observability with just a few lines of code.
+
+Create a file named `traces.py` with the following code:
+
+```python
+from judgeval.common.tracer import Tracer, wrap
+from openai import OpenAI
+
+client = wrap(OpenAI())
+judgment = Tracer(project_name="my_project")
+
+@judgment.observe(span_type="tool")
+def my_tool():
+    return "Hello world!"
+
+@judgment.observe(span_type="function")
+def main():
+    task_input = my_tool()
+    res = client.chat.completions.create(
+        model="gpt-4o",
+        messages=[{"role": "user", "content": f"{task_input}"}]
+    )
+    return res.choices[0].message.content
+```
+Click [here](https://judgment.mintlify.app/getting_started#create-your-first-trace) for a more detailed explanation
+
+## Quickstart: Online Evaluations
+
+Apply performance monitoring to measure the quality of your systems in production, not just on historical data.
+
+Using the same traces.py file we created earlier:
+
+```python
+from judgeval.common.tracer import Tracer, wrap
+from judgeval.scorers import AnswerRelevancyScorer
+from openai import OpenAI
+
+client = wrap(OpenAI())
+judgment = Tracer(project_name="my_project")
+
+@judgment.observe(span_type="tool")
+def my_tool():
+    return "Hello world!"
+
+@judgment.observe(span_type="function")
+def main():
+    task_input = my_tool()
+    res = client.chat.completions.create(
+        model="gpt-4o",
+        messages=[{"role": "user", "content": f"{task_input}"}]
+    ).choices[0].message.content
+
+    judgment.get_current_trace().async_evaluate(
+        scorers=[AnswerRelevancyScorer(threshold=0.5)],
+        input=task_input,
+        actual_output=res,
+        model="gpt-4o"
+    )
+
+    return res
+```
+Click [here](https://judgment.mintlify.app/getting_started#create-your-first-online-evaluation) for a more detailed explanation
+
+## Documentation and Demos
+
+For more detailed documentation, please check out our [docs](https://judgment.mintlify.app/getting_started) and some of our [demo videos](https://www.youtube.com/@AlexShan-j3o) for reference!
+
+##
judgeval-0.0.24/README.md
ADDED
@@ -0,0 +1,119 @@
+# Judgeval SDK
+
+Judgeval is an open-source framework for building evaluation pipelines for multi-step agent workflows, supporting both real-time and experimental evaluation setups. To learn more about Judgment or sign up for free, visit our [website](https://www.judgmentlabs.ai/) or check out our [developer docs](https://judgment.mintlify.app/getting_started).
+
+## Features
+
+- **Development and Production Evaluation Layer**: Offers a robust evaluation layer for multi-step agent applications, including unit-testing and performance monitoring.
+- **Plug-and-Evaluate**: Integrate LLM systems with 10+ research-backed metrics, including:
+  - Hallucination detection
+  - RAG retriever quality
+  - And more
+- **Custom Evaluation Pipelines**: Construct powerful custom evaluation pipelines tailored for your LLM systems.
+- **Monitoring in Production**: Utilize state-of-the-art real-time evaluation foundation models to monitor LLM systems effectively.
+
+## Installation
+
+```bash
+pip install judgeval
+```
+
+## Quickstart: Evaluations
+
+You can evaluate your workflow execution data to measure quality metrics such as hallucination.
+
+Create a file named `evaluate.py` with the following code:
+
+```python
+from judgeval import JudgmentClient
+from judgeval.data import Example
+from judgeval.scorers import FaithfulnessScorer
+
+client = JudgmentClient()
+
+example = Example(
+    input="What if these shoes don't fit?",
+    actual_output="We offer a 30-day full refund at no extra cost.",
+    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
+)
+
+scorer = FaithfulnessScorer(threshold=0.5)
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[scorer],
+    model="gpt-4o",
+)
+print(results)
+```
+Click [here](https://judgment.mintlify.app/getting_started#create-your-first-experiment) for a more detailed explanation
+
+## Quickstart: Traces
+
+Track your workflow execution for full observability with just a few lines of code.
+
+Create a file named `traces.py` with the following code:
+
+```python
+from judgeval.common.tracer import Tracer, wrap
+from openai import OpenAI
+
+client = wrap(OpenAI())
+judgment = Tracer(project_name="my_project")
+
+@judgment.observe(span_type="tool")
+def my_tool():
+    return "Hello world!"
+
+@judgment.observe(span_type="function")
+def main():
+    task_input = my_tool()
+    res = client.chat.completions.create(
+        model="gpt-4o",
+        messages=[{"role": "user", "content": f"{task_input}"}]
+    )
+    return res.choices[0].message.content
+```
+Click [here](https://judgment.mintlify.app/getting_started#create-your-first-trace) for a more detailed explanation
+
+## Quickstart: Online Evaluations
+
+Apply performance monitoring to measure the quality of your systems in production, not just on historical data.
+
+Using the same traces.py file we created earlier:
+
+```python
+from judgeval.common.tracer import Tracer, wrap
+from judgeval.scorers import AnswerRelevancyScorer
+from openai import OpenAI
+
+client = wrap(OpenAI())
+judgment = Tracer(project_name="my_project")
+
+@judgment.observe(span_type="tool")
+def my_tool():
+    return "Hello world!"
+
+@judgment.observe(span_type="function")
+def main():
+    task_input = my_tool()
+    res = client.chat.completions.create(
+        model="gpt-4o",
+        messages=[{"role": "user", "content": f"{task_input}"}]
+    ).choices[0].message.content
+
+    judgment.get_current_trace().async_evaluate(
+        scorers=[AnswerRelevancyScorer(threshold=0.5)],
+        input=task_input,
+        actual_output=res,
+        model="gpt-4o"
+    )
+
+    return res
+```
+Click [here](https://judgment.mintlify.app/getting_started#create-your-first-online-evaluation) for a more detailed explanation
+
+## Documentation and Demos
+
+For more detailed documentation, please check out our [docs](https://judgment.mintlify.app/getting_started) and some of our [demo videos](https://www.youtube.com/@AlexShan-j3o) for reference!
+
+##
{judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/data_datasets.mdx
CHANGED
@@ -3,19 +3,14 @@ title: Datasets
 ---
 ## Overview
 In most scenarios, you will have multiple `Example`s that you want to evaluate together.
-In `judgeval`, an evaluation dataset (`EvalDataset`) is a collection of `Example`s
+In `judgeval`, an evaluation dataset (`EvalDataset`) is a collection of `Example`s that you can scale evaluations across.
 
-<Note>
-A `GroundTruthExample` is a specific type of `Example` that do not require the `actual_output` field.
-
-This is useful for creating datasets that can be **dynamically updated at evaluation time** by running your workflow on the GroundTruthExamples to create Examples.
-</Note>
 ## Creating a Dataset
 
-Creating an `EvalDataset` is as simple as supplying a list of `Example`s
+Creating an `EvalDataset` is as simple as supplying a list of `Example`s.
 
 ```python create_dataset.py
-from judgeval.data import Example
+from judgeval.data import Example
 from judgeval.data.datasets import EvalDataset
 
 examples = [
@@ -23,25 +18,19 @@ examples = [
     Example(input="...", actual_output="..."),
     ...
 ]
-
-    GroundTruthExample(input="..."),
-    GroundTruthExample(input="..."),
-    ...
-]
+
 
 dataset = EvalDataset(
-    examples=examples
-    ground_truth_examples=ground_truth_examples
+    examples=examples
 )
 ```
 
-You can also add `Example`s
+You can also add `Example`s to an existing `EvalDataset` using the `add_example` method.
 
 ```python add_to_dataset.py
 ...
 
 dataset.add_example(Example(...))
-dataset.add_ground_truth(GroundTruthExample(...))
 ```
 
 ## Saving/Loading Datasets
@@ -81,12 +70,6 @@ You can save/load an `EvalDataset` with a JSON file. Your JSON file should have
             "actual_output": "..."
         },
         ...
-    ],
-    "ground_truths": [
-        {
-            "input": "..."
-        },
-        ...
     ]
 }
 ```
@@ -154,7 +137,7 @@ examples:
 
 ## Evaluate On Your Dataset
 
-You can use the `JudgmentClient` to evaluate the `Example`s
+You can use the `JudgmentClient` to evaluate the `Example`s in your dataset using scorers.
 
 ```python evaluate_dataset.py
 ...
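For context on the `evaluate_dataset.py` snippet elided above (`...`), here is a minimal sketch of how the pieces shown in this file might fit together. The only assumption beyond what the diffed docs show is that the dataset's `Example`s are reachable through a `dataset.examples` attribute; the exact dataset-evaluation API is not part of this diff.

```python
# Hypothetical sketch; `dataset.examples` is an assumption, everything else
# (EvalDataset, add_example, run_evaluation) appears in the docs diffed above.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.data.datasets import EvalDataset
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

dataset = EvalDataset(
    examples=[
        Example(
            input="What if these shoes don't fit?",
            actual_output="We offer a 30-day full refund at no extra cost.",
            retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
        ),
    ]
)
# Add more examples to an existing dataset.
dataset.add_example(Example(input="Do you ship internationally?", actual_output="Yes, to most countries."))

results = client.run_evaluation(
    examples=dataset.examples,  # assumption: the collected Examples are exposed here
    scorers=[FaithfulnessScorer(threshold=0.5)],
    model="gpt-4o",
)
print(results)
```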
{judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/data_examples.mdx
CHANGED
@@ -4,14 +4,12 @@ title: Examples
 
 ## Overview
 An `Example` is a basic unit of data in `judgeval` that allows you to run evaluation scorers on your LLM system.
-An `Example` is composed of
-- `input`
-- `actual_output`
-- [Optional]
-- [Optional]
-- [Optional]
-- [Optional] `tools_called`
-- [Optional] `expected_tools`
+An `Example` can be composed of a mixture of the following fields:
+- `input` [Optional]
+- `actual_output` [Optional]
+- `expected_output` [Optional]
+- `retrieval_context` [Optional]
+- `context` [Optional]
 
 **Here's a sample of creating an `Example`:**
 
@@ -24,8 +22,6 @@ example = Example(
     expected_output="Bill Gates and Paul Allen founded Microsoft in New Mexico in 1975.",
     retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
     context=["Bill Gates and Paul Allen are the founders of Microsoft."],
-    tools_called=["Google Search"],
-    expected_tools=["Google Search", "Perplexity"],
 )
 ```
 
@@ -39,7 +35,7 @@ Other fields are optional and depend on the type of evaluation. If you want to d
 
 ## Example Fields
 
-Here, we cover the
+Here, we cover the possible fields that make up an `Example`.
 
 ### Input
 The `input` field represents a sample interaction between a user and your LLM system. The input should represent the direct input to your prompt template(s), and **SHOULD NOT CONTAIN** your prompt template itself.
@@ -137,48 +133,6 @@ example = Example(
 )
 ```
 
-<Note>
-`context` is the ideal retrieval result for a specific `input`, whereas `retrieval_context` is the actual retrieval result at runtime. While they are similar, they are not always interchangeable.
-</Note>
-### Tools Called
-
-The `tools_called` field is `Optional[List[str]]` and represents the tools that were called by the LLM system. This is particularly useful for evaluating whether agents are properly using tools available to them.
-
-```python tools_called.py
-# Sample app implementation
-import medical_chatbot
-
-question = "Is sparkling water healthy?"
-example = Example(
-    input=question,
-    actual_output=medical_chatbot.chat(question),
-    expected_output="Sparkling water is neither healthy nor unhealthy.",
-    context=["Sparkling water is a type of water that is carbonated."],
-    retrieval_context=["Sparkling water is carbonated and has no calories."],
-    tools_called=["Perplexity", "GoogleSearch"]
-)
-```
-
-### Expected Tools
-
-The `expected_tools` field is `Optional[List[str]]` and represents the tools that are expected to be called by the LLM system. This is particularly useful for evaluating whether agents are properly using tools available to them.
-
-```python expected_tools.py
-# Sample app implementation
-import medical_chatbot
-
-question = "Is sparkling water healthy?"
-example = Example(
-    input=question,
-    actual_output=medical_chatbot.chat(question),
-    expected_output="Sparkling water is neither healthy nor unhealthy.",
-    context=["Sparkling water is a type of water that is carbonated."],
-    retrieval_context=["Sparkling water is carbonated and has no calories."],
-    tools_called=["Perplexity", "GoogleSearch"],
-    expected_tools=["Perplexity", "DBQuery"]
-)
-```
-
 ## Conclusion
 
 Congratulations! 🎉
{judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/custom_scorers.mdx
CHANGED
@@ -116,9 +116,9 @@ class SampleScorer(JudgevalScorer):
 ```
 
 
-### 4. Implement the `
+### 4. Implement the `_success_check()` method
 
-When executing an evaluation run, `judgeval` will check if your scorer has passed the `
+When executing an evaluation run, `judgeval` will check if your scorer has passed the `_success_check()` method.
 
 You can implement this method in any way you want, but **it should return a `bool`.** Here's a perfectly valid implementation:
 
@@ -126,7 +126,7 @@ You can implement this method in any way you want, but **it should return a `boo
 class SampleScorer(JudgevalScorer):
     ...
 
-    def
+    def _success_check(self):
         if self.error is not None:
             return False
         return self.score >= self.threshold # or you can do self.success if set
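For reference alongside the hunks above, here is a minimal end-to-end sketch of a custom scorer that wires together the three methods named in this file (`score_example()`, `a_score_example()`, `_success_check()`). The base-class details are assumptions: the diff only confirms that `JudgevalScorer` subclasses use `self.score`, `self.threshold`, `self.error`, and `self.success`, and that the class lives in `src/judgeval/scorers/judgeval_scorer.py`.

```python
# Hypothetical sketch of a custom scorer; the import path and the idea that the
# base class stores `threshold` and initializes `error` are assumptions.
from judgeval.data import Example
from judgeval.scorers import JudgevalScorer


class LengthRatioScorer(JudgevalScorer):
    """Scores how closely the actual output's length matches the expected output's length."""

    def score_example(self, example: Example) -> float:
        try:
            expected_len = len(example.expected_output or "")
            actual_len = len(example.actual_output or "")
            # Ratio in [0, 1]; 1.0 means the two lengths match exactly.
            self.score = min(expected_len, actual_len) / max(expected_len, actual_len, 1)
            self.success = self.score >= self.threshold
        except Exception as e:
            self.error = str(e)
            self.success = False
        return self.score

    async def a_score_example(self, example: Example) -> float:
        # Same logic as the sync version, which the docs above say is acceptable.
        return self.score_example(example)

    def _success_check(self) -> bool:
        if self.error is not None:
            return False
        return self.score >= self.threshold
```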
judgeval-0.0.24/docs/evaluation/scorers/groundedness.mdx
ADDED
@@ -0,0 +1,65 @@
+---
+title: Groundedness
+description: ""
+---
+
+The `Groundedness` scorer is a default LLM judge scorer that measures whether the `actual_output` is aligned with both the task instructions in `input` and the knowledge base in `retrieval_context`.
+In practice, this scorer helps determine if your RAG pipeline's generator is producing hallucinations or misinterpreting task instructions.
+
+**For optimal Groundedness scoring, check out our leading evaluation foundation model research here! TODO add link here.**
+
+<Note>
+The `Groundedness` scorer is a binary metric (1 or 0) that evaluates both instruction adherence and factual accuracy.
+
+Unlike the `Faithfulness` scorer which measures the degree of contradiction with retrieval context, `Groundedness` provides a pass/fail assessment based on both the task instructions and knowledge base.
+</Note>
+
+## Required Fields
+
+To run the `Groundedness` scorer, you must include the following fields in your `Example`:
+- `input`
+- `actual_output`
+- `retrieval_context`
+
+## Scorer Breakdown
+
+`Groundedness` scores are binary (1 or 0) and determined by checking:
+1. Whether the `actual_output` correctly interprets the task instructions in `input`
+2. Whether the `actual_output` contains any contradictions with the knowledge base in `retrieval_context`
+
+A response is considered grounded (score = 1) only if it:
+- Correctly follows the task instructions
+- Does not contradict any information in the knowledge base
+- Does not introduce hallucinated facts not supported by the retrieval context
+
+If there are any contradictions or misinterpretations, the scorer will fail (score = 0).
+
+## Sample Implementation
+
+```python groundedness.py
+from judgeval import JudgmentClient
+from judgeval.data import Example
+from judgeval.scorers import GroundednessScorer
+
+client = JudgmentClient()
+example = Example(
+    input="You are a helpful assistant for a clothing store. Make sure to follow the company's policies surrounding returns.",
+    actual_output="We offer a 30-day return policy for all items, including socks!",
+    retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."]
+)
+scorer = GroundednessScorer()
+
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[scorer],
+    model="gpt-4o",
+)
+print(results)
+```
+
+<Note>
+The `Groundedness` scorer uses an LLM judge, so you'll receive a reason for the score in the `reason` field of the results.
+This allows you to double-check the accuracy of the evaluation and understand how the score was calculated.
+</Note>
+
+
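Since the note in this new page contrasts `Groundedness` with `Faithfulness`, a short sketch of running both scorers over the same example may be a useful companion. It only combines APIs already shown in this diff, with the assumption that `run_evaluation` accepts multiple scorers in one call.

```python
# Runs the binary Groundedness scorer alongside the threshold-based Faithfulness
# scorer on the same example; passing two scorers at once is an assumption.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer, GroundednessScorer

client = JudgmentClient()

example = Example(
    input="You are a helpful assistant for a clothing store. Make sure to follow the company's policies surrounding returns.",
    actual_output="We offer a 30-day return policy for all items, including socks!",
    retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."],
)

results = client.run_evaluation(
    examples=[example],
    scorers=[GroundednessScorer(), FaithfulnessScorer(threshold=0.5)],  # pass/fail vs. thresholded degree
    model="gpt-4o",
)
print(results)  # LLM-judge scorers also return a `reason` explaining the score
```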
{judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/introduction.mdx
CHANGED
@@ -12,11 +12,14 @@ Scorers act as measurement tools for evaluating LLM systems based on specific cr
 - [Contextual Precision](/evaluation/scorers/contextual_precision)
 - [Contextual Recall](/evaluation/scorers/contextual_recall)
 - [Contextual Relevancy](/evaluation/scorers/contextual_relevancy)
+- [Execution Order](/evaluation/scorers/execution_order)
 - [Faithfulness](/evaluation/scorers/faithfulness)
 - [Hallucination](/evaluation/scorers/hallucination)
-- [Summarization](/evaluation/scorers/summarization)
-- [Execution Order](/evaluation/scorers/execution_order)
 - [JSON Correctness](/evaluation/scorers/json_correctness)
+- [Summarization](/evaluation/scorers/summarization)
+
+We also understand that you may need to evaluate your LLM system with metrics that are not covered by our default scorers.
+To support this, we provide a flexible framework for creating these scorers:
 - [Custom Scorers](/evaluation/scorers/custom_scorers)
 - [Classifier Scorers](/evaluation/scorers/classifier_scorer)
 
@@ -24,13 +27,9 @@ Scorers act as measurement tools for evaluating LLM systems based on specific cr
 We're always adding new scorers to `judgeval`. If you have a suggestion, please [let us know](mailto:contact@judgmentlabs.ai)!
 </Tip>
 
-Scorers execute on `Example`s
+Scorers execute on `Example`s and `EvalDataset`s, producing a **numerical score**.
 This enables you to **use evaluations as unit tests** by setting a `threshold` to determine whether an evaluation was successful or not.
 
-<Note>
-Built-in scorers will succeed if the score is greater than or equal to the `threshold`.
-</Note>
-
 ## Categories of Scorers
 `judgeval` supports three categories of scorers.
 - **Default Scorers**: built-in scorers that are ready to use
@@ -57,17 +56,17 @@ If you find that none of the default scorers meet your evaluation needs, setting
 You can create a custom scorer by inheriting from the `JudgevalScorer` class and implementing three methods:
 - `score_example()`: produces a score for a single `Example`.
 - `a_score_example()`: async version of `score_example()`. You may use the same implementation logic as `score_example()`.
-- `
+- `_success_check()`: determines whether an evaluation was successful.
 
 Custom scorers can be as simple or complex as you want, and **do not need to use LLMs**.
-For sample implementations, check out the
+For sample implementations, check out the [Custom Scorers](/evaluation/scorers/custom_scorers) documentation page.
 
 
 ### Classifier Scorers
 
 Classifier scorers are a special type of custom scorer that can evaluate your LLM system using natural language criteria.
 
-
+They can either be defined using our judgeval SDK or using the Judgment Platform directly. For more information, check out the [Classifier Scorers](/evaluation/scorers/classifier_scorer) documentation page.
 
 ## Running Scorers
 
@@ -80,22 +79,10 @@ client = JudgmentClient()
 results = client.run_evaluation(
     examples=[example],
     scorers=[scorer],
-    model="gpt-4o
+    model="gpt-4o",
 )
 ```
 
-If you want to execute a `JudgevalScorer` without running it through the `JudgmentClient`, you can score locally.
-Simply use the `score_example()` or `a_score_example()` method directly:
-
-```python direct_scoring.py
-...
-
-example = Example(input="...", actual_output="...")
-
-scorer = JudgevalScorer() # Your scorer here
-score = scorer.score_example(example)
-```
-
 <Tip>
 To learn about how a certain default scorer works, check out its documentation page for a deep dive into how scores are calculated and what fields are required.
 </Tip>
{judgeval-0.0.22 → judgeval-0.0.24}/docs/getting_started.mdx
CHANGED
@@ -62,7 +62,7 @@ Congratulations! Your evaluation should have passed. Let's break down what happe
 - The variable `input` mimics a user input and `actual_output` is a placeholder for what your LLM system returns based on the input.
 - The variable `retrieval_context` represents the retrieved context from your RAG knowledge base.
 - `FaithfulnessScorer(threshold=0.5)` is a scorer that checks if the output is hallucinated relative to the retrieved context.
-  - <Note>
+  - <Note>The threshold is used in the context of [unit testing](/evaluation/unit_testing).</Note>
 - We chose `gpt-4o` as our judge model to measure faithfulness. Judgment Labs offers ANY judge model for your evaluation needs.
 Consider trying out our state-of-the-art Osiris judge models for your next evaluation!
 
@@ -142,7 +142,7 @@ def main():
         messages=[{"role": "user", "content": f"{task_input}"}]
     ).choices[0].message.content
 
-    judgment.
+    judgment.async_evaluate(
         scorers=[AnswerRelevancyScorer(threshold=0.5)],
         input=task_input,
         actual_output=res,
@@ -280,14 +280,10 @@ results = client.run_evaluation(
 # Create Your First Dataset
 In most cases, you will not be running evaluations on a single example; instead, you will be scoring your LLM system on a dataset.
 Judgeval allows you to create datasets, save them, and run evaluations on them.
-An `EvalDataset` is a collection of `Example`s
-
-<Note>
-A `GroundTruthExample` is an `Example` that has no `actual_output` field since it will be generated at test time.
-</Note>
+An `EvalDataset` is a collection of `Example`s.
 
 ```python create_dataset.py
-from judgeval.data import Example,
+from judgeval.data import Example, EvalDataset
 
 example1 = Example(input="...", actual_output="...")
 example2 = Example(input="...", actual_output="...")