judgeval 0.0.22__tar.gz → 0.0.24__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (165)
  1. judgeval-0.0.24/PKG-INFO +156 -0
  2. judgeval-0.0.24/README.md +119 -0
  3. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/data_datasets.mdx +7 -24
  4. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/data_examples.mdx +7 -53
  5. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/custom_scorers.mdx +3 -3
  6. judgeval-0.0.24/docs/evaluation/scorers/groundedness.mdx +65 -0
  7. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/introduction.mdx +10 -23
  8. {judgeval-0.0.22 → judgeval-0.0.24}/docs/getting_started.mdx +4 -8
  9. judgeval-0.0.24/docs/integration/langgraph.mdx +53 -0
  10. {judgeval-0.0.22 → judgeval-0.0.24}/docs/judgment/introduction.mdx +4 -0
  11. {judgeval-0.0.22 → judgeval-0.0.24}/docs/monitoring/tracing.mdx +1 -1
  12. {judgeval-0.0.22 → judgeval-0.0.24}/pyproject.toml +1 -1
  13. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/tracer.py +48 -252
  14. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/__init__.py +1 -2
  15. judgeval-0.0.24/src/judgeval/integrations/langgraph.py +316 -0
  16. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorer.py +2 -2
  17. judgeval-0.0.22/PKG-INFO +0 -40
  18. judgeval-0.0.22/README.md +0 -3
  19. judgeval-0.0.22/docs/integration/langgraph.mdx +0 -28
  20. judgeval-0.0.22/src/demo/cookbooks/JNPR_Mist/test.py +0 -21
  21. judgeval-0.0.22/src/demo/cookbooks/linkd/text2sql.py +0 -14
  22. judgeval-0.0.22/src/demo/custom_example_demo/qodo_example.py +0 -39
  23. judgeval-0.0.22/src/demo/custom_example_demo/test.py +0 -16
  24. judgeval-0.0.22/src/judgeval/data/custom_example.py +0 -98
  25. judgeval-0.0.22/src/judgeval/data/datasets/utils.py +0 -0
  26. judgeval-0.0.22/src/judgeval/data/ground_truth.py +0 -0
  27. {judgeval-0.0.22 → judgeval-0.0.24}/.github/workflows/ci.yaml +0 -0
  28. {judgeval-0.0.22 → judgeval-0.0.24}/.gitignore +0 -0
  29. {judgeval-0.0.22 → judgeval-0.0.24}/LICENSE.md +0 -0
  30. {judgeval-0.0.22 → judgeval-0.0.24}/Pipfile +0 -0
  31. {judgeval-0.0.22 → judgeval-0.0.24}/Pipfile.lock +0 -0
  32. {judgeval-0.0.22 → judgeval-0.0.24}/docs/README.md +0 -0
  33. {judgeval-0.0.22 → judgeval-0.0.24}/docs/api_reference/judgment_client.mdx +0 -0
  34. {judgeval-0.0.22 → judgeval-0.0.24}/docs/api_reference/trace.mdx +0 -0
  35. {judgeval-0.0.22 → judgeval-0.0.24}/docs/development.mdx +0 -0
  36. {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/code.mdx +0 -0
  37. {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/images.mdx +0 -0
  38. {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/markdown.mdx +0 -0
  39. {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/navigation.mdx +0 -0
  40. {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/reusable-snippets.mdx +0 -0
  41. {judgeval-0.0.22 → judgeval-0.0.24}/docs/essentials/settings.mdx +0 -0
  42. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/introduction.mdx +0 -0
  43. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/judges.mdx +0 -0
  44. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/answer_correctness.mdx +0 -0
  45. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/answer_relevancy.mdx +0 -0
  46. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/classifier_scorer.mdx +0 -0
  47. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/comparison.mdx +0 -0
  48. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/contextual_precision.mdx +0 -0
  49. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/contextual_recall.mdx +0 -0
  50. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/contextual_relevancy.mdx +0 -0
  51. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/execution_order.mdx +0 -0
  52. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/faithfulness.mdx +0 -0
  53. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/hallucination.mdx +0 -0
  54. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/json_correctness.mdx +0 -0
  55. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/scorers/summarization.mdx +0 -0
  56. {judgeval-0.0.22 → judgeval-0.0.24}/docs/evaluation/unit_testing.mdx +0 -0
  57. {judgeval-0.0.22 → judgeval-0.0.24}/docs/favicon.svg +0 -0
  58. {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/basic_trace_example.png +0 -0
  59. {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/checks-passed.png +0 -0
  60. {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/create_aggressive_scorer.png +0 -0
  61. {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/create_scorer.png +0 -0
  62. {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/evaluation_diagram.png +0 -0
  63. {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/hero-dark.svg +0 -0
  64. {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/hero-light.svg +0 -0
  65. {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/online_eval_fault.png +0 -0
  66. {judgeval-0.0.22 → judgeval-0.0.24}/docs/images/trace_ss.png +0 -0
  67. {judgeval-0.0.22 → judgeval-0.0.24}/docs/introduction.mdx +0 -0
  68. {judgeval-0.0.22 → judgeval-0.0.24}/docs/logo/dark.svg +0 -0
  69. {judgeval-0.0.22 → judgeval-0.0.24}/docs/logo/light.svg +0 -0
  70. {judgeval-0.0.22 → judgeval-0.0.24}/docs/mint.json +1 -1
  71. {judgeval-0.0.22 → judgeval-0.0.24}/docs/monitoring/introduction.mdx +0 -0
  72. {judgeval-0.0.22 → judgeval-0.0.24}/docs/monitoring/production_insights.mdx +0 -0
  73. {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/create_dataset.ipynb +0 -0
  74. {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/create_scorer.ipynb +0 -0
  75. {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/demo.ipynb +0 -0
  76. {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/prompt_scorer.ipynb +0 -0
  77. {judgeval-0.0.22 → judgeval-0.0.24}/docs/notebooks/quickstart.ipynb +0 -0
  78. {judgeval-0.0.22 → judgeval-0.0.24}/docs/quickstart.mdx +0 -0
  79. {judgeval-0.0.22 → judgeval-0.0.24}/docs/snippets/snippet-intro.mdx +0 -0
  80. {judgeval-0.0.22 → judgeval-0.0.24}/pytest.ini +0 -0
  81. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/__init__.py +0 -0
  82. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/clients.py +0 -0
  83. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/__init__.py +0 -0
  84. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/exceptions.py +0 -0
  85. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/logger.py +0 -0
  86. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/common/utils.py +0 -0
  87. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/constants.py +0 -0
  88. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/api_example.py +0 -0
  89. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/datasets/__init__.py +0 -0
  90. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/datasets/dataset.py +0 -0
  91. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/datasets/eval_dataset_client.py +0 -0
  92. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/example.py +0 -0
  93. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/result.py +0 -0
  94. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/data/scorer_data.py +0 -0
  95. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/evaluation_run.py +0 -0
  96. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/__init__.py +0 -0
  97. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/base_judge.py +0 -0
  98. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/litellm_judge.py +0 -0
  99. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/mixture_of_judges.py +0 -0
  100. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/together_judge.py +0 -0
  101. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judges/utils.py +0 -0
  102. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/judgment_client.py +0 -0
  103. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/rules.py +0 -0
  104. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/run_evaluation.py +0 -0
  105. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/__init__.py +0 -0
  106. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/api_scorer.py +0 -0
  107. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/base_scorer.py +0 -0
  108. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/exceptions.py +0 -0
  109. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/__init__.py +0 -0
  110. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/__init__.py +0 -0
  111. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_correctness.py +0 -0
  112. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_relevancy.py +0 -0
  113. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/comparison.py +0 -0
  114. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_precision.py +0 -0
  115. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_recall.py +0 -0
  116. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_relevancy.py +0 -0
  117. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/execution_order.py +0 -0
  118. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/faithfulness.py +0 -0
  119. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/groundedness.py +0 -0
  120. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/hallucination.py +0 -0
  121. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/instruction_adherence.py +0 -0
  122. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/json_correctness.py +0 -0
  123. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/api_scorers/summarization.py +0 -0
  124. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/classifiers/__init__.py +0 -0
  125. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/classifiers/text2sql/__init__.py +0 -0
  126. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/classifiers/text2sql/text2sql_scorer.py +0 -0
  127. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/__init__.py +0 -0
  128. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/__init__.py +0 -0
  129. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/answer_correctness_scorer.py +0 -0
  130. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/prompts.py +0 -0
  131. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/__init__.py +0 -0
  132. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/answer_relevancy_scorer.py +0 -0
  133. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/prompts.py +0 -0
  134. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/comparison/__init__.py +0 -0
  135. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/comparison/comparison_scorer.py +0 -0
  136. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/comparison/prompts.py +0 -0
  137. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/__init__.py +0 -0
  138. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/contextual_precision_scorer.py +0 -0
  139. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/prompts.py +0 -0
  140. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/__init__.py +0 -0
  141. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/contextual_recall_scorer.py +0 -0
  142. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/prompts.py +0 -0
  143. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/__init__.py +0 -0
  144. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/contextual_relevancy_scorer.py +0 -0
  145. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/prompts.py +0 -0
  146. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/execution_order/__init__.py +0 -0
  147. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/execution_order/execution_order.py +0 -0
  148. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/__init__.py +0 -0
  149. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/faithfulness_scorer.py +0 -0
  150. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/prompts.py +0 -0
  151. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/__init__.py +0 -0
  152. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/hallucination_scorer.py +0 -0
  153. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/prompts.py +0 -0
  154. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/instruction_adherence/instruction_adherence.py +0 -0
  155. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/instruction_adherence/prompt.py +0 -0
  156. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/json_correctness/__init__.py +0 -0
  157. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/json_correctness/json_correctness_scorer.py +0 -0
  158. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/__init__.py +0 -0
  159. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/prompts.py +0 -0
  160. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/summarization_scorer.py +0 -0
  161. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/prompt_scorer.py +0 -0
  162. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/score.py +0 -0
  163. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/scorers/utils.py +0 -0
  164. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/tracer/__init__.py +0 -0
  165. {judgeval-0.0.22 → judgeval-0.0.24}/src/judgeval/utils/alerts.py +0 -0
@@ -0,0 +1,156 @@
1
+ Metadata-Version: 2.4
2
+ Name: judgeval
3
+ Version: 0.0.24
4
+ Summary: Judgeval Package
5
+ Project-URL: Homepage, https://github.com/JudgmentLabs/judgeval
6
+ Project-URL: Issues, https://github.com/JudgmentLabs/judgeval/issues
7
+ Author-email: Andrew Li <andrew@judgmentlabs.ai>, Alex Shan <alex@judgmentlabs.ai>, Joseph Camyre <joseph@judgmentlabs.ai>
8
+ License-Expression: Apache-2.0
9
+ License-File: LICENSE.md
10
+ Classifier: Operating System :: OS Independent
11
+ Classifier: Programming Language :: Python :: 3
12
+ Requires-Python: >=3.11
13
+ Requires-Dist: anthropic
14
+ Requires-Dist: fastapi
15
+ Requires-Dist: langchain
16
+ Requires-Dist: langchain-anthropic
17
+ Requires-Dist: langchain-core
18
+ Requires-Dist: langchain-huggingface
19
+ Requires-Dist: langchain-openai
20
+ Requires-Dist: litellm
21
+ Requires-Dist: nest-asyncio
22
+ Requires-Dist: openai
23
+ Requires-Dist: openpyxl
24
+ Requires-Dist: pandas
25
+ Requires-Dist: pika
26
+ Requires-Dist: python-dotenv==1.0.1
27
+ Requires-Dist: requests
28
+ Requires-Dist: supabase
29
+ Requires-Dist: together
30
+ Requires-Dist: uvicorn
31
+ Provides-Extra: dev
32
+ Requires-Dist: pytest-asyncio>=0.25.0; extra == 'dev'
33
+ Requires-Dist: pytest-mock>=3.14.0; extra == 'dev'
34
+ Requires-Dist: pytest>=8.3.4; extra == 'dev'
35
+ Requires-Dist: tavily-python; extra == 'dev'
36
+ Description-Content-Type: text/markdown
37
+
38
+ # Judgeval SDK
39
+
40
+ Judgeval is an open-source framework for building evaluation pipelines for multi-step agent workflows, supporting both real-time and experimental evaluation setups. To learn more about Judgment or sign up for free, visit our [website](https://www.judgmentlabs.ai/) or check out our [developer docs](https://judgment.mintlify.app/getting_started).
41
+
42
+ ## Features
43
+
44
+ - **Development and Production Evaluation Layer**: Offers a robust evaluation layer for multi-step agent applications, including unit-testing and performance monitoring.
45
+ - **Plug-and-Evaluate**: Integrate LLM systems with 10+ research-backed metrics, including:
46
+ - Hallucination detection
47
+ - RAG retriever quality
48
+ - And more
49
+ - **Custom Evaluation Pipelines**: Construct powerful custom evaluation pipelines tailored for your LLM systems.
50
+ - **Monitoring in Production**: Utilize state-of-the-art real-time evaluation foundation models to monitor LLM systems effectively.
51
+
52
+ ## Installation
53
+
54
+ ```bash
55
+ pip install judgeval
56
+ ```
57
+
58
+ ## Quickstart: Evaluations
59
+
60
+ You can evaluate your workflow execution data to measure quality metrics such as hallucination.
61
+
62
+ Create a file named `evaluate.py` with the following code:
63
+
64
+ ```python
65
+ from judgeval import JudgmentClient
66
+ from judgeval.data import Example
67
+ from judgeval.scorers import FaithfulnessScorer
68
+
69
+ client = JudgmentClient()
70
+
71
+ example = Example(
72
+ input="What if these shoes don't fit?",
73
+ actual_output="We offer a 30-day full refund at no extra cost.",
74
+ retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
75
+ )
76
+
77
+ scorer = FaithfulnessScorer(threshold=0.5)
78
+ results = client.run_evaluation(
79
+ examples=[example],
80
+ scorers=[scorer],
81
+ model="gpt-4o",
82
+ )
83
+ print(results)
84
+ ```
85
+ Click [here](https://judgment.mintlify.app/getting_started#create-your-first-experiment) for a more detailed explanation
86
+
87
+ ## Quickstart: Traces
88
+
89
+ Track your workflow execution for full observability with just a few lines of code.
90
+
91
+ Create a file named `traces.py` with the following code:
92
+
93
+ ```python
94
+ from judgeval.common.tracer import Tracer, wrap
95
+ from openai import OpenAI
96
+
97
+ client = wrap(OpenAI())
98
+ judgment = Tracer(project_name="my_project")
99
+
100
+ @judgment.observe(span_type="tool")
101
+ def my_tool():
102
+ return "Hello world!"
103
+
104
+ @judgment.observe(span_type="function")
105
+ def main():
106
+ task_input = my_tool()
107
+ res = client.chat.completions.create(
108
+ model="gpt-4o",
109
+ messages=[{"role": "user", "content": f"{task_input}"}]
110
+ )
111
+ return res.choices[0].message.content
112
+ ```
113
+ Click [here](https://judgment.mintlify.app/getting_started#create-your-first-trace) for a more detailed explanation
114
+
115
+ ## Quickstart: Online Evaluations
116
+
117
+ Apply performance monitoring to measure the quality of your systems in production, not just on historical data.
118
+
119
+ Using the same traces.py file we created earlier:
120
+
121
+ ```python
122
+ from judgeval.common.tracer import Tracer, wrap
123
+ from judgeval.scorers import AnswerRelevancyScorer
124
+ from openai import OpenAI
125
+
126
+ client = wrap(OpenAI())
127
+ judgment = Tracer(project_name="my_project")
128
+
129
+ @judgment.observe(span_type="tool")
130
+ def my_tool():
131
+ return "Hello world!"
132
+
133
+ @judgment.observe(span_type="function")
134
+ def main():
135
+ task_input = my_tool()
136
+ res = client.chat.completions.create(
137
+ model="gpt-4o",
138
+ messages=[{"role": "user", "content": f"{task_input}"}]
139
+ ).choices[0].message.content
140
+
141
+ judgment.get_current_trace().async_evaluate(
142
+ scorers=[AnswerRelevancyScorer(threshold=0.5)],
143
+ input=task_input,
144
+ actual_output=res,
145
+ model="gpt-4o"
146
+ )
147
+
148
+ return res
149
+ ```
150
+ Click [here](https://judgment.mintlify.app/getting_started#create-your-first-online-evaluation) for a more detailed explanation
151
+
152
+ ## Documentation and Demos
153
+
154
+ For more detailed documentation, please check out our [docs](https://judgment.mintlify.app/getting_started) and some of our [demo videos](https://www.youtube.com/@AlexShan-j3o) for reference!
155
+
156
+ ##
@@ -0,0 +1,119 @@
1
+ # Judgeval SDK
2
+
3
+ Judgeval is an open-source framework for building evaluation pipelines for multi-step agent workflows, supporting both real-time and experimental evaluation setups. To learn more about Judgment or sign up for free, visit our [website](https://www.judgmentlabs.ai/) or check out our [developer docs](https://judgment.mintlify.app/getting_started).
4
+
5
+ ## Features
6
+
7
+ - **Development and Production Evaluation Layer**: Offers a robust evaluation layer for multi-step agent applications, including unit-testing and performance monitoring.
8
+ - **Plug-and-Evaluate**: Integrate LLM systems with 10+ research-backed metrics, including:
9
+ - Hallucination detection
10
+ - RAG retriever quality
11
+ - And more
12
+ - **Custom Evaluation Pipelines**: Construct powerful custom evaluation pipelines tailored for your LLM systems.
13
+ - **Monitoring in Production**: Utilize state-of-the-art real-time evaluation foundation models to monitor LLM systems effectively.
14
+
15
+ ## Installation
16
+
17
+ ```bash
18
+ pip install judgeval
19
+ ```
20
+
21
+ ## Quickstart: Evaluations
22
+
23
+ You can evaluate your workflow execution data to measure quality metrics such as hallucination.
24
+
25
+ Create a file named `evaluate.py` with the following code:
26
+
27
+ ```python
28
+ from judgeval import JudgmentClient
29
+ from judgeval.data import Example
30
+ from judgeval.scorers import FaithfulnessScorer
31
+
32
+ client = JudgmentClient()
33
+
34
+ example = Example(
35
+ input="What if these shoes don't fit?",
36
+ actual_output="We offer a 30-day full refund at no extra cost.",
37
+ retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
38
+ )
39
+
40
+ scorer = FaithfulnessScorer(threshold=0.5)
41
+ results = client.run_evaluation(
42
+ examples=[example],
43
+ scorers=[scorer],
44
+ model="gpt-4o",
45
+ )
46
+ print(results)
47
+ ```
48
+ Click [here](https://judgment.mintlify.app/getting_started#create-your-first-experiment) for a more detailed explanation
49
+
50
+ ## Quickstart: Traces
51
+
52
+ Track your workflow execution for full observability with just a few lines of code.
53
+
54
+ Create a file named `traces.py` with the following code:
55
+
56
+ ```python
57
+ from judgeval.common.tracer import Tracer, wrap
58
+ from openai import OpenAI
59
+
60
+ client = wrap(OpenAI())
61
+ judgment = Tracer(project_name="my_project")
62
+
63
+ @judgment.observe(span_type="tool")
64
+ def my_tool():
65
+ return "Hello world!"
66
+
67
+ @judgment.observe(span_type="function")
68
+ def main():
69
+ task_input = my_tool()
70
+ res = client.chat.completions.create(
71
+ model="gpt-4o",
72
+ messages=[{"role": "user", "content": f"{task_input}"}]
73
+ )
74
+ return res.choices[0].message.content
75
+ ```
76
+ Click [here](https://judgment.mintlify.app/getting_started#create-your-first-trace) for a more detailed explanation
77
+
78
+ ## Quickstart: Online Evaluations
79
+
80
+ Apply performance monitoring to measure the quality of your systems in production, not just on historical data.
81
+
82
+ Using the same traces.py file we created earlier:
83
+
84
+ ```python
85
+ from judgeval.common.tracer import Tracer, wrap
86
+ from judgeval.scorers import AnswerRelevancyScorer
87
+ from openai import OpenAI
88
+
89
+ client = wrap(OpenAI())
90
+ judgment = Tracer(project_name="my_project")
91
+
92
+ @judgment.observe(span_type="tool")
93
+ def my_tool():
94
+ return "Hello world!"
95
+
96
+ @judgment.observe(span_type="function")
97
+ def main():
98
+ task_input = my_tool()
99
+ res = client.chat.completions.create(
100
+ model="gpt-4o",
101
+ messages=[{"role": "user", "content": f"{task_input}"}]
102
+ ).choices[0].message.content
103
+
104
+ judgment.get_current_trace().async_evaluate(
105
+ scorers=[AnswerRelevancyScorer(threshold=0.5)],
106
+ input=task_input,
107
+ actual_output=res,
108
+ model="gpt-4o"
109
+ )
110
+
111
+ return res
112
+ ```
113
+ Click [here](https://judgment.mintlify.app/getting_started#create-your-first-online-evaluation) for a more detailed explanation
114
+
115
+ ## Documentation and Demos
116
+
117
+ For more detailed documentation, please check out our [docs](https://judgment.mintlify.app/getting_started) and some of our [demo videos](https://www.youtube.com/@AlexShan-j3o) for reference!
118
+
119
+ ##
@@ -3,19 +3,14 @@ title: Datasets
3
3
  ---
4
4
  ## Overview
5
5
  In most scenarios, you will have multiple `Example`s that you want to evaluate together.
6
- In `judgeval`, an evaluation dataset (`EvalDataset`) is a collection of `Example`s and/or `GroundTruthExample`s that you can scale evaluations across.
6
+ In `judgeval`, an evaluation dataset (`EvalDataset`) is a collection of `Example`s that you can scale evaluations across.
7
7
 
8
- <Note>
9
- A `GroundTruthExample` is a specific type of `Example` that do not require the `actual_output` field.
10
-
11
- This is useful for creating datasets that can be **dynamically updated at evaluation time** by running your workflow on the GroundTruthExamples to create Examples.
12
- </Note>
13
8
  ## Creating a Dataset
14
9
 
15
- Creating an `EvalDataset` is as simple as supplying a list of `Example`s and/or `GroundTruthExample`s.
10
+ Creating an `EvalDataset` is as simple as supplying a list of `Example`s.
16
11
 
17
12
  ```python create_dataset.py
18
- from judgeval.data import Example, GroundTruthExample
13
+ from judgeval.data import Example
19
14
  from judgeval.data.datasets import EvalDataset
20
15
 
21
16
  examples = [
@@ -23,25 +18,19 @@ examples = [
23
18
  Example(input="...", actual_output="..."),
24
19
  ...
25
20
  ]
26
- ground_truth_examples = [
27
- GroundTruthExample(input="..."),
28
- GroundTruthExample(input="..."),
29
- ...
30
- ]
21
+
31
22
 
32
23
  dataset = EvalDataset(
33
- examples=examples,
34
- ground_truth_examples=ground_truth_examples
24
+ examples=examples
35
25
  )
36
26
  ```
37
27
 
38
- You can also add `Example`s and `GroundTruthExample`s to an existing `EvalDataset` using the `add_example` and `add_ground_truth_example` methods.
28
+ You can also add `Example`s to an existing `EvalDataset` using the `add_example` method.
39
29
 
40
30
  ```python add_to_dataset.py
41
31
  ...
42
32
 
43
33
  dataset.add_example(Example(...))
44
- dataset.add_ground_truth(GroundTruthExample(...))
45
34
  ```
46
35
 
47
36
  ## Saving/Loading Datasets
@@ -81,12 +70,6 @@ You can save/load an `EvalDataset` with a JSON file. Your JSON file should have
81
70
  "actual_output": "..."
82
71
  },
83
72
  ...
84
- ],
85
- "ground_truths": [
86
- {
87
- "input": "..."
88
- },
89
- ...
90
73
  ]
91
74
  }
92
75
  ```
@@ -154,7 +137,7 @@ examples:
154
137
 
155
138
  ## Evaluate On Your Dataset
156
139
 
157
- You can use the `JudgmentClient` to evaluate the `Example`s and `GroundTruthExample`s in your dataset using scorers.
140
+ You can use the `JudgmentClient` to evaluate the `Example`s in your dataset using scorers.
158
141
 
159
142
  ```python evaluate_dataset.py
160
143
  ...
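As a quick illustration of the 0.0.24 dataset flow documented above (a minimal sketch only: `EvalDataset` now holds `Example`s exclusively, and since the `evaluate_dataset.py` snippet above is truncated in this diff, passing `dataset.examples` straight to `run_evaluation` is an assumption rather than the documented call):

```python
# Sketch of the Examples-only dataset flow in 0.0.24
# (GroundTruthExample and add_ground_truth_example were removed).
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.data.datasets import EvalDataset
from judgeval.scorers import AnswerRelevancyScorer

dataset = EvalDataset(
    examples=[
        Example(input="What is your return policy?", actual_output="30 days, full refund."),
        Example(input="Do you ship internationally?", actual_output="Yes, to most countries."),
    ]
)
dataset.add_example(Example(input="...", actual_output="..."))  # add_example is the remaining mutator

# Scoring the collected Examples through the JudgmentClient; dataset.examples
# is an assumed attribute, since the evaluate_dataset.py example above is cut off.
client = JudgmentClient()
results = client.run_evaluation(
    examples=dataset.examples,
    scorers=[AnswerRelevancyScorer(threshold=0.5)],
    model="gpt-4o",
)
print(results)
```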
@@ -4,14 +4,12 @@ title: Examples
4
4
 
5
5
  ## Overview
6
6
  An `Example` is a basic unit of data in `judgeval` that allows you to run evaluation scorers on your LLM system.
7
- An `Example` is composed of seven fields:
8
- - `input`
9
- - `actual_output`
10
- - [Optional] `expected_output`
11
- - [Optional] `retrieval_context`
12
- - [Optional] `context`
13
- - [Optional] `tools_called`
14
- - [Optional] `expected_tools`
7
+ An `Example` can be composed of a mixture of the following fields:
8
+ - `input` [Optional]
9
+ - `actual_output` [Optional]
10
+ - `expected_output` [Optional]
11
+ - `retrieval_context` [Optional]
12
+ - `context` [Optional]
15
13
 
16
14
  **Here's a sample of creating an `Example`:**
17
15
 
@@ -24,8 +22,6 @@ example = Example(
24
22
  expected_output="Bill Gates and Paul Allen founded Microsoft in New Mexico in 1975.",
25
23
  retrieval_context=["Bill Gates co-founded Microsoft with Paul Allen in 1975."],
26
24
  context=["Bill Gates and Paul Allen are the founders of Microsoft."],
27
- tools_called=["Google Search"],
28
- expected_tools=["Google Search", "Perplexity"],
29
25
  )
30
26
  ```
31
27
 
@@ -39,7 +35,7 @@ Other fields are optional and depend on the type of evaluation. If you want to d
39
35
 
40
36
  ## Example Fields
41
37
 
42
- Here, we cover the seven fields that make up an `Example`.
38
+ Here, we cover the possible fields that make up an `Example`.
43
39
 
44
40
  ### Input
45
41
  The `input` field represents a sample interaction between a user and your LLM system. The input should represent the direct input to your prompt template(s), and **SHOULD NOT CONTAIN** your prompt template itself.
@@ -137,48 +133,6 @@ example = Example(
137
133
  )
138
134
  ```
139
135
 
140
- <Note>
141
- `context` is the ideal retrieval result for a specific `input`, whereas `retrieval_context` is the actual retrieval result at runtime. While they are similar, they are not always interchangeable.
142
- </Note>
143
- ### Tools Called
144
-
145
- The `tools_called` field is `Optional[List[str]]` and represents the tools that were called by the LLM system. This is particularly useful for evaluating whether agents are properly using tools available to them.
146
-
147
- ```python tools_called.py
148
- # Sample app implementation
149
- import medical_chatbot
150
-
151
- question = "Is sparkling water healthy?"
152
- example = Example(
153
- input=question,
154
- actual_output=medical_chatbot.chat(question),
155
- expected_output="Sparkling water is neither healthy nor unhealthy.",
156
- context=["Sparkling water is a type of water that is carbonated."],
157
- retrieval_context=["Sparkling water is carbonated and has no calories."],
158
- tools_called=["Perplexity", "GoogleSearch"]
159
- )
160
- ```
161
-
162
- ### Expected Tools
163
-
164
- The `expected_tools` field is `Optional[List[str]]` and represents the tools that are expected to be called by the LLM system. This is particularly useful for evaluating whether agents are properly using tools available to them.
165
-
166
- ```python expected_tools.py
167
- # Sample app implementation
168
- import medical_chatbot
169
-
170
- question = "Is sparkling water healthy?"
171
- example = Example(
172
- input=question,
173
- actual_output=medical_chatbot.chat(question),
174
- expected_output="Sparkling water is neither healthy nor unhealthy.",
175
- context=["Sparkling water is a type of water that is carbonated."],
176
- retrieval_context=["Sparkling water is carbonated and has no calories."],
177
- tools_called=["Perplexity", "GoogleSearch"],
178
- expected_tools=["Perplexity", "DBQuery"]
179
- )
180
- ```
181
-
182
136
  ## Conclusion
183
137
 
184
138
  Congratulations! 🎉
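Since the hunk above makes every `Example` field optional, a short sketch of what that permits (field names are taken from the docs in this diff; the scorer you pair with an `Example` still dictates which fields must actually be present):

```python
from judgeval.data import Example

# With every field optional in 0.0.24, an Example carries only what the
# intended scorer needs. A retrieval-grounded check wants retrieval_context:
rag_example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
)

# A reference-based check can instead supply expected_output and skip retrieval:
reference_example = Example(
    input="When was Microsoft founded?",
    actual_output="Microsoft was founded in 1975.",
    expected_output="Bill Gates and Paul Allen founded Microsoft in New Mexico in 1975.",
)
```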
@@ -116,9 +116,9 @@ class SampleScorer(JudgevalScorer):
116
116
  ```
117
117
 
118
118
 
119
- ### 4. Implement the `success_check()` method
119
+ ### 4. Implement the `_success_check()` method
120
120
 
121
- When executing an evaluation run, `judgeval` will check if your scorer has passed the `success_check()` method.
121
+ When executing an evaluation run, `judgeval` will check if your scorer has passed the `_success_check()` method.
122
122
 
123
123
  You can implement this method in any way you want, but **it should return a `bool`.** Here's a perfectly valid implementation:
124
124
 
@@ -126,7 +126,7 @@ You can implement this method in any way you want, but **it should return a `boo
126
126
  class SampleScorer(JudgevalScorer):
127
127
  ...
128
128
 
129
- def success_check(self, example):
129
+ def _success_check(self):
130
130
  if self.error is not None:
131
131
  return False
132
132
  return self.score >= self.threshold # or you can do self.success if set
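To show the renamed `_success_check()` hook in context, here is a compact end-to-end `JudgevalScorer` sketch. The import path and the base-class constructor arguments are not shown in this diff, so both are assumptions, and the scorer itself is a toy:

```python
from judgeval.data import Example
from judgeval.scorers import JudgevalScorer  # import path assumed; see judgeval_scorer.py in this diff


class ConcisenessScorer(JudgevalScorer):
    """Toy non-LLM scorer: rewards answers under a word budget."""

    def __init__(self, threshold: float = 0.5, max_words: int = 100):
        # The JudgevalScorer constructor signature is assumed here.
        super().__init__(score_type="Conciseness", threshold=threshold)
        self.max_words = max_words

    def score_example(self, example: Example) -> float:
        self.score = 1.0 if len(example.actual_output.split()) <= self.max_words else 0.0
        return self.score

    async def a_score_example(self, example: Example) -> float:
        # The docs allow reusing the synchronous logic for the async variant.
        return self.score_example(example)

    def _success_check(self) -> bool:
        # Renamed from success_check(example) per the hunk above.
        if self.error is not None:
            return False
        return self.score >= self.threshold
```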
@@ -0,0 +1,65 @@
1
+ ---
2
+ title: Groundedness
3
+ description: ""
4
+ ---
5
+
6
+ The `Groundedness` scorer is a default LLM judge scorer that measures whether the `actual_output` is aligned with both the task instructions in `input` and the knowledge base in `retrieval_context`.
7
+ In practice, this scorer helps determine if your RAG pipeline's generator is producing hallucinations or misinterpreting task instructions.
8
+
9
+ **For optimal Groundedness scoring, check out our leading evaluation foundation model research here! TODO add link here.**
10
+
11
+ <Note>
12
+ The `Groundedness` scorer is a binary metric (1 or 0) that evaluates both instruction adherence and factual accuracy.
13
+
14
+ Unlike the `Faithfulness` scorer which measures the degree of contradiction with retrieval context, `Groundedness` provides a pass/fail assessment based on both the task instructions and knowledge base.
15
+ </Note>
16
+
17
+ ## Required Fields
18
+
19
+ To run the `Groundedness` scorer, you must include the following fields in your `Example`:
20
+ - `input`
21
+ - `actual_output`
22
+ - `retrieval_context`
23
+
24
+ ## Scorer Breakdown
25
+
26
+ `Groundedness` scores are binary (1 or 0) and determined by checking:
27
+ 1. Whether the `actual_output` correctly interprets the task instructions in `input`
28
+ 2. Whether the `actual_output` contains any contradictions with the knowledge base in `retrieval_context`
29
+
30
+ A response is considered grounded (score = 1) only if it:
31
+ - Correctly follows the task instructions
32
+ - Does not contradict any information in the knowledge base
33
+ - Does not introduce hallucinated facts not supported by the retrieval context
34
+
35
+ If there are any contradictions or misinterpretations, the scorer will fail (score = 0).
36
+
37
+ ## Sample Implementation
38
+
39
+ ```python groundedness.py
40
+ from judgeval import JudgmentClient
41
+ from judgeval.data import Example
42
+ from judgeval.scorers import GroundednessScorer
43
+
44
+ client = JudgmentClient()
45
+ example = Example(
46
+ input="You are a helpful assistant for a clothing store. Make sure to follow the company's policies surrounding returns.",
47
+ actual_output="We offer a 30-day return policy for all items, including socks!",
48
+ retrieval_context=["Return policy, all items: 30-day limit for full refund, no questions asked."]
49
+ )
50
+ scorer = GroundednessScorer()
51
+
52
+ results = client.run_evaluation(
53
+ examples=[example],
54
+ scorers=[scorer],
55
+ model="gpt-4o",
56
+ )
57
+ print(results)
58
+ ```
59
+
60
+ <Note>
61
+ The `Groundedness` scorer uses an LLM judge, so you'll receive a reason for the score in the `reason` field of the results.
62
+ This allows you to double-check the accuracy of the evaluation and understand how the score was calculated.
63
+ </Note>
64
+
65
+
@@ -12,11 +12,14 @@ Scorers act as measurement tools for evaluating LLM systems based on specific cr
12
12
  - [Contextual Precision](/evaluation/scorers/contextual_precision)
13
13
  - [Contextual Recall](/evaluation/scorers/contextual_recall)
14
14
  - [Contextual Relevancy](/evaluation/scorers/contextual_relevancy)
15
+ - [Execution Order](/evaluation/scorers/execution_order)
15
16
  - [Faithfulness](/evaluation/scorers/faithfulness)
16
17
  - [Hallucination](/evaluation/scorers/hallucination)
17
- - [Summarization](/evaluation/scorers/summarization)
18
- - [Execution Order](/evaluation/scorers/execution_order)
19
18
  - [JSON Correctness](/evaluation/scorers/json_correctness)
19
+ - [Summarization](/evaluation/scorers/summarization)
20
+
21
+ We also understand that you may need to evaluate your LLM system with metrics that are not covered by our default scorers.
22
+ To support this, we provide a flexible framework for creating these scorers:
20
23
  - [Custom Scorers](/evaluation/scorers/custom_scorers)
21
24
  - [Classifier Scorers](/evaluation/scorers/classifier_scorer)
22
25
 
@@ -24,13 +27,9 @@ Scorers act as measurement tools for evaluating LLM systems based on specific cr
24
27
  We're always adding new scorers to `judgeval`. If you have a suggestion, please [let us know](mailto:contact@judgmentlabs.ai)!
25
28
  </Tip>
26
29
 
27
- Scorers execute on `Example`s, `GroundTruthExample`s, and `EvalDataset`s, producing a **score between 0 and 1**.
30
+ Scorers execute on `Example`s and `EvalDataset`s, producing a **numerical score**.
28
31
  This enables you to **use evaluations as unit tests** by setting a `threshold` to determine whether an evaluation was successful or not.
29
32
 
30
- <Note>
31
- Built-in scorers will succeed if the score is greater than or equal to the `threshold`.
32
- </Note>
33
-
34
33
  ## Categories of Scorers
35
34
  `judgeval` supports three categories of scorers.
36
35
  - **Default Scorers**: built-in scorers that are ready to use
@@ -57,17 +56,17 @@ If you find that none of the default scorers meet your evaluation needs, setting
57
56
  You can create a custom scorer by inheriting from the `JudgevalScorer` class and implementing three methods:
58
57
  - `score_example()`: produces a score for a single `Example`.
59
58
  - `a_score_example()`: async version of `score_example()`. You may use the same implementation logic as `score_example()`.
60
- - `success_check()`: determines whether an evaluation was successful.
59
+ - `_success_check()`: determines whether an evaluation was successful.
61
60
 
62
61
  Custom scorers can be as simple or complex as you want, and **do not need to use LLMs**.
63
- For sample implementations, check out the `JudgevalScorer` [documentation page](/evaluation/scorers/custom_scorers).
62
+ For sample implementations, check out the [Custom Scorers](/evaluation/scorers/custom_scorers) documentation page.
64
63
 
65
64
 
66
65
  ### Classifier Scorers
67
66
 
68
67
  Classifier scorers are a special type of custom scorer that can evaluate your LLM system using natural language criteria.
69
68
 
70
- TODO update this section when SDK is updated
69
+ They can be defined either with our judgeval SDK or directly on the Judgment Platform. For more information, check out the [Classifier Scorers](/evaluation/scorers/classifier_scorer) documentation page.
71
70
 
72
71
  ## Running Scorers
73
72
 
@@ -80,22 +79,10 @@ client = JudgmentClient()
80
79
  results = client.run_evaluation(
81
80
  examples=[example],
82
81
  scorers=[scorer],
83
- model="gpt-4o-mini",
82
+ model="gpt-4o",
84
83
  )
85
84
  ```
86
85
 
87
- If you want to execute a `JudgevalScorer` without running it through the `JudgmentClient`, you can score locally.
88
- Simply use the `score_example()` or `a_score_example()` method directly:
89
-
90
- ```python direct_scoring.py
91
- ...
92
-
93
- example = Example(input="...", actual_output="...")
94
-
95
- scorer = JudgevalScorer() # Your scorer here
96
- score = scorer.score_example(example)
97
- ```
98
-
99
86
  <Tip>
100
87
  To learn about how a certain default scorer works, check out its documentation page for a deep dive into how scores are calculated and what fields are required.
101
88
  </Tip>
@@ -62,7 +62,7 @@ Congratulations! Your evaluation should have passed. Let's break down what happe
62
62
  - The variable `input` mimics a user input and `actual_output` is a placeholder for what your LLM system returns based on the input.
63
63
  - The variable `retrieval_context` represents the retrieved context from your RAG knowledge base.
64
64
  - `FaithfulnessScorer(threshold=0.5)` is a scorer that checks if the output is hallucinated relative to the retrieved context.
65
- - <Note>All scorers produce values betweeen 0 - 1; the threshold is used in the context of [unit testing](/evaluation/unit_testing).</Note>
65
+ - <Note>The threshold is used in the context of [unit testing](/evaluation/unit_testing).</Note>
66
66
  - We chose `gpt-4o` as our judge model to measure faithfulness. Judgment Labs offers ANY judge model for your evaluation needs.
67
67
  Consider trying out our state-of-the-art Osiris judge models for your next evaluation!
68
68
 
@@ -142,7 +142,7 @@ def main():
142
142
  messages=[{"role": "user", "content": f"{task_input}"}]
143
143
  ).choices[0].message.content
144
144
 
145
- judgment.get_current_trace().async_evaluate(
145
+ judgment.async_evaluate(
146
146
  scorers=[AnswerRelevancyScorer(threshold=0.5)],
147
147
  input=task_input,
148
148
  actual_output=res,
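For readers tracking this change, a condensed sketch of the updated online-evaluation pattern (mirroring the getting-started snippet above; the prompt is a placeholder), where `async_evaluate` is now called on the `Tracer` itself rather than on the current trace:

```python
from judgeval.common.tracer import Tracer, wrap
from judgeval.scorers import AnswerRelevancyScorer
from openai import OpenAI

client = wrap(OpenAI())
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="function")
def main():
    task_input = "Hello world!"  # placeholder prompt
    res = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": task_input}],
    ).choices[0].message.content

    # 0.0.24 (per this hunk): call async_evaluate on the Tracer directly.
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],
        input=task_input,
        actual_output=res,
        model="gpt-4o",
    )
    return res
```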
@@ -280,14 +280,10 @@ results = client.run_evaluation(
280
280
  # Create Your First Dataset
281
281
  In most cases, you will not be running evaluations on a single example; instead, you will be scoring your LLM system on a dataset.
282
282
  Judgeval allows you to create datasets, save them, and run evaluations on them.
283
- An `EvalDataset` is a collection of `Example`s and/or `GroundTruthExample`s.
284
-
285
- <Note>
286
- A `GroundTruthExample` is an `Example` that has no `actual_output` field since it will be generated at test time.
287
- </Note>
283
+ An `EvalDataset` is a collection of `Example`s.
288
284
 
289
285
  ```python create_dataset.py
290
- from judgeval.data import Example, GroundTruthExample, EvalDataset
286
+ from judgeval.data import Example, EvalDataset
291
287
 
292
288
  example1 = Example(input="...", actual_output="...")
293
289
  example2 = Example(input="...", actual_output="...")