judgeval 0.0.36__py3-none-any.whl → 0.0.38__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,57 @@
+ import yaml
+ from judgeval.common.logger import (
+     debug,
+     info,
+     error,
+     example_logging_context
+ )
+
+ from judgeval.data import Example
+
+
+ def add_from_yaml(file_path: str) -> list[Example]:
+     """
+     Loads examples from a YAML file and returns them as Example objects.
+
+     The YAML file is expected to be a dictionary with one key, "examples",
+     whose value is a list of dictionaries, where each dictionary represents
+     an example:
+
+     examples:
+       - input: "test input"
+         actual_output: "test output"
+         expected_output: "expected output"
+         context:
+           - "context1"
+           - "context2"
+         retrieval_context:
+           - "retrieval1"
+         additional_metadata:
+           key: "value"
+         tools_called:
+           - "tool1"
+         expected_tools:
+           - {tool_name: "tool1", parameters: {"query": "test query 1"}}
+           - {tool_name: "tool2", parameters: {"query": "test query 2"}}
+         name: "test example"
+         example_id: null
+         timestamp: "20241230_160117"
+         trace_id: "123"
+     """
+     debug(f"Loading dataset from YAML file: {file_path}")
+     try:
+         with open(file_path, "r") as file:
+             payload = yaml.safe_load(file)
+             if payload is None:
+                 raise ValueError("The YAML file is empty.")
+         examples = payload.get("examples", [])
+     except FileNotFoundError:
+         error(f"YAML file not found: {file_path}")
+         raise FileNotFoundError(f"The file {file_path} was not found.")
+     except yaml.YAMLError:
+         error(f"Invalid YAML file: {file_path}")
+         raise ValueError(f"The file {file_path} is not a valid YAML file.")
+
+     info(f"Added {len(examples)} examples from YAML")
+     new_examples = [Example(**e) for e in examples]
+     return new_examples
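To illustrate the new loader's behavior, here is a minimal stdlib-only sketch of its validation logic; a plain dict stands in for the `yaml.safe_load` result, and a hypothetical `ExampleStub` dataclass stands in for `judgeval.data.Example`:

```python
# Stdlib-only sketch: a dict stands in for the yaml.safe_load(...) payload,
# and ExampleStub is a hypothetical stand-in for judgeval.data.Example.
from dataclasses import dataclass, field

@dataclass
class ExampleStub:
    input: str = ""
    actual_output: str = ""
    expected_output: str = ""
    context: list = field(default_factory=list)

def examples_from_payload(payload):
    # Mirrors the loader: empty file -> ValueError, missing "examples" key -> [].
    if payload is None:
        raise ValueError("The YAML file is empty.")
    return [ExampleStub(**e) for e in payload.get("examples", [])]

payload = {"examples": [{"input": "test input", "actual_output": "test output"}]}
examples = examples_from_payload(payload)
print(len(examples), examples[0].input)  # → 1 test input
```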
@@ -0,0 +1,247 @@
+ Metadata-Version: 2.4
+ Name: judgeval
+ Version: 0.0.38
+ Summary: Judgeval Package
+ Project-URL: Homepage, https://github.com/JudgmentLabs/judgeval
+ Project-URL: Issues, https://github.com/JudgmentLabs/judgeval/issues
+ Author-email: Andrew Li <andrew@judgmentlabs.ai>, Alex Shan <alex@judgmentlabs.ai>, Joseph Camyre <joseph@judgmentlabs.ai>
+ License-Expression: Apache-2.0
+ License-File: LICENSE.md
+ Classifier: Operating System :: OS Independent
+ Classifier: Programming Language :: Python :: 3
+ Requires-Python: >=3.11
+ Requires-Dist: anthropic
+ Requires-Dist: boto3
+ Requires-Dist: google-genai
+ Requires-Dist: langchain-anthropic
+ Requires-Dist: langchain-core
+ Requires-Dist: langchain-huggingface
+ Requires-Dist: langchain-openai
+ Requires-Dist: litellm==1.38.12
+ Requires-Dist: nest-asyncio
+ Requires-Dist: openai
+ Requires-Dist: pandas
+ Requires-Dist: python-dotenv==1.0.1
+ Requires-Dist: requests
+ Requires-Dist: together
+ Description-Content-Type: text/markdown
+
+ <div align="center">
+
+ <img src="assets/logo-light.svg#gh-light-mode-only" alt="Judgment Logo" width="400" />
+ <img src="assets/logo-dark.svg#gh-dark-mode-only" alt="Judgment Logo" width="400" />
+
+ **Build monitoring & evaluation pipelines for complex agents**
+
+ <img src="assets/experiments_page.png" alt="Judgment Platform Experiments Page" width="800" />
+
+ <br>
+
+ ## [🌐 Landing Page](https://www.judgmentlabs.ai/) • [Twitter/X](https://x.com/JudgmentLabs) • [💼 LinkedIn](https://www.linkedin.com/company/judgmentlabs) • [📚 Docs](https://judgment.mintlify.app/getting_started) • [🚀 Demos](https://www.youtube.com/@AlexShan-j3o) • [🎮 Discord](https://discord.gg/taAufyhf)
+ </div>
+
+ ## Judgeval: open-source testing, monitoring, and optimization for AI agents
+
+ Judgeval offers robust tooling for evaluating and tracing LLM agent systems. It is dev-friendly and open-source (licensed under Apache 2.0).
+
+ Judgeval gets you started in five minutes, after which you'll be ready to use all of its features as your agent becomes more complex. Judgeval is natively connected to the [Judgment Platform](https://www.judgmentlabs.ai/) for free, and you can export your data and self-host at any time.
+
+ We support tracing agents built with LangGraph, OpenAI SDK, Anthropic, ... and allow custom eval integrations for any use case. Check out our quickstarts below or our [setup guide](https://judgment.mintlify.app/getting_started) to get started.
+
+ Judgeval is created and maintained by [Judgment Labs](https://judgmentlabs.ai/).
+
+ ## 📋 Table of Contents
+ * [✨ Features](#-features)
+   * [🔍 Tracing](#-tracing)
+   * [🧪 Evals](#-evals)
+   * [📡 Monitoring](#-monitoring)
+   * [📊 Datasets](#-datasets)
+   * [💡 Insights](#-insights)
+ * [🛠️ Installation](#️-installation)
+ * [🏁 Get Started](#-get-started)
+ * [🏢 Self-Hosting](#-self-hosting)
+ * [📚 Cookbooks](#-cookbooks)
+ * [⭐ Star Us on GitHub](#-star-us-on-github)
+ * [❤️ Contributors](#️-contributors)
+
+ <!-- Created by https://github.com/ekalinin/github-markdown-toc -->
+
+
+ ## ✨ Features
+
+ | | |
+ |:---|:---:|
+ | <h3>🔍 Tracing</h3>Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic): **tracking inputs/outputs, latency, and cost** at every step.<br><br>Online evals can be applied to traces to measure quality on production data in real-time.<br><br>Export trace data to the Judgment Platform or your own S3 buckets, {Parquet, JSON, YAML} files, or data warehouse.<br><br>**Useful for:**<br>• 🐛 Debugging agent runs <br>• 👤 Tracking user activity <br>• 🔬 Pinpointing performance bottlenecks| <p align="center"><img src="assets/trace_screenshot.png" alt="Tracing visualization" width="1200"/></p> |
+ | <h3>🧪 Evals</h3>15+ research-backed metrics including tool call accuracy, hallucinations, instruction adherence, and retrieval context recall.<br><br>Build custom evaluators that connect with our metric-tracking infrastructure. <br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 Experimental prompt testing<br>• 🛡️ Online guardrails <br><br> | <p align="center"><img src="assets/experiments_page.png" alt="Evaluation metrics" width="800"/></p> |
+ | <h3>📡 Monitoring</h3>Real-time performance tracking of your agents in production environments. **Track all your metrics in one place.**<br><br>Set up **Slack/email alerts** for critical metrics and receive notifications when thresholds are exceeded.<br><br> **Useful for:** <br>• 📉 Identifying degradation early <br>• 📈 Visualizing performance trends across versions and time | <p align="center"><img src="assets/monitoring_screenshot.png" alt="Monitoring Dashboard" width="1200"/></p> |
+ | <h3>📊 Datasets</h3>Export trace data or import external testcases to datasets hosted on Judgment's Platform. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations. <br><br> **Useful for:**<br>• 🔄 Scaled analysis for A/B tests <br>• 🗃️ Filtered collections of agent runtime data| <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |
+ | <h3>💡 Insights</h3>Cluster on your data to reveal common use cases and failure modes.<br><br>Trace failures to their exact source with Judgment's Osiris agent, which localizes errors to specific components for precise fixes.<br><br> **Useful for:**<br>• 🔮 Surfacing common inputs that lead to error<br>• 🤖 Investigating agent/user behavior for optimization <br>| <p align="center"><img src="assets/dataset_clustering_screenshot_dm.png" alt="Insights dashboard" width="1200"/></p> |
+
+ ## 🛠️ Installation
+
+ Get started with Judgeval by installing our SDK using pip:
+
+ ```bash
+ pip install judgeval
+ ```
+
+ Ensure you have your `JUDGMENT_API_KEY` and `JUDGMENT_ORG_ID` environment variables set to connect to the [Judgment platform](https://app.judgmentlabs.ai/).
+
+ **If you don't have keys, [create an account](https://app.judgmentlabs.ai/register) on the platform!**
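As a quick illustration of that setup, a minimal Python sketch with placeholder values (your real keys come from the platform; setting them in the shell or a `.env` file works equally well):

```python
import os

# Placeholder credentials -- substitute the values from your Judgment account.
os.environ["JUDGMENT_API_KEY"] = "your-api-key"
os.environ["JUDGMENT_ORG_ID"] = "your-org-id"

print(os.environ["JUDGMENT_API_KEY"])  # → your-api-key
```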
+
+ ## 🏁 Get Started
+
+ Here's how you can quickly start using Judgeval:
+
+ ### 🛰️ Tracing
+
+ Track your agent's execution with full observability in just a few lines of code.
+ Create a file named `traces.py` with the following code:
+
+ ```python
+ from judgeval.common.tracer import Tracer, wrap
+ from openai import OpenAI
+
+ client = wrap(OpenAI())
+ judgment = Tracer(project_name="my_project")
+
+ @judgment.observe(span_type="tool")
+ def my_tool():
+     return "What's the capital of the U.S.?"
+
+ @judgment.observe(span_type="function")
+ def main():
+     task_input = my_tool()
+     res = client.chat.completions.create(
+         model="gpt-4.1",
+         messages=[{"role": "user", "content": f"{task_input}"}]
+     )
+     return res.choices[0].message.content
+
+ main()
+ ```
+
+ [Click here](https://judgment.mintlify.app/getting_started#create-your-first-trace) for a more detailed explanation.
+
+ ### 📝 Offline Evaluations
+
+ You can evaluate your agent's execution to measure quality metrics such as hallucination.
+ Create a file named `evaluate.py` with the following code:
+
+ ```python
+ from judgeval import JudgmentClient
+ from judgeval.data import Example
+ from judgeval.scorers import FaithfulnessScorer
+
+ client = JudgmentClient()
+
+ example = Example(
+     input="What if these shoes don't fit?",
+     actual_output="We offer a 30-day full refund at no extra cost.",
+     retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
+ )
+
+ scorer = FaithfulnessScorer(threshold=0.5)
+ results = client.run_evaluation(
+     examples=[example],
+     scorers=[scorer],
+     model="gpt-4.1",
+ )
+ print(results)
+ ```
+
+ [Click here](https://judgment.mintlify.app/getting_started#create-your-first-experiment) for a more detailed explanation.
+
+ ### 📡 Online Evaluations
+
+ Attach performance monitoring to traces to measure the quality of your systems in production.
+
+ Using the same `traces.py` file we created earlier, modify the `main` function:
+
+ ```python
+ from judgeval.common.tracer import Tracer, wrap
+ from judgeval.scorers import AnswerRelevancyScorer
+ from openai import OpenAI
+
+ client = wrap(OpenAI())
+ judgment = Tracer(project_name="my_project")
+
+ @judgment.observe(span_type="tool")
+ def my_tool():
+     return "Hello world!"
+
+ @judgment.observe(span_type="function")
+ def main():
+     task_input = my_tool()
+     res = client.chat.completions.create(
+         model="gpt-4.1",
+         messages=[{"role": "user", "content": f"{task_input}"}]
+     ).choices[0].message.content
+
+     judgment.async_evaluate(
+         scorers=[AnswerRelevancyScorer(threshold=0.5)],
+         input=task_input,
+         actual_output=res,
+         model="gpt-4.1"
+     )
+     print("Online evaluation submitted.")
+     return res
+
+ main()
+ ```
+
+ [Click here](https://judgment.mintlify.app/getting_started#create-your-first-online-evaluation) for a more detailed explanation.
+
+ ## 🏢 Self-Hosting
+
+ Run Judgment on your own infrastructure: we provide comprehensive self-hosting capabilities that give you full control over the backend and data plane that Judgeval interfaces with.
+
+ ### Key Features
+ * Deploy Judgment on your own AWS account
+ * Store data in your own Supabase instance
+ * Access Judgment through your own custom domain
+
+ ### Getting Started
+ 1. Check out our [self-hosting documentation](https://judgment.mintlify.app/self_hosting/get_started) for detailed setup instructions and details on how to access your self-hosted instance
+ 2. Use the [Judgment CLI](https://github.com/JudgmentLabs/judgment-cli) to deploy your self-hosted environment
+ 3. After your self-hosted instance is set up, make sure the `JUDGMENT_API_URL` environment variable is set to your self-hosted backend endpoint
+
+ ## 📚 Cookbooks
+
+ You can access our repo of cookbooks [here](https://github.com/JudgmentLabs/judgment-cookbook). Have your own? We're happy to feature it if you create a PR or message us on [Discord](https://discord.gg/taAufyhf).
+
+ Here are some highlights:
+
+ ### Sample Agents
+
+ #### 💰 [LangGraph Financial QA Agent](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/financial_agent/demo.py)
+ A LangGraph-based agent for financial queries, featuring RAG capabilities with a vector database for contextual data retrieval, with evaluation of its reasoning and data accuracy.
+
+ #### ✈️ [OpenAI Travel Agent](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/openai_travel_agent/agent.py)
+ A travel planning agent using OpenAI API calls, custom tool functions, and RAG with a vector database for up-to-date and contextual travel information. Evaluated for itinerary quality and information relevance.
+
+ ### Custom Evaluators
+
+ #### 🔍 [PII Detection](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/classifier_scorer/pii_checker.py)
+ Detects and evaluates leakage of Personally Identifiable Information (PII).
+
+ #### 📧 [Cold Email Generation](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/custom_scorers/cold_email_scorer.py)
+ Evaluates whether a cold email generator properly uses all relevant information about the target recipient.
+
+ ## ⭐ Star Us on GitHub
+
+ If you find Judgeval useful, please consider giving us a star on GitHub! Your support helps us grow our community and continue improving the product.
+
+
+ ## ❤️ Contributors
+
+ There are many ways to contribute to Judgeval:
+
+ - Submit [bug reports](https://github.com/JudgmentLabs/judgeval/issues) and [feature requests](https://github.com/JudgmentLabs/judgeval/issues)
+ - Review the documentation and submit [Pull Requests](https://github.com/JudgmentLabs/judgeval/pulls) to improve it
+ - Speak or write about Judgment and let us know!
+
+ <!-- Contributors collage -->
+ [![Contributors](https://contributors-img.web.app/image?repo=JudgmentLabs/judgeval)](https://github.com/JudgmentLabs/judgeval/graphs/contributors)
@@ -1,43 +1,43 @@
  judgeval/__init__.py,sha256=x9HWt4waJwJMAqTuJSg2MezF9Zg-macEjeU-ajbly-8,330
  judgeval/clients.py,sha256=6VQmEqmfCngUdS2MuPBIpHvtDFqOENm8-_BmMvjLyRQ,944
- judgeval/constants.py,sha256=Gc1xpft2BkFRUIjj-puCzILsG1EUOEs8V-bUWP9b1WM,5508
- judgeval/evaluation_run.py,sha256=WGzx-Ug2qhSmunFo8NrmSstBRsOUc5KpKq0Lc51rqsM,6739
- judgeval/judgment_client.py,sha256=slYLE80FqEIsqgShMtML4I64p-RrEfELbMgZnlXhxP0,22515
+ judgeval/constants.py,sha256=qemyUNf5G5-W6YQ9tNkxbFa7L7XR6cDtWCVFKRwT3TM,5519
+ judgeval/evaluation_run.py,sha256=V9xMyiJ7e9lqHRblaeeMh6oyx1MEtGwfSxYtbi-EeXY,6746
+ judgeval/judgment_client.py,sha256=ozNMDeM3lNnaNq4zY40x3z1TwHYL1e25BlxGnSYO0yw,23275
  judgeval/rules.py,sha256=jkh1cXXcUf8oRY7xJUZfcQBYWn_rjUW4GvrhRt15PeU,20265
- judgeval/run_evaluation.py,sha256=1G-KYNHowfMKTD5j3cDd4EuEme00AqZkn6wpP3zMKUo,30241
+ judgeval/run_evaluation.py,sha256=-7oiebkggP7lf6nVRxqDKE3QkuPSA0sAVkZl_n2nZtI,32437
  judgeval/version_check.py,sha256=bvJEidB7rAeXozoUbN9Yb97QOR_s2hgvpvj74jJ5HlY,943
  judgeval/common/__init__.py,sha256=7d24BRxtncpMj3AAJCj8RS7TqgjXmW777HVZH6-3sBs,289
  judgeval/common/exceptions.py,sha256=U-TxHLn7oVMezsMuoYouNDb2XuS8RCggfntYf5_6u4E,565
  judgeval/common/logger.py,sha256=KO75wWXCxhUHUMvLaTU31ZzOk6tkZBa7heQ7y0f-zFE,6062
  judgeval/common/s3_storage.py,sha256=W8wq9S7qJZdqdBR4sk3aEZ4K3-pz40DOoolOJrWs9Vo,3768
- judgeval/common/tracer.py,sha256=bkN0Jol0mNosJeEJMtjM54jJDhEYL3OSBtkS4FB1m8E,105461
- judgeval/common/utils.py,sha256=LUQV5JfDr6wj7xHAJoNq-gofNZ6mjXbeKrGKzBME1KM,33533
- judgeval/data/__init__.py,sha256=xuKx_KCVHGp6CXvQuVmKl3v7pJp-qDaz0NccKxwjtO0,481
+ judgeval/common/tracer.py,sha256=EkWkg2AsS5FIj2ffh912qZZ9ew5h3hu2rynPBDsMszw,80463
+ judgeval/common/utils.py,sha256=w1SjpDtB1DTJapFSAvLzr_a3gGI45iacEoxIUnQXx4Q,34087
+ judgeval/data/__init__.py,sha256=Q4WiIva20U_NgxGr-MU-9FWN_eFzUZBVgCsBmoo7IM8,501
  judgeval/data/custom_example.py,sha256=QRBqiRiZS8UgVeTRHY0r1Jzm6yAYsyg6zmHxQGxdiQs,739
- judgeval/data/example.py,sha256=cJrmPGLel_P2sy1UaRvuVSAi35EnA9XMR11Lhp4aDLo,5930
- judgeval/data/result.py,sha256=Gb9tiSDsk1amXgh0cFG6JmlW_BMKxS2kuTwNA0rrHjA,3184
+ judgeval/data/example.py,sha256=MD0rA9oNI4cyaRgz7I7EOKv0gD2dp22Q_5z-NWdFHhE,6891
+ judgeval/data/result.py,sha256=KfU9lhAKG_Xo2eGDm2uKVVRZpf177IDASg1cIwedJwE,3184
  judgeval/data/scorer_data.py,sha256=JVlaTx1EP2jw2gh3Vgx1CSEsvIFABAN26IquKyxwiJQ,3273
- judgeval/data/sequence.py,sha256=FmKVdzQP5VTujRCHDWk097MKRR-rJgbsdrxyCKee6tA,1994
- judgeval/data/sequence_run.py,sha256=RmYjfWKMWg-pcF5PLeiWfrhuDkjDZi5VEmAIEXN3Ib0,2104
+ judgeval/data/trace.py,sha256=IjL06YNElxTuJC0HrPUh69rtXkfkSpzDoZdNiXFUvwY,5043
+ judgeval/data/trace_run.py,sha256=G_OsHNK_nZzJKhtdiyWp7GFyyns5AOJZ956GM_4jXM0,2192
  judgeval/data/datasets/__init__.py,sha256=IdNKhQv9yYZ_op0rdBacrFaFVmiiYQ3JTzXzxOTsEVQ,176
- judgeval/data/datasets/dataset.py,sha256=RmZ28oyDPfRsCx4k5ftMscoq0M0LN78MW6ofTiM81BI,13134
- judgeval/data/datasets/eval_dataset_client.py,sha256=uirHpkpLOfygXIz0xKAGTPx1qjbBTzdLFQK6yyoZduU,17544
- judgeval/integrations/langgraph.py,sha256=7LpWDpb8wgOkeRJvlr2COvF_O1f01zm-cwsI5trKoiw,123150
+ judgeval/data/datasets/dataset.py,sha256=oU9hvZTifK2x8em3FhL3oIqgHOByfJWH6C_9rIKnL5g,12773
+ judgeval/data/datasets/eval_dataset_client.py,sha256=3RBfkaMrkudjnmY_qFwY4I-2mOPE3XK4WxkfSweLB-Q,15016
+ judgeval/integrations/langgraph.py,sha256=L9zPPWVLGL2HWuwHPqM5Kic4S7EfQ_Y1Y3YKBJNfGCA,23004
  judgeval/judges/__init__.py,sha256=6X7VSwrwsdxGBNxCyapVRWGghhKOy3MVxFNMQ62kCXM,308
  judgeval/judges/base_judge.py,sha256=ch_S7uBB7lyv44Lf1d7mIGFpveOO58zOkkpImKgd9_4,994
- judgeval/judges/litellm_judge.py,sha256=EIL58Teptv8DzZUO3yP2RDQCDq-aoBB6HPZzPdK6KTg,2424
- judgeval/judges/mixture_of_judges.py,sha256=IJoi4Twk8ze1CJWVEp69k6TSqTCTGrmVYQ0qdffer60,15549
+ judgeval/judges/litellm_judge.py,sha256=DhB6px9ELZL3gbMb2w4FkBliuTlaCVIcjE8v149G6NM,2425
+ judgeval/judges/mixture_of_judges.py,sha256=D97h8L-6saPwwppVwitrIdlMAjizzxGWeVOfNyVnXZA,15550
  judgeval/judges/together_judge.py,sha256=l00hhPerAZXg3oYBd8cyMtWsOTNt_0FIqoxhKJKQe3k,2302
- judgeval/judges/utils.py,sha256=9lvUxziGV86ISvVFxYBWc09TWFyAQgUTyPf_a9mD5Rs,2686
- judgeval/scorers/__init__.py,sha256=Mk-mWUt_gNpJqY_WIEuQynD6fxc34fWSRSuobMSrj94,1238
+ judgeval/judges/utils.py,sha256=vL-15_udU94JHUAiyrAvHAKMj6Fqypg01ek4YH5zVCM,2687
+ judgeval/scorers/__init__.py,sha256=-4GLkYiLKI_BxpoIfgadCFEUfqJcBWZLAtfrInjZT0Q,1282
  judgeval/scorers/api_scorer.py,sha256=NQ_CrrUPhSUk1k2Q8rKpCG_TU2FT32sFEqvb-Yi54B0,2688
  judgeval/scorers/exceptions.py,sha256=eGW5CuJgZ5YJBFrE4FHDSF651PO1dKAZ379mJ8gOsfo,178
  judgeval/scorers/judgeval_scorer.py,sha256=79-JJurqHP-qTaWNWInx4SjvQYwXc9lvfPPNgwsh2yA,6773
  judgeval/scorers/prompt_scorer.py,sha256=PaAs2qRolw1P3_I061Xvk9qzvF4O-JR8g_39RqXnHcM,17728
- judgeval/scorers/score.py,sha256=fZuaZPumqkLCWcZdpTn3bJeHPNHXaDqgyb0WBp2EYgE,18742
+ judgeval/scorers/score.py,sha256=m9luk5ZLeUCual5CpI-9ZR9nqR3eC9wJLVT87SFPN6g,18747
  judgeval/scorers/utils.py,sha256=iHQVTlIANbmCTXz9kTeSdOytgUZ_T74Re61ajqsk_WQ,6827
  judgeval/scorers/judgeval_scorers/__init__.py,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
- judgeval/scorers/judgeval_scorers/api_scorers/__init__.py,sha256=_sDUBxSG536KGqXNi6dFpaYKghjEAadxBxaaxV9HuuE,1764
+ judgeval/scorers/judgeval_scorers/api_scorers/__init__.py,sha256=QhHKpl6kNEXxuwriSEwQ5gIIxb7NeHZ1H_7SAZhQiQk,1872
  judgeval/scorers/judgeval_scorers/api_scorers/answer_correctness.py,sha256=Fnd9CVIOZ73sWEWymsU5eBrrZqPFjMZ0BKpeW-PDyTg,711
  judgeval/scorers/judgeval_scorers/api_scorers/answer_relevancy.py,sha256=oETeN9K0HSIRdL2SDqn82Vskpwh5SlKnZvs5VDm2OBU,658
  judgeval/scorers/judgeval_scorers/api_scorers/comparison.py,sha256=kuzf9OWvpY38yYSwlBgneLkUZwJNM4FQqvbS66keA90,1249
@@ -52,12 +52,14 @@ judgeval/scorers/judgeval_scorers/api_scorers/hallucination.py,sha256=k5gDOki-8K
  judgeval/scorers/judgeval_scorers/api_scorers/instruction_adherence.py,sha256=XnSGEkQfwVqaqnHEGMCsxNiHVzrsrej48uDbLoWc8CQ,678
  judgeval/scorers/judgeval_scorers/api_scorers/json_correctness.py,sha256=mMKEuR87_yanEuZJ5YSGFMHDD_oLVZ6-rQuciFaDOMA,1095
  judgeval/scorers/judgeval_scorers/api_scorers/summarization.py,sha256=QmWB8bVbDYHY5FcF0rYZE_3c2XXgMLRmR6aXJWfdMC4,655
+ judgeval/scorers/judgeval_scorers/api_scorers/tool_order.py,sha256=urm8LgkeZA7e-ePWo6AToKGheQYSp6MOpKon5NF5EJw,570
  judgeval/scorers/judgeval_scorers/classifiers/__init__.py,sha256=Qt81W5ZCwMvBAne0LfQDb8xvg5iOG1vEYP7WizgwAZo,67
  judgeval/scorers/judgeval_scorers/classifiers/text2sql/__init__.py,sha256=8iTzMvou1Dr8pybul6lZHKjc9Ye2-0_racRGYkhEdTY,74
  judgeval/scorers/judgeval_scorers/classifiers/text2sql/text2sql_scorer.py,sha256=ly72Z7s_c8NID6-nQnuW8qEGEW2MqdvpJ-5WfXzbAQg,2579
  judgeval/tracer/__init__.py,sha256=wy3DYpH8U_z0GO_K_gOSkK0tTTD-u5eLDo0T5xIBoAc,147
  judgeval/utils/alerts.py,sha256=O19Xj7DA0YVjl8PWiuH4zfdZeu3yiLVvHfY8ah2wG0g,2759
- judgeval-0.0.36.dist-info/METADATA,sha256=oexg66X9idECkevPAF2VkuQJBt-hYHvKmsZz5p5Y-LI,6097
- judgeval-0.0.36.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
- judgeval-0.0.36.dist-info/licenses/LICENSE.md,sha256=tKmCg7k5QOmxPK19XMfzim04QiQJPmgIm0pAn55IJwk,11352
- judgeval-0.0.36.dist-info/RECORD,,
+ judgeval/utils/data_utils.py,sha256=pB4GBWi8XoM2zSR2NlLXH5kqcQ029BVhDxaVKkdmiBY,1860
+ judgeval-0.0.38.dist-info/METADATA,sha256=jlCQMfdz2Ni9nRi9cOu5svHnLqIinll2odC37dqkE3U,11860
+ judgeval-0.0.38.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
+ judgeval-0.0.38.dist-info/licenses/LICENSE.md,sha256=tKmCg7k5QOmxPK19XMfzim04QiQJPmgIm0pAn55IJwk,11352
+ judgeval-0.0.38.dist-info/RECORD,,
judgeval/data/sequence.py DELETED
@@ -1,49 +0,0 @@
- from pydantic import BaseModel, Field, field_validator, model_validator
- from typing import List, Optional, Union, Any
- from judgeval.data.example import Example
- from judgeval.scorers import JudgevalScorer, APIJudgmentScorer
- from uuid import uuid4
- from datetime import datetime, timezone
-
- class Sequence(BaseModel):
-     """
-     A sequence is a list of either Examples or nested Sequence objects.
-     """
-     sequence_id: str = Field(default_factory=lambda: str(uuid4()))
-     name: Optional[str] = "Sequence"
-     created_at: str = Field(default_factory=lambda: datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S"))
-     items: List[Union["Sequence", Example]]
-     scorers: Optional[Any] = None
-     parent_sequence_id: Optional[str] = None
-     sequence_order: Optional[int] = 0
-     root_sequence_id: Optional[str] = None
-     inputs: Optional[str] = None
-     output: Optional[str] = None
-
-     @field_validator("scorers")
-     def validate_scorer(cls, v):
-         for scorer in v or []:
-             if not isinstance(scorer, APIJudgmentScorer) and not isinstance(scorer, JudgevalScorer):
-                 raise ValueError(f"Invalid scorer type: {type(scorer)}")
-         return v
-
-     @model_validator(mode="after")
-     def populate_sequence_metadata(self) -> "Sequence":
-         """Recursively set parent_sequence_id, root_sequence_id, and sequence_order."""
-         # If root_sequence_id isn't already set, assign it to self
-         if self.root_sequence_id is None:
-             self.root_sequence_id = self.sequence_id
-
-         for idx, item in enumerate(self.items):
-             item.sequence_order = idx
-             if isinstance(item, Sequence):
-                 item.parent_sequence_id = self.sequence_id
-                 item.root_sequence_id = self.root_sequence_id
-                 item.populate_sequence_metadata()
-         return self
-
-     class Config:
-         arbitrary_types_allowed = True
-
- # Update forward references so that "Sequence" inside items is resolved.
- Sequence.model_rebuild()
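For context, the recursive metadata propagation that this removed `Sequence` model performed can be sketched with plain dataclasses; `Node` is a hypothetical stand-in for the pydantic model, not part of the judgeval API:

```python
# Stdlib-only sketch of the deleted Sequence metadata propagation: every
# nested node inherits root_sequence_id from the top-level sequence,
# records its parent's id, and its index among its siblings.
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

_ids = count()

@dataclass
class Node:
    items: list = field(default_factory=list)
    sequence_id: int = field(default_factory=lambda: next(_ids))
    parent_sequence_id: Optional[int] = None
    root_sequence_id: Optional[int] = None
    sequence_order: int = 0

def populate(node: Node) -> Node:
    # Mirrors populate_sequence_metadata: the root points at itself.
    if node.root_sequence_id is None:
        node.root_sequence_id = node.sequence_id
    for idx, item in enumerate(node.items):
        item.sequence_order = idx
        if isinstance(item, Node):
            item.parent_sequence_id = node.sequence_id
            item.root_sequence_id = node.root_sequence_id
            populate(item)
    return node

root = populate(Node(items=[Node(), Node(items=[Node()])]))
```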
@@ -1,169 +0,0 @@
- Metadata-Version: 2.4
- Name: judgeval
- Version: 0.0.36
- Summary: Judgeval Package
- Project-URL: Homepage, https://github.com/JudgmentLabs/judgeval
- Project-URL: Issues, https://github.com/JudgmentLabs/judgeval/issues
- Author-email: Andrew Li <andrew@judgmentlabs.ai>, Alex Shan <alex@judgmentlabs.ai>, Joseph Camyre <joseph@judgmentlabs.ai>
- License-Expression: Apache-2.0
- License-File: LICENSE.md
- Classifier: Operating System :: OS Independent
- Classifier: Programming Language :: Python :: 3
- Requires-Python: >=3.11
- Requires-Dist: anthropic
- Requires-Dist: fastapi
- Requires-Dist: google-genai
- Requires-Dist: langchain
- Requires-Dist: langchain-anthropic
- Requires-Dist: langchain-core
- Requires-Dist: langchain-huggingface
- Requires-Dist: langchain-openai
- Requires-Dist: litellm==1.38.12
- Requires-Dist: nest-asyncio
- Requires-Dist: openai
- Requires-Dist: openpyxl
- Requires-Dist: pandas
- Requires-Dist: pika
- Requires-Dist: python-dotenv==1.0.1
- Requires-Dist: requests
- Requires-Dist: supabase
- Requires-Dist: together
- Requires-Dist: uvicorn
- Provides-Extra: dev
- Requires-Dist: pytest-asyncio>=0.25.0; extra == 'dev'
- Requires-Dist: pytest-mock>=3.14.0; extra == 'dev'
- Requires-Dist: pytest>=8.3.4; extra == 'dev'
- Requires-Dist: tavily-python; extra == 'dev'
- Description-Content-Type: text/markdown
-
- # Judgeval SDK
-
- Judgeval is an open-source framework for building evaluation pipelines for multi-step agent workflows, supporting both real-time and experimental evaluation setups. To learn more about Judgment or sign up for free, visit our [website](https://www.judgmentlabs.ai/) or check out our [developer docs](https://judgment.mintlify.app/getting_started).
-
- ## Features
-
- - **Development and Production Evaluation Layer**: Offers a robust evaluation layer for multi-step agent applications, including unit-testing and performance monitoring.
- - **Plug-and-Evaluate**: Integrate LLM systems with 10+ research-backed metrics, including:
-   - Hallucination detection
-   - RAG retriever quality
-   - And more
- - **Custom Evaluation Pipelines**: Construct powerful custom evaluation pipelines tailored for your LLM systems.
- - **Monitoring in Production**: Utilize state-of-the-art real-time evaluation foundation models to monitor LLM systems effectively.
-
- ## Installation
-
- ```bash
- pip install judgeval
- ```
-
- ## Quickstart: Evaluations
-
- You can evaluate your workflow execution data to measure quality metrics such as hallucination.
-
- Create a file named `evaluate.py` with the following code:
-
- ```python
- from judgeval import JudgmentClient
- from judgeval.data import Example
- from judgeval.scorers import FaithfulnessScorer
-
- client = JudgmentClient()
-
- example = Example(
-     input="What if these shoes don't fit?",
-     actual_output="We offer a 30-day full refund at no extra cost.",
-     retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
- )
-
- scorer = FaithfulnessScorer(threshold=0.5)
- results = client.run_evaluation(
-     examples=[example],
-     scorers=[scorer],
-     model="gpt-4o",
- )
- print(results)
- ```
- Click [here](https://judgment.mintlify.app/getting_started#create-your-first-experiment) for a more detailed explanation
-
- ## Quickstart: Traces
-
- Track your workflow execution for full observability with just a few lines of code.
-
- Create a file named `traces.py` with the following code:
-
- ```python
- from judgeval.common.tracer import Tracer, wrap
- from openai import OpenAI
-
- # Basic initialization
- client = wrap(OpenAI())
- judgment = Tracer(project_name="my_project")
-
- # Or with S3 storage enabled
- # NOTE: Make sure AWS creds correspond to an account with write access to the specified S3 bucket
- judgment = Tracer(
-     project_name="my_project",
-     use_s3=True,
-     s3_bucket_name="my-traces-bucket",  # Bucket created automatically if it doesn't exist
-     s3_aws_access_key_id="your-access-key",  # Optional: defaults to AWS_ACCESS_KEY_ID env var
-     s3_aws_secret_access_key="your-secret-key",  # Optional: defaults to AWS_SECRET_ACCESS_KEY env var
-     s3_region_name="us-west-1"  # Optional: defaults to AWS_REGION env var or "us-west-1"
- )
-
- @judgment.observe(span_type="tool")
- def my_tool():
-     return "Hello world!"
-
- @judgment.observe(span_type="function")
- def main():
-     task_input = my_tool()
-     res = client.chat.completions.create(
-         model="gpt-4o",
-         messages=[{"role": "user", "content": f"{task_input}"}]
-     )
-     return res.choices[0].message.content
- ```
- Click [here](https://judgment.mintlify.app/getting_started#create-your-first-trace) for a more detailed explanation
-
- ## Quickstart: Online Evaluations
-
- Apply performance monitoring to measure the quality of your systems in production, not just on historical data.
-
- Using the same traces.py file we created earlier:
-
- ```python
- from judgeval.common.tracer import Tracer, wrap
- from judgeval.scorers import AnswerRelevancyScorer
- from openai import OpenAI
-
- client = wrap(OpenAI())
- judgment = Tracer(project_name="my_project")
-
- @judgment.observe(span_type="tool")
- def my_tool():
-     return "Hello world!"
-
- @judgment.observe(span_type="function")
- def main():
-     task_input = my_tool()
-     res = client.chat.completions.create(
-         model="gpt-4o",
-         messages=[{"role": "user", "content": f"{task_input}"}]
-     ).choices[0].message.content
-
-     judgment.get_current_trace().async_evaluate(
-         scorers=[AnswerRelevancyScorer(threshold=0.5)],
-         input=task_input,
-         actual_output=res,
-         model="gpt-4o"
-     )
-
-     return res
- ```
- Click [here](https://judgment.mintlify.app/getting_started#create-your-first-online-evaluation) for a more detailed explanation
-
- ## Documentation and Demos
-
- For more detailed documentation, please check out our [docs](https://judgment.mintlify.app/getting_started) and some of our [demo videos](https://www.youtube.com/@AlexShan-j3o) for reference!
-
- ##