judgeval 0.0.37__py3-none-any.whl → 0.0.39__py3-none-any.whl
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- judgeval/common/tracer.py +132 -281
- judgeval/common/utils.py +1 -1
- judgeval/constants.py +2 -3
- judgeval/data/__init__.py +0 -2
- judgeval/data/datasets/dataset.py +2 -9
- judgeval/data/datasets/eval_dataset_client.py +1 -62
- judgeval/data/example.py +7 -7
- judgeval/data/result.py +3 -3
- judgeval/data/tool.py +19 -0
- judgeval/data/trace.py +5 -1
- judgeval/data/{sequence_run.py → trace_run.py} +4 -4
- judgeval/evaluation_run.py +1 -1
- judgeval/integrations/langgraph.py +187 -1768
- judgeval/judges/litellm_judge.py +1 -1
- judgeval/judges/mixture_of_judges.py +1 -1
- judgeval/judges/utils.py +1 -1
- judgeval/judgment_client.py +21 -25
- judgeval/run_evaluation.py +381 -107
- judgeval/scorers/judgeval_scorers/api_scorers/tool_order.py +4 -2
- judgeval-0.0.39.dist-info/METADATA +247 -0
- {judgeval-0.0.37.dist-info → judgeval-0.0.39.dist-info}/RECORD +23 -23
- judgeval/data/sequence.py +0 -50
- judgeval-0.0.37.dist-info/METADATA +0 -214
- {judgeval-0.0.37.dist-info → judgeval-0.0.39.dist-info}/WHEEL +0 -0
- {judgeval-0.0.37.dist-info → judgeval-0.0.39.dist-info}/licenses/LICENSE.md +0 -0
judgeval-0.0.39.dist-info/METADATA
@@ -0,0 +1,247 @@
+Metadata-Version: 2.4
+Name: judgeval
+Version: 0.0.39
+Summary: Judgeval Package
+Project-URL: Homepage, https://github.com/JudgmentLabs/judgeval
+Project-URL: Issues, https://github.com/JudgmentLabs/judgeval/issues
+Author-email: Andrew Li <andrew@judgmentlabs.ai>, Alex Shan <alex@judgmentlabs.ai>, Joseph Camyre <joseph@judgmentlabs.ai>
+License-Expression: Apache-2.0
+License-File: LICENSE.md
+Classifier: Operating System :: OS Independent
+Classifier: Programming Language :: Python :: 3
+Requires-Python: >=3.11
+Requires-Dist: anthropic
+Requires-Dist: boto3
+Requires-Dist: google-genai
+Requires-Dist: langchain-anthropic
+Requires-Dist: langchain-core
+Requires-Dist: langchain-huggingface
+Requires-Dist: langchain-openai
+Requires-Dist: litellm==1.38.12
+Requires-Dist: nest-asyncio
+Requires-Dist: openai
+Requires-Dist: pandas
+Requires-Dist: python-dotenv==1.0.1
+Requires-Dist: requests
+Requires-Dist: together
+Description-Content-Type: text/markdown
+
+<div align="center">
+
+<img src="assets/logo-light.svg#gh-light-mode-only" alt="Judgment Logo" width="400" />
+<img src="assets/logo-dark.svg#gh-dark-mode-only" alt="Judgment Logo" width="400" />
+
+**Build monitoring & evaluation pipelines for complex agents**
+
+<img src="assets/experiments_page.png" alt="Judgment Platform Experiments Page" width="800" />
+
+<br>
+
+## [🌐 Landing Page](https://www.judgmentlabs.ai/) • [Twitter/X](https://x.com/JudgmentLabs) • [💼 LinkedIn](https://www.linkedin.com/company/judgmentlabs) • [📚 Docs](https://judgment.mintlify.app/getting_started) • [🚀 Demos](https://www.youtube.com/@AlexShan-j3o) • [🎮 Discord](https://discord.gg/taAufyhf)
+
+</div>
+
+## Judgeval: open-source testing, monitoring, and optimization for AI agents
+
+Judgeval offers robust tooling for evaluating and tracing LLM agent systems. It is dev-friendly and open-source (licensed under Apache 2.0).
+
+Judgeval gets you started in five minutes, after which you'll be ready to use all of its features as your agent becomes more complex. Judgeval is natively connected to the [Judgment Platform](https://www.judgmentlabs.ai/) for free, and you can export your data and self-host at any time.
+
+We support tracing agents built with LangGraph, the OpenAI SDK, Anthropic, and more, and allow custom eval integrations for any use case. Check out our quickstarts below or our [setup guide](https://judgment.mintlify.app/getting_started) to get started.
+
+Judgeval is created and maintained by [Judgment Labs](https://judgmentlabs.ai/).
+
+## 📋 Table of Contents
+* [✨ Features](#-features)
+  * [🔍 Tracing](#-tracing)
+  * [🧪 Evals](#-evals)
+  * [📡 Monitoring](#-monitoring)
+  * [📊 Datasets](#-datasets)
+  * [💡 Insights](#-insights)
+* [🛠️ Installation](#️-installation)
+* [🏁 Get Started](#-get-started)
+* [🏢 Self-Hosting](#-self-hosting)
+* [📚 Cookbooks](#-cookbooks)
+* [⭐ Star Us on GitHub](#-star-us-on-github)
+* [❤️ Contributors](#️-contributors)
+
+<!-- Created by https://github.com/ekalinin/github-markdown-toc -->
+
+## ✨ Features
+
+| | |
+|:---|:---:|
+| <h3>🔍 Tracing</h3>Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic): **tracking inputs/outputs, latency, and cost** at every step.<br><br>Online evals can be applied to traces to measure quality on production data in real-time.<br><br>Export trace data to the Judgment Platform or your own S3 buckets, {Parquet, JSON, YAML} files, or data warehouse.<br><br>**Useful for:**<br>• 🐛 Debugging agent runs <br>• 👤 Tracking user activity <br>• 🔬 Pinpointing performance bottlenecks | <p align="center"><img src="assets/trace_screenshot.png" alt="Tracing visualization" width="1200"/></p> |
+| <h3>🧪 Evals</h3>15+ research-backed metrics including tool call accuracy, hallucinations, instruction adherence, and retrieval context recall.<br><br>Build custom evaluators that connect with our metric-tracking infrastructure.<br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 Experimental prompt testing <br>• 🛡️ Online guardrails | <p align="center"><img src="assets/experiments_page.png" alt="Evaluation metrics" width="800"/></p> |
+| <h3>📡 Monitoring</h3>Real-time performance tracking of your agents in production environments. **Track all your metrics in one place.**<br><br>Set up **Slack/email alerts** for critical metrics and receive notifications when thresholds are exceeded.<br><br>**Useful for:**<br>• 📉 Identifying degradation early <br>• 📈 Visualizing performance trends across versions and time | <p align="center"><img src="assets/monitoring_screenshot.png" alt="Monitoring Dashboard" width="1200"/></p> |
+| <h3>📊 Datasets</h3>Export trace data or import external testcases to datasets hosted on Judgment's Platform. Move datasets to/from Parquet, S3, etc.<br><br>Run evals on datasets as unit tests or to A/B test different agent configurations.<br><br>**Useful for:**<br>• 🔄 Scaled analysis for A/B tests <br>• 🗃️ Filtered collections of agent runtime data | <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |
+| <h3>💡 Insights</h3>Cluster on your data to reveal common use cases and failure modes.<br><br>Trace failures to their exact source with Judgment's Osiris agent, which localizes errors to specific components for precise fixes.<br><br>**Useful for:**<br>• 🔮 Surfacing common inputs that lead to error <br>• 🤖 Investigating agent/user behavior for optimization | <p align="center"><img src="assets/dataset_clustering_screenshot_dm.png" alt="Insights dashboard" width="1200"/></p> |
+
+## 🛠️ Installation
+
+Get started with Judgeval by installing our SDK using pip:
+
+```bash
+pip install judgeval
+```
+
+Ensure you have your `JUDGMENT_API_KEY` and `JUDGMENT_ORG_ID` environment variables set to connect to the [Judgment platform](https://app.judgmentlabs.ai/).
+
+**If you don't have keys, [create an account](https://app.judgmentlabs.ai/register) on the platform!**
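The two environment variables can be exported in your shell before running any Judgeval code; the values below are placeholders for the credentials shown in your own account settings:

```shell
# Placeholder values: substitute the API key and organization ID
# from your Judgment account settings.
export JUDGMENT_API_KEY="your-api-key"
export JUDGMENT_ORG_ID="your-org-id"
```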
+
+## 🏁 Get Started
+
+Here's how you can quickly start using Judgeval:
+
+### 🛰️ Tracing
+
+Track your agent execution with full observability in just a few lines of code.
+Create a file named `traces.py` with the following code:
+
+```python
+from judgeval.common.tracer import Tracer, wrap
+from openai import OpenAI
+
+client = wrap(OpenAI())
+judgment = Tracer(project_name="my_project")
+
+@judgment.observe(span_type="tool")
+def my_tool():
+    return "What's the capital of the U.S.?"
+
+@judgment.observe(span_type="function")
+def main():
+    task_input = my_tool()
+    res = client.chat.completions.create(
+        model="gpt-4.1",
+        messages=[{"role": "user", "content": f"{task_input}"}]
+    )
+    return res.choices[0].message.content
+
+main()
+```
+
+[Click here](https://judgment.mintlify.app/getting_started#create-your-first-trace) for a more detailed explanation.
+
+### 📝 Offline Evaluations
+
+You can evaluate your agent's execution to measure quality metrics such as hallucination.
+Create a file named `evaluate.py` with the following code:
+
+```python evaluate.py
+from judgeval import JudgmentClient
+from judgeval.data import Example
+from judgeval.scorers import FaithfulnessScorer
+
+client = JudgmentClient()
+
+example = Example(
+    input="What if these shoes don't fit?",
+    actual_output="We offer a 30-day full refund at no extra cost.",
+    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
+)
+
+scorer = FaithfulnessScorer(threshold=0.5)
+results = client.run_evaluation(
+    examples=[example],
+    scorers=[scorer],
+    model="gpt-4.1",
+)
+print(results)
+```
+
+[Click here](https://judgment.mintlify.app/getting_started#create-your-first-experiment) for a more detailed explanation.
+
+### 📡 Online Evaluations
+
+Attach performance monitoring to traces to measure the quality of your systems in production.
+
+Using the same `traces.py` file we created earlier, modify the `main` function:
+
+```python
+from judgeval.common.tracer import Tracer, wrap
+from judgeval.scorers import AnswerRelevancyScorer
+from openai import OpenAI
+
+client = wrap(OpenAI())
+judgment = Tracer(project_name="my_project")
+
+@judgment.observe(span_type="tool")
+def my_tool():
+    return "Hello world!"
+
+@judgment.observe(span_type="function")
+def main():
+    task_input = my_tool()
+    res = client.chat.completions.create(
+        model="gpt-4.1",
+        messages=[{"role": "user", "content": f"{task_input}"}]
+    ).choices[0].message.content
+
+    judgment.async_evaluate(
+        scorers=[AnswerRelevancyScorer(threshold=0.5)],
+        input=task_input,
+        actual_output=res,
+        model="gpt-4.1"
+    )
+    print("Online evaluation submitted.")
+    return res
+
+main()
+```
+
+[Click here](https://judgment.mintlify.app/getting_started#create-your-first-online-evaluation) for a more detailed explanation.
+
+## 🏢 Self-Hosting
+
+Run Judgment on your own infrastructure: we provide comprehensive self-hosting capabilities that give you full control over the backend and data plane that Judgeval interfaces with.
+
+### Key Features
+* Deploy Judgment on your own AWS account
+* Store data in your own Supabase instance
+* Access Judgment through your own custom domain
+
+### Getting Started
+1. Check out our [self-hosting documentation](https://judgment.mintlify.app/self_hosting/get_started) for detailed setup instructions, along with how your self-hosted instance can be accessed
+2. Use the [Judgment CLI](https://github.com/JudgmentLabs/judgment-cli) to deploy your self-hosted environment
+3. After your self-hosted instance is set up, make sure the `JUDGMENT_API_URL` environment variable is set to your self-hosted backend endpoint
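For step 3, the backend endpoint is supplied the same way as the API credentials; the URL below is a placeholder for your own deployment's domain:

```shell
# Placeholder URL: point this at your self-hosted backend endpoint.
export JUDGMENT_API_URL="https://judgment.example.com"
```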
+
+## 📚 Cookbooks
+
+Have your own? We're happy to feature it if you create a PR or message us on [Discord](https://discord.gg/taAufyhf).
+
+You can access our repo of cookbooks [here](https://github.com/JudgmentLabs/judgment-cookbook). Here are some highlights:
+
+### Sample Agents
+
+#### 💰 [LangGraph Financial QA Agent](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/financial_agent/demo.py)
+A LangGraph-based agent for financial queries, featuring RAG capabilities with a vector database for contextual data retrieval and evaluation of its reasoning and data accuracy.
+
+#### ✈️ [OpenAI Travel Agent](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/openai_travel_agent/agent.py)
+A travel planning agent using OpenAI API calls, custom tool functions, and RAG with a vector database for up-to-date and contextual travel information. Evaluated for itinerary quality and information relevance.
+
+### Custom Evaluators
+
+#### 🔍 [PII Detection](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/classifier_scorer/pii_checker.py)
+Detecting and evaluating Personal Identifiable Information (PII) leakage.
+
+#### 📧 [Cold Email Generation](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/custom_scorers/cold_email_scorer.py)
+Evaluates if a cold email generator properly utilizes all relevant information about the target recipient.
+
+## ⭐ Star Us on GitHub
+
+If you find Judgeval useful, please consider giving us a star on GitHub! Your support helps us grow our community and continue improving the product.
+
+## ❤️ Contributors
+
+There are many ways to contribute to Judgeval:
+
+- Submit [bug reports](https://github.com/JudgmentLabs/judgeval/issues) and [feature requests](https://github.com/JudgmentLabs/judgeval/issues)
+- Review the documentation and submit [Pull Requests](https://github.com/JudgmentLabs/judgeval/pulls) to improve it
+- Speak or write about Judgment and let us know!
+
+<!-- Contributors collage -->
+[](https://github.com/JudgmentLabs/judgeval/graphs/contributors)
{judgeval-0.0.37.dist-info → judgeval-0.0.39.dist-info}/RECORD
@@ -1,35 +1,35 @@
 judgeval/__init__.py,sha256=x9HWt4waJwJMAqTuJSg2MezF9Zg-macEjeU-ajbly-8,330
 judgeval/clients.py,sha256=6VQmEqmfCngUdS2MuPBIpHvtDFqOENm8-_BmMvjLyRQ,944
-judgeval/constants.py,sha256=
-judgeval/evaluation_run.py,sha256=
-judgeval/judgment_client.py,sha256=
+judgeval/constants.py,sha256=aDEy51CUbzp_CWARFmw3Fie5fZ-2pkaYPc_gUEbvT4Y,5591
+judgeval/evaluation_run.py,sha256=V9xMyiJ7e9lqHRblaeeMh6oyx1MEtGwfSxYtbi-EeXY,6746
+judgeval/judgment_client.py,sha256=eQQ6J3iUPHfBu9v83-8F-yNMqf015b1NoGsbLOzy2s4,23375
 judgeval/rules.py,sha256=jkh1cXXcUf8oRY7xJUZfcQBYWn_rjUW4GvrhRt15PeU,20265
-judgeval/run_evaluation.py,sha256=
+judgeval/run_evaluation.py,sha256=bYNbMubqOOUNlsplY5Iw9IpUxuuqsJHIs-RweWC45E4,47474
 judgeval/version_check.py,sha256=bvJEidB7rAeXozoUbN9Yb97QOR_s2hgvpvj74jJ5HlY,943
 judgeval/common/__init__.py,sha256=7d24BRxtncpMj3AAJCj8RS7TqgjXmW777HVZH6-3sBs,289
 judgeval/common/exceptions.py,sha256=U-TxHLn7oVMezsMuoYouNDb2XuS8RCggfntYf5_6u4E,565
 judgeval/common/logger.py,sha256=KO75wWXCxhUHUMvLaTU31ZzOk6tkZBa7heQ7y0f-zFE,6062
 judgeval/common/s3_storage.py,sha256=W8wq9S7qJZdqdBR4sk3aEZ4K3-pz40DOoolOJrWs9Vo,3768
-judgeval/common/tracer.py,sha256=
-judgeval/common/utils.py,sha256=
-judgeval/data/__init__.py,sha256
+judgeval/common/tracer.py,sha256=EkWkg2AsS5FIj2ffh912qZZ9ew5h3hu2rynPBDsMszw,80463
+judgeval/common/utils.py,sha256=w1SjpDtB1DTJapFSAvLzr_a3gGI45iacEoxIUnQXx4Q,34087
+judgeval/data/__init__.py,sha256=Q4WiIva20U_NgxGr-MU-9FWN_eFzUZBVgCsBmoo7IM8,501
 judgeval/data/custom_example.py,sha256=QRBqiRiZS8UgVeTRHY0r1Jzm6yAYsyg6zmHxQGxdiQs,739
-judgeval/data/example.py,sha256=
-judgeval/data/result.py,sha256=
+judgeval/data/example.py,sha256=XptCg2dLMS46SfDYa4kLgq1zXnlDnhOmR15Ci_08p90,6882
+judgeval/data/result.py,sha256=KfU9lhAKG_Xo2eGDm2uKVVRZpf177IDASg1cIwedJwE,3184
 judgeval/data/scorer_data.py,sha256=JVlaTx1EP2jw2gh3Vgx1CSEsvIFABAN26IquKyxwiJQ,3273
-judgeval/data/
-judgeval/data/
-judgeval/data/
+judgeval/data/tool.py,sha256=x6YsdTTfeIwSn5f1xIDU3j1xJgSCzho0FW1ojR-L0Ac,612
+judgeval/data/trace.py,sha256=euYIbwYsGqATWIeOZwBzNWS3hh3wefVzMJ7v5rHvG6c,5069
+judgeval/data/trace_run.py,sha256=G_OsHNK_nZzJKhtdiyWp7GFyyns5AOJZ956GM_4jXM0,2192
 judgeval/data/datasets/__init__.py,sha256=IdNKhQv9yYZ_op0rdBacrFaFVmiiYQ3JTzXzxOTsEVQ,176
-judgeval/data/datasets/dataset.py,sha256=
-judgeval/data/datasets/eval_dataset_client.py,sha256=
-judgeval/integrations/langgraph.py,sha256=
+judgeval/data/datasets/dataset.py,sha256=oU9hvZTifK2x8em3FhL3oIqgHOByfJWH6C_9rIKnL5g,12773
+judgeval/data/datasets/eval_dataset_client.py,sha256=3RBfkaMrkudjnmY_qFwY4I-2mOPE3XK4WxkfSweLB-Q,15016
+judgeval/integrations/langgraph.py,sha256=L9zPPWVLGL2HWuwHPqM5Kic4S7EfQ_Y1Y3YKBJNfGCA,23004
 judgeval/judges/__init__.py,sha256=6X7VSwrwsdxGBNxCyapVRWGghhKOy3MVxFNMQ62kCXM,308
 judgeval/judges/base_judge.py,sha256=ch_S7uBB7lyv44Lf1d7mIGFpveOO58zOkkpImKgd9_4,994
-judgeval/judges/litellm_judge.py,sha256=
-judgeval/judges/mixture_of_judges.py,sha256=
+judgeval/judges/litellm_judge.py,sha256=DhB6px9ELZL3gbMb2w4FkBliuTlaCVIcjE8v149G6NM,2425
+judgeval/judges/mixture_of_judges.py,sha256=D97h8L-6saPwwppVwitrIdlMAjizzxGWeVOfNyVnXZA,15550
 judgeval/judges/together_judge.py,sha256=l00hhPerAZXg3oYBd8cyMtWsOTNt_0FIqoxhKJKQe3k,2302
-judgeval/judges/utils.py,sha256=
+judgeval/judges/utils.py,sha256=vL-15_udU94JHUAiyrAvHAKMj6Fqypg01ek4YH5zVCM,2687
 judgeval/scorers/__init__.py,sha256=-4GLkYiLKI_BxpoIfgadCFEUfqJcBWZLAtfrInjZT0Q,1282
 judgeval/scorers/api_scorer.py,sha256=NQ_CrrUPhSUk1k2Q8rKpCG_TU2FT32sFEqvb-Yi54B0,2688
 judgeval/scorers/exceptions.py,sha256=eGW5CuJgZ5YJBFrE4FHDSF651PO1dKAZ379mJ8gOsfo,178
@@ -53,14 +53,14 @@ judgeval/scorers/judgeval_scorers/api_scorers/hallucination.py,sha256=k5gDOki-8K
 judgeval/scorers/judgeval_scorers/api_scorers/instruction_adherence.py,sha256=XnSGEkQfwVqaqnHEGMCsxNiHVzrsrej48uDbLoWc8CQ,678
 judgeval/scorers/judgeval_scorers/api_scorers/json_correctness.py,sha256=mMKEuR87_yanEuZJ5YSGFMHDD_oLVZ6-rQuciFaDOMA,1095
 judgeval/scorers/judgeval_scorers/api_scorers/summarization.py,sha256=QmWB8bVbDYHY5FcF0rYZE_3c2XXgMLRmR6aXJWfdMC4,655
-judgeval/scorers/judgeval_scorers/api_scorers/tool_order.py,sha256=
+judgeval/scorers/judgeval_scorers/api_scorers/tool_order.py,sha256=urm8LgkeZA7e-ePWo6AToKGheQYSp6MOpKon5NF5EJw,570
 judgeval/scorers/judgeval_scorers/classifiers/__init__.py,sha256=Qt81W5ZCwMvBAne0LfQDb8xvg5iOG1vEYP7WizgwAZo,67
 judgeval/scorers/judgeval_scorers/classifiers/text2sql/__init__.py,sha256=8iTzMvou1Dr8pybul6lZHKjc9Ye2-0_racRGYkhEdTY,74
 judgeval/scorers/judgeval_scorers/classifiers/text2sql/text2sql_scorer.py,sha256=ly72Z7s_c8NID6-nQnuW8qEGEW2MqdvpJ-5WfXzbAQg,2579
 judgeval/tracer/__init__.py,sha256=wy3DYpH8U_z0GO_K_gOSkK0tTTD-u5eLDo0T5xIBoAc,147
 judgeval/utils/alerts.py,sha256=O19Xj7DA0YVjl8PWiuH4zfdZeu3yiLVvHfY8ah2wG0g,2759
 judgeval/utils/data_utils.py,sha256=pB4GBWi8XoM2zSR2NlLXH5kqcQ029BVhDxaVKkdmiBY,1860
-judgeval-0.0.
-judgeval-0.0.
-judgeval-0.0.
-judgeval-0.0.
+judgeval-0.0.39.dist-info/METADATA,sha256=Q4wTRKXRoTozgF96BJFFoGwOoy-vLnAGs0HOQ9PCZ_k,11860
+judgeval-0.0.39.dist-info/WHEEL,sha256=qtCwoSJWgHk21S1Kb4ihdzI2rlJ1ZKaIurTj_ngOhyQ,87
+judgeval-0.0.39.dist-info/licenses/LICENSE.md,sha256=tKmCg7k5QOmxPK19XMfzim04QiQJPmgIm0pAn55IJwk,11352
+judgeval-0.0.39.dist-info/RECORD,,
judgeval/data/sequence.py DELETED
@@ -1,50 +0,0 @@
-from pydantic import BaseModel, Field, field_validator, model_validator
-from typing import List, Optional, Union, Any, Dict
-from judgeval.data.example import Example
-from judgeval.scorers import JudgevalScorer, APIJudgmentScorer
-from uuid import uuid4
-from datetime import datetime, timezone
-
-class Sequence(BaseModel):
-    """
-    A sequence is a list of either Examples or nested Sequence objects.
-    """
-    sequence_id: str = Field(default_factory=lambda: str(uuid4()))
-    name: Optional[str] = "Sequence"
-    created_at: str = Field(default_factory=lambda: datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S"))
-    items: List[Union["Sequence", Example]] = []
-    scorers: Optional[Any] = None
-    parent_sequence_id: Optional[str] = None
-    sequence_order: Optional[int] = 0
-    root_sequence_id: Optional[str] = None
-    inputs: Optional[Dict[str, Any]] = None
-    output: Optional[Any] = None
-    expected_tools: Optional[List[Dict[str, Any]]] = None
-
-    @field_validator("scorers")
-    def validate_scorer(cls, v):
-        for scorer in v or []:
-            if not isinstance(scorer, APIJudgmentScorer) and not isinstance(scorer, JudgevalScorer):
-                raise ValueError(f"Invalid scorer type: {type(scorer)}")
-        return v
-
-    @model_validator(mode="after")
-    def populate_sequence_metadata(self) -> "Sequence":
-        """Recursively set parent_sequence_id, root_sequence_id, and sequence_order."""
-        # If root_sequence_id isn't already set, assign it to self
-        if self.root_sequence_id is None:
-            self.root_sequence_id = self.sequence_id
-
-        for idx, item in enumerate(self.items):
-            item.sequence_order = idx
-            if isinstance(item, Sequence):
-                item.parent_sequence_id = self.sequence_id
-                item.root_sequence_id = self.root_sequence_id
-                item.populate_sequence_metadata()
-        return self
-
-    class Config:
-        arbitrary_types_allowed = True
-
-# Update forward references so that "Sequence" inside items is resolved.
-Sequence.model_rebuild()
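The core behavior of the deleted `Sequence` model was the recursive metadata pass in `populate_sequence_metadata`. Stripped of the pydantic machinery, the propagation logic amounts to this simplified standalone sketch (the `Node` class is a hypothetical stand-in, not the removed class itself, and it applies parent/root ids to every child rather than only to nested sequences):

```python
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class Node:
    # Simplified stand-in for the deleted Sequence model.
    sequence_id: str = field(default_factory=lambda: str(uuid4()))
    items: list = field(default_factory=list)
    parent_sequence_id: str = None
    root_sequence_id: str = None
    sequence_order: int = 0

def populate(node: Node) -> Node:
    # Mirrors populate_sequence_metadata: the root points at itself, and
    # each child records its parent, the shared root, and its position.
    if node.root_sequence_id is None:
        node.root_sequence_id = node.sequence_id
    for idx, item in enumerate(node.items):
        item.sequence_order = idx
        item.parent_sequence_id = node.sequence_id
        item.root_sequence_id = node.root_sequence_id
        populate(item)
    return node

root = populate(Node(items=[Node(), Node(items=[Node()])]))
assert root.root_sequence_id == root.sequence_id
assert root.items[1].sequence_order == 1
assert root.items[1].items[0].root_sequence_id == root.sequence_id
```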
judgeval-0.0.37.dist-info/METADATA DELETED
@@ -1,214 +0,0 @@
-Metadata-Version: 2.4
-Name: judgeval
-Version: 0.0.37
-Summary: Judgeval Package
-Project-URL: Homepage, https://github.com/JudgmentLabs/judgeval
-Project-URL: Issues, https://github.com/JudgmentLabs/judgeval/issues
-Author-email: Andrew Li <andrew@judgmentlabs.ai>, Alex Shan <alex@judgmentlabs.ai>, Joseph Camyre <joseph@judgmentlabs.ai>
-License-Expression: Apache-2.0
-License-File: LICENSE.md
-Classifier: Operating System :: OS Independent
-Classifier: Programming Language :: Python :: 3
-Requires-Python: >=3.11
-Requires-Dist: anthropic
-Requires-Dist: boto3
-Requires-Dist: google-genai
-Requires-Dist: langchain-anthropic
-Requires-Dist: langchain-core
-Requires-Dist: langchain-huggingface
-Requires-Dist: langchain-openai
-Requires-Dist: litellm==1.38.12
-Requires-Dist: nest-asyncio
-Requires-Dist: openai
-Requires-Dist: pandas
-Requires-Dist: python-dotenv==1.0.1
-Requires-Dist: requests
-Requires-Dist: together
-Description-Content-Type: text/markdown
-
-<div align="center">
-
-<img src="assets/logo-light.svg#gh-light-mode-only" alt="Judgment Logo" width="400" />
-<img src="assets/logo-dark.svg#gh-dark-mode-only" alt="Judgment Logo" width="400" />
-
-**Build monitoring & evaluation pipelines for complex agents**
-
-[Website](https://www.judgmentlabs.ai/) • [Twitter/X](https://x.com/JudgmentLabs) • [LinkedIn](https://www.linkedin.com/company/judgmentlabs) • [Documentation](https://judgment.mintlify.app/getting_started) • [Demos](https://www.youtube.com/@AlexShan-j3o)
-
-</div>
-
-## 🚀 What is Judgeval?
-
-Judgeval is an open-source tool for testing, monitoring, and optimizing AI agents. Judgeval is created and maintained by [Judgment Labs](https://judgmentlabs.ai/).
-
-**🔍 Tracing**
-* Automatic agent tracing for common agent frameworks and SDKs (LangGraph, OpenAI, Anthropic, etc.)
-* Track input/output, latency, cost, token usage at every step
-* Function tracing with `@judgment.observe` decorator
-
-**🧪 Evals**
-* Plug-and-measure 15+ metrics, including:
-    * Tool call accuracy
-    * Hallucinations
-    * Instruction adherence
-    * Retrieval context recall
-
-    Our metric implementations are research-backed by Stanford and Berkeley AI labs. Check out our [research](https://judgmentlabs.ai/research)!
-* Build custom evaluators that seamlessly connect with our infrastructure!
-* Use our evals for:
-    * ⚠️ Unit-testing your agent
-    * 🔬 Experimentally testing new prompts and models
-    * 🛡️ Online evaluations to guardrail your agent's actions and responses
-
-**📊 Datasets**
-* Export trace data to datasets hosted on Judgment's Platform and export to JSON, Parquet, S3, etc.
-* Run evals on datasets as unit-tests or to A/B test agent configs
-
-**💡 Insights**
-* Error clustering groups agent failures to uncover failure patterns and speed up root cause analysis
-* Trace agent failures to their exact source. Judgment's Osiris agent localizes errors to specific agent components, enabling precise, targeted fixes.
-
-## 🛠️ Installation
-
-Get started with Judgeval by installing our SDK using pip:
-
-```bash
-pip install judgeval
-```
-
-Ensure you have your `JUDGMENT_API_KEY` environment variable set to connect to the [Judgment platform](https://app.judgmentlabs.ai/). If you don't have a key, create an account on the platform!
-
-## 🏁 Get Started
-
-Here's how you can quickly start using Judgeval:
-
-### 🛰️ Tracing
-
-Track your agent execution with full observability with just a few lines of code.
-Create a file named `traces.py` with the following code:
-
-```python
-from judgeval.common.tracer import Tracer, wrap
-from openai import OpenAI
-
-client = wrap(OpenAI())
-judgment = Tracer(project_name="my_project")
-
-@judgment.observe(span_type="tool")
-def my_tool():
-    return "What's the capital of the U.S.?"
-
-@judgment.observe(span_type="function")
-def main():
-    task_input = my_tool()
-    res = client.chat.completions.create(
-        model="gpt-4.1",
-        messages=[{"role": "user", "content": f"{task_input}"}]
-    )
-    return res.choices[0].message.content
-
-main()
-```
-
-[Click here](https://judgment.mintlify.app/getting_started#create-your-first-trace) for a more detailed explanation.
-
-### 📝 Offline Evaluations
-
-You can evaluate your agent's execution to measure quality metrics such as hallucination.
-Create a file named `evaluate.py` with the following code:
-
-```python evaluate.py
-from judgeval import JudgmentClient
-from judgeval.data import Example
-from judgeval.scorers import FaithfulnessScorer
-
-client = JudgmentClient()
-
-example = Example(
-    input="What if these shoes don't fit?",
-    actual_output="We offer a 30-day full refund at no extra cost.",
-    retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
-)
-
-scorer = FaithfulnessScorer(threshold=0.5)
-results = client.run_evaluation(
-    examples=[example],
-    scorers=[scorer],
-    model="gpt-4.1",
-)
-print(results)
-```
-
-[Click here](https://judgment.mintlify.app/getting_started#create-your-first-experiment) for a more detailed explanation.
-
-### 📡 Online Evaluations
-
-Apply performance monitoring to measure the quality of your systems in production, not just on traces.
-
-Using the same `traces.py` file we created earlier, modify `main` function:
-
-```python
-from judgeval.common.tracer import Tracer, wrap
-from judgeval.scorers import AnswerRelevancyScorer
-from openai import OpenAI
-
-client = wrap(OpenAI())
-judgment = Tracer(project_name="my_project")
-
-@judgment.observe(span_type="tool")
-def my_tool():
-    return "Hello world!"
-
-@judgment.observe(span_type="function")
-def main():
-    task_input = my_tool()
-    res = client.chat.completions.create(
-        model="gpt-4.1",
-        messages=[{"role": "user", "content": f"{task_input}"}]
-    ).choices[0].message.content
-
-    judgment.get_current_trace().async_evaluate(
-        scorers=[AnswerRelevancyScorer(threshold=0.5)],
-        input=task_input,
-        actual_output=res,
-        model="gpt-4.1"
-    )
-    print("Online evaluation submitted.")
-    return res
-
-main()
-```
-
-[Click here](https://judgment.mintlify.app/getting_started#create-your-first-online-evaluation) for a more detailed explanation.
-
-## 🏢 Self-Hosting
-
-Run Judgment on your own infrastructure: we provide comprehensive self-hosting capabilities that give you full control over the backend and data plane that Judgeval interfaces with.
-
-### Key Features
-* Deploy Judgment on your own AWS account
-* Store data in your own Supabase instance
-* Access Judgment through your own custom domain
-
-### Getting Started
-1. Check out our [self-hosting documentation](https://judgment.mintlify.app/self_hosting/get_started) for detailed setup instructions, along with how your self-hosted instance can be accessed
-2. Use the [Judgment CLI](https://github.com/JudgmentLabs/judgment-cli) to deploy your self-hosted environment
-3. After your self-hosted instance is setup, make sure the `JUDGMENT_API_URL` environmental variable is set to your self-hosted backend endpoint
-
-## ⭐ Star Us on GitHub
-
-If you find Judgeval useful, please consider giving us a star on GitHub! Your support helps us grow our community and continue improving the product.
-
-## 🤝 Contributing
-
-There are many ways to contribute to Judgeval:
-
-- Submit [bug reports](https://github.com/JudgmentLabs/judgeval/issues) and [feature requests](https://github.com/JudgmentLabs/judgeval/issues)
-- Review the documentation and submit [Pull Requests](https://github.com/JudgmentLabs/judgeval/pulls) to improve it
-- Speaking or writing about Judgment and letting us know!
-
-## Documentation and Demos
-
-For more detailed documentation, please check out our [developer docs](https://judgment.mintlify.app/getting_started) and some of our [demo videos](https://www.youtube.com/@AlexShan-j3o) for reference!
{judgeval-0.0.37.dist-info → judgeval-0.0.39.dist-info}/WHEEL: file without changes
{judgeval-0.0.37.dist-info → judgeval-0.0.39.dist-info}/licenses/LICENSE.md: file without changes