judgeval 0.0.39__py3-none-any.whl → 0.0.41__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,1450 @@
1
+ Metadata-Version: 2.4
2
+ Name: judgeval
3
+ Version: 0.0.41
4
+ Summary: Judgeval Package
5
+ Project-URL: Homepage, https://github.com/JudgmentLabs/judgeval
6
+ Project-URL: Issues, https://github.com/JudgmentLabs/judgeval/issues
7
+ Author-email: Andrew Li <andrew@judgmentlabs.ai>, Alex Shan <alex@judgmentlabs.ai>, Joseph Camyre <joseph@judgmentlabs.ai>
8
+ License-Expression: Apache-2.0
9
+ License-File: LICENSE.md
10
+ Classifier: Operating System :: OS Independent
11
+ Classifier: Programming Language :: Python :: 3
12
+ Requires-Python: >=3.11
13
+ Requires-Dist: anthropic
14
+ Requires-Dist: boto3
15
+ Requires-Dist: google-genai
16
+ Requires-Dist: langchain-anthropic
17
+ Requires-Dist: langchain-core
18
+ Requires-Dist: langchain-huggingface
19
+ Requires-Dist: langchain-openai
20
+ Requires-Dist: litellm==1.61.15
21
+ Requires-Dist: nest-asyncio
22
+ Requires-Dist: openai
23
+ Requires-Dist: pandas
24
+ Requires-Dist: python-dotenv==1.0.1
25
+ Requires-Dist: requests
26
+ Requires-Dist: together
27
+ Description-Content-Type: text/markdown
28
+
29
+ <div align="center">
30
+
31
+ <img src="assets/new_lightmode.svg#gh-light-mode-only" alt="Judgment Logo" width="400" />
32
+ <img src="assets/new_darkmode.svg#gh-dark-mode-only" alt="Judgment Logo" width="400" />
33
+
34
+ **Build monitoring & evaluation pipelines for complex agents**
35
+
36
+ <img src="assets/experiments_pagev2.png" alt="Judgment Platform Experiments Page" width="800" />
37
+
38
+ <br>
39
+
40
+ ## [🌐 Landing Page](https://www.judgmentlabs.ai/) • [📚 Docs](https://docs.judgmentlabs.ai/introduction) • [🚀 Demos](https://www.youtube.com/@AlexShan-j3o)
41
+
42
+ [![X](https://img.shields.io/badge/-X/Twitter-000?logo=x&logoColor=white)](https://x.com/JudgmentLabs)
43
+ [![LinkedIn](https://custom-icon-badges.demolab.com/badge/LinkedIn%20-0A66C2?logo=linkedin-white&logoColor=fff)](https://www.linkedin.com/company/judgmentlabs)
44
+ [![Discord](https://img.shields.io/badge/-Discord-5865F2?logo=discord&logoColor=white)](https://discord.gg/ZCnSXYug)
45
+
46
+ </div>
47
+
48
+ ## Judgeval: open-source testing, monitoring, and optimization for AI agents
49
+
50
+ Judgeval offers robust tooling for evaluating and tracing LLM agent systems. It is dev-friendly and open-source (licensed under Apache 2.0).
51
+
52
+ Judgeval gets you started in five minutes, after which you'll be ready to use all of its features as your agent becomes more complex. Judgeval connects natively to the [Judgment Platform](https://www.judgmentlabs.ai/) for free, and you can export your data and self-host at any time.
53
+
54
+ We support tracing for agents built with LangGraph, the OpenAI SDK, Anthropic, and more, and allow custom eval integrations for any use case. Check out our quickstarts below or our [setup guide](https://docs.judgmentlabs.ai/getting-started) to get started.
55
+
56
+ Judgeval is created and maintained by [Judgment Labs](https://judgmentlabs.ai/).
57
+
58
+ ## 📋 Table of Contents
59
+ - [🌐 Landing Page • 📚 Docs • 🚀 Demos](#-landing-page----docs---demos)
60
+ - [Judgeval: open-source testing, monitoring, and optimization for AI agents](#judgeval-open-source-testing-monitoring-and-optimization-for-ai-agents)
61
+ - [📋 Table of Contents](#-table-of-contents)
62
+ - [✨ Features](#-features)
63
+ - [🛠️ Installation](#️-installation)
64
+ - [🏁 Get Started](#-get-started)
65
+ - [🛰️ Tracing](#️-tracing)
66
+ - [📝 Offline Evaluations](#-offline-evaluations)
67
+ - [📡 Online Evaluations](#-online-evaluations)
68
+ - [🏢 Self-Hosting](#-self-hosting)
69
+ - [Key Features](#key-features)
70
+ - [Getting Started](#getting-started)
71
+ - [📚 Cookbooks](#-cookbooks)
72
+ - [Sample Agents](#sample-agents)
73
+ - [💰 LangGraph Financial QA Agent](#-langgraph-financial-qa-agent)
74
+ - [✈️ OpenAI Travel Agent](#️-openai-travel-agent)
75
+ - [Custom Evaluators](#custom-evaluators)
76
+ - [🔍 PII Detection](#-pii-detection)
77
+ - [📧 Cold Email Generation](#-cold-email-generation)
78
+ - [💻 Development with Cursor](#-development-with-cursor)
79
+ - [⭐ Star Us on GitHub](#-star-us-on-github)
80
+ - [❤️ Contributors](#️-contributors)
81
+
82
+ <!-- Created by https://github.com/ekalinin/github-markdown-toc -->
83
+
84
+
85
+ ## ✨ Features
86
+
87
+ | | |
88
+ |:---|:---:|
89
+ | <h3>🔍 Tracing</h3>Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic): **tracking inputs/outputs, latency, and cost** at every step.<br><br>Online evals can be applied to traces to measure quality on production data in real-time.<br><br>Export trace data to the Judgment Platform or your own S3 buckets, {Parquet, JSON, YAML} files, or data warehouse.<br><br>**Useful for:**<br>• 🐛 Debugging agent runs <br>• 👤 Tracking user activity <br>• 🔬 Pinpointing performance bottlenecks| <p align="center"><img src="assets/trace_screenshot.png" alt="Tracing visualization" width="1200"/></p> |
90
+ | <h3>🧪 Evals</h3>15+ research-backed metrics including tool call accuracy, hallucinations, instruction adherence, and retrieval context recall.<br><br>Build custom evaluators that connect with our metric-tracking infrastructure. <br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 Experimental prompt testing<br>• 🛡️ Online guardrails <br><br> | <p align="center"><img src="assets/experiments_page.png" alt="Evaluation metrics" width="800"/></p> |
91
+ | <h3>📡 Monitoring</h3>Real-time performance tracking of your agents in production environments. **Track all your metrics in one place.**<br><br>Set up **Slack/email alerts** for critical metrics and receive notifications when thresholds are exceeded.<br><br> **Useful for:** <br>•📉 Identifying degradation early <br>•📈 Visualizing performance trends across versions and time | <p align="center"><img src="assets/monitoring_screenshot.png" alt="Monitoring Dashboard" width="1200"/></p> |
92
+ | <h3>📊 Datasets</h3>Export trace data or import external testcases to datasets hosted on Judgment's Platform. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations. <br><br> **Useful for:**<br>• 🔄 Scaled analysis for A/B tests <br>• 🗃️ Filtered collections of agent runtime data| <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |
93
+ | <h3>💡 Insights</h3>Cluster on your data to reveal common use cases and failure modes.<br><br>Trace failures to their exact source with Judgment's Osiris agent, which localizes errors to specific components for precise fixes.<br><br> **Useful for:**<br>•🔮 Surfacing common inputs that lead to error<br>•🤖 Investigating agent/user behavior for optimization <br>| <p align="center"><img src="assets/dataset_clustering_screenshot_dm.png" alt="Insights dashboard" width="1200"/></p> |
94
+
95
+ ## 🛠️ Installation
96
+
97
+ Get started with Judgeval by installing our SDK using pip:
98
+
99
+ ```bash
100
+ pip install judgeval
101
+ ```
102
+
103
+ Ensure you have your `JUDGMENT_API_KEY` and `JUDGMENT_ORG_ID` environment variables set to connect to the [Judgment platform](https://app.judgmentlabs.ai/).
104
+
105
+ **If you don't have keys, [create an account](https://app.judgmentlabs.ai/register) on the platform!**
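For example, assuming a POSIX shell, you can set both variables in your environment (or put them in a `.env` file); the values below are placeholders for the keys from your Judgment account:

```shell
# Replace the placeholder values with the keys from your Judgment account
export JUDGMENT_API_KEY="your_api_key"
export JUDGMENT_ORG_ID="your_org_id"
```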
106
+
107
+ ## 🏁 Get Started
108
+
109
+ Here's how you can quickly start using Judgeval:
110
+
111
+ ### 🛰️ Tracing
112
+
113
+ Track your agent's execution with full observability in just a few lines of code.
114
+ Create a file named `traces.py` with the following code:
115
+
116
+ ```python
117
+ from judgeval.common.tracer import Tracer, wrap
118
+ from openai import OpenAI
119
+
120
+ client = wrap(OpenAI())
121
+ judgment = Tracer(project_name="my_project")
122
+
123
+ @judgment.observe(span_type="tool")
124
+ def my_tool():
125
+ return "What's the capital of the U.S.?"
126
+
127
+ @judgment.observe(span_type="function")
128
+ def main():
129
+ task_input = my_tool()
130
+ res = client.chat.completions.create(
131
+ model="gpt-4.1",
132
+ messages=[{"role": "user", "content": f"{task_input}"}]
133
+ )
134
+ return res.choices[0].message.content
135
+
136
+ main()
137
+ ```
138
+
139
+ [Click here](https://docs.judgmentlabs.ai/getting-started#create-your-first-trace) for a more detailed explanation.
140
+
141
+ ### 📝 Offline Evaluations
142
+
143
+ You can evaluate your agent's execution to measure quality metrics such as hallucination.
144
+ Create a file named `evaluate.py` with the following code:
145
+
146
+ ```python
147
+ from judgeval import JudgmentClient
148
+ from judgeval.data import Example
149
+ from judgeval.scorers import FaithfulnessScorer
150
+
151
+ client = JudgmentClient()
152
+
153
+ example = Example(
154
+ input="What if these shoes don't fit?",
155
+ actual_output="We offer a 30-day full refund at no extra cost.",
156
+ retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
157
+ )
158
+
159
+ scorer = FaithfulnessScorer(threshold=0.5)
160
+ results = client.run_evaluation(
161
+ examples=[example],
162
+ scorers=[scorer],
163
+ model="gpt-4.1",
164
+ )
165
+ print(results)
166
+ ```
167
+
168
+ [Click here](https://docs.judgmentlabs.ai/getting-started#create-your-first-experiment) for a more detailed explanation.
169
+
170
+ ### 📡 Online Evaluations
171
+
172
+ Attach performance monitoring to traces to measure the quality of your systems in production.
173
+
174
+ Using the same `traces.py` file we created earlier, modify the `main` function:
175
+
176
+ ```python
177
+ from judgeval.common.tracer import Tracer, wrap
178
+ from judgeval.scorers import AnswerRelevancyScorer
179
+ from openai import OpenAI
180
+
181
+ client = wrap(OpenAI())
182
+ judgment = Tracer(project_name="my_project")
183
+
184
+ @judgment.observe(span_type="tool")
185
+ def my_tool():
186
+ return "Hello world!"
187
+
188
+ @judgment.observe(span_type="function")
189
+ def main():
190
+ task_input = my_tool()
191
+ res = client.chat.completions.create(
192
+ model="gpt-4.1",
193
+ messages=[{"role": "user", "content": f"{task_input}"}]
194
+ ).choices[0].message.content
195
+
196
+ judgment.async_evaluate(
197
+ scorers=[AnswerRelevancyScorer(threshold=0.5)],
198
+ input=task_input,
199
+ actual_output=res,
200
+ model="gpt-4.1"
201
+ )
202
+ print("Online evaluation submitted.")
203
+ return res
204
+
205
+ main()
206
+ ```
207
+
208
+ [Click here](https://docs.judgmentlabs.ai/getting-started#create-your-first-online-evaluation) for a more detailed explanation.
209
+
210
+ ## 🏢 Self-Hosting
211
+
212
+ Run Judgment on your own infrastructure: we provide comprehensive self-hosting capabilities that give you full control over the backend and data plane that Judgeval interfaces with.
213
+
214
+ ### Key Features
215
+ * Deploy Judgment on your own AWS account
216
+ * Store data in your own Supabase instance
217
+ * Access Judgment through your own custom domain
218
+
219
+ ### Getting Started
220
+ 1. Check out our [self-hosting documentation](https://docs.judgmentlabs.ai/self-hosting/get_started) for detailed setup instructions and details on how to access your self-hosted instance
221
+ 2. Use the [Judgment CLI](https://github.com/JudgmentLabs/judgment-cli) to deploy your self-hosted environment
222
+ 3. After your self-hosted instance is set up, make sure the `JUDGMENT_API_URL` environment variable is set to your self-hosted backend endpoint
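For example, assuming a POSIX shell and a hypothetical custom domain for your backend:

```shell
# Point the SDK at your self-hosted backend (hypothetical domain)
export JUDGMENT_API_URL="https://judgment.example.com"
```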
223
+
224
+ ## 📚 Cookbooks
225
+
226
+ Have your own? We're happy to feature it if you create a PR or message us on [Discord](https://discord.gg/taAufyhf).
227
+
228
+ You can access our repo of cookbooks [here](https://github.com/JudgmentLabs/judgment-cookbook). Here are some highlights:
229
+
230
+ ### Sample Agents
231
+
232
+ #### 💰 [LangGraph Financial QA Agent](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/financial_agent/demo.py)
233
+ A LangGraph-based agent for financial queries, featuring RAG capabilities with a vector database for contextual data retrieval and evaluation of its reasoning and data accuracy.
234
+
235
+ #### ✈️ [OpenAI Travel Agent](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/openai_travel_agent/agent.py)
236
+ A travel planning agent using OpenAI API calls, custom tool functions, and RAG with a vector database for up-to-date and contextual travel information. Evaluated for itinerary quality and information relevance.
237
+
238
+ ### Custom Evaluators
239
+
240
+ #### 🔍 [PII Detection](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/classifier_scorer/pii_checker.py)
241
+ Detecting and evaluating Personally Identifiable Information (PII) leakage.
242
+
243
+ #### 📧 [Cold Email Generation](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/custom_scorers/cold_email_scorer.py)
244
+
245
+ Evaluates whether a cold email generator properly uses all relevant information about the target recipient.
246
+
247
+ ## 💻 Development with Cursor
248
+ When building agents and LLM workflows in Cursor, giving your coding assistant the right context helps ensure a seamless integration with Judgment. The rule file below supplies the context your coding assistant needs for a successful implementation.
249
+
250
+ To use the rule file, copy the text below and save it in a `.cursor/rules` directory at the root of your project.
251
+
252
+ <details>
253
+
254
+ <summary>Cursor Rule File</summary>
255
+
256
+ ````
257
+ ---
258
+ You are an expert in helping users integrate Judgment with their codebase. When you are helping someone integrate Judgment tracing or evaluations with their agents/workflows, refer to this file.
259
+ ---
260
+
261
+ # Common Questions You May Get from the User (and How to Handle These Cases):
262
+
263
+ ## Sample Agent 1:
264
+ ```
265
+ from uuid import uuid4
266
+ import openai
267
+ import os
268
+ import asyncio
269
+ from tavily import TavilyClient
270
+ from dotenv import load_dotenv
271
+ import chromadb
272
+ from chromadb.utils import embedding_functions
273
+
274
+ destinations_data = [
275
+ {
276
+ "destination": "Paris, France",
277
+ "information": """
278
+ Paris is the capital city of France and a global center for art, fashion, and culture.
279
+ Key Information:
280
+ - Best visited during spring (March-May) or fall (September-November)
281
+ - Famous landmarks: Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, Arc de Triomphe
282
+ - Known for: French cuisine, café culture, fashion, art galleries
283
+ - Local transportation: Metro system is extensive and efficient
284
+ - Popular neighborhoods: Le Marais, Montmartre, Latin Quarter
285
+ - Cultural tips: Basic French phrases are appreciated; many restaurants close between lunch and dinner
286
+ - Must-try experiences: Seine River cruise, visiting local bakeries, Luxembourg Gardens
287
+ """
288
+ },
289
+ {
290
+ "destination": "Tokyo, Japan",
291
+ "information": """
292
+ Tokyo is Japan's bustling capital, blending ultramodern and traditional elements.
293
+ Key Information:
294
+ - Best visited during spring (cherry blossoms) or fall (autumn colors)
295
+ - Famous areas: Shibuya, Shinjuku, Harajuku, Akihabara
296
+ - Known for: Technology, anime culture, sushi, efficient public transport
297
+ - Local transportation: Extensive train and subway network
298
+ - Cultural tips: Bow when greeting, remove shoes indoors, no tipping
299
+ - Must-try experiences: Robot Restaurant, teamLab Borderless, Tsukiji Outer Market
300
+ - Popular day trips: Mount Fuji, Kamakura, Nikko
301
+ """
302
+ },
303
+ {
304
+ "destination": "New York City, USA",
305
+ "information": """
306
+ New York City is a global metropolis known for its diversity, culture, and iconic skyline.
307
+ Key Information:
308
+ - Best visited during spring (April-June) or fall (September-November)
309
+ - Famous landmarks: Statue of Liberty, Times Square, Central Park, Empire State Building
310
+ - Known for: Broadway shows, diverse cuisine, shopping, museums
311
+ - Local transportation: Extensive subway system, yellow cabs, ride-sharing
312
+ - Popular areas: Manhattan, Brooklyn, Queens
313
+ - Cultural tips: Fast-paced environment, tipping expected (15-20%)
314
+ - Must-try experiences: Broadway show, High Line walk, food tours
315
+ """
316
+ },
317
+ {
318
+ "destination": "Barcelona, Spain",
319
+ "information": """
320
+ Barcelona is a vibrant city known for its art, architecture, and Mediterranean culture.
321
+ Key Information:
322
+ - Best visited during spring and fall for mild weather
323
+ - Famous landmarks: Sagrada Familia, Park Güell, Casa Batlló
324
+ - Known for: Gaudi architecture, tapas, beach culture, FC Barcelona
325
+ - Local transportation: Metro, buses, and walkable city center
326
+ - Popular areas: Gothic Quarter, Eixample, La Barceloneta
327
+ - Cultural tips: Late dinner times (after 8 PM), siesta tradition
328
+ - Must-try experiences: La Rambla walk, tapas crawl, local markets
329
+ """
330
+ },
331
+ {
332
+ "destination": "Bangkok, Thailand",
333
+ "information": """
334
+ Bangkok is Thailand's capital city, famous for its temples, street food, and vibrant culture.
335
+ Key Information:
336
+ - Best visited during November to February (cool and dry season)
337
+ - Famous sites: Grand Palace, Wat Phra Kaew, Wat Arun
338
+ - Known for: Street food, temples, markets, nightlife
339
+ - Local transportation: BTS Skytrain, MRT, tuk-tuks, river boats
340
+ - Popular areas: Sukhumvit, Old City, Chinatown
341
+ - Cultural tips: Dress modestly at temples, respect royal family
342
+ - Must-try experiences: Street food tours, river cruises, floating markets
343
+ """
344
+ }
345
+ ]
346
+
347
+ client = openai.Client(api_key=os.getenv("OPENAI_API_KEY"))
348
+
349
+ def populate_vector_db(collection, destinations_data):
350
+ """
351
+ Populate the vector DB with travel information.
352
+ destinations_data should be a list of dictionaries with 'destination' and 'information' keys
353
+ """
354
+ for data in destinations_data:
355
+ collection.add(
356
+ documents=[data['information']],
357
+ metadatas=[{"destination": data['destination']}],
358
+ ids=[f"destination_{data['destination'].lower().replace(' ', '_')}"]
359
+ )
360
+
361
+ def search_tavily(query):
362
+ """Fetch travel data using Tavily API."""
363
+ API_KEY = os.getenv("TAVILY_API_KEY")
364
+ client = TavilyClient(api_key=API_KEY)
365
+ results = client.search(query, max_results=3)
366
+ return results
367
+
368
+ async def get_attractions(destination):
369
+ """Search for top attractions in the destination."""
370
+ prompt = f"Best tourist attractions in {destination}"
371
+ attractions_search = search_tavily(prompt)
372
+ return attractions_search
373
+
374
+ async def get_hotels(destination):
375
+ """Search for hotels in the destination."""
376
+ prompt = f"Best hotels in {destination}"
377
+ hotels_search = search_tavily(prompt)
378
+ return hotels_search
379
+
380
+ async def get_flights(destination):
381
+ """Search for flights to the destination."""
382
+ prompt = f"Flights to {destination} from major cities"
383
+ flights_search = search_tavily(prompt)
384
+ return flights_search
385
+
386
+ async def get_weather(destination, start_date, end_date):
387
+ """Search for weather information."""
388
+ prompt = f"Weather forecast for {destination} from {start_date} to {end_date}"
389
+ weather_search = search_tavily(prompt)
390
+ return weather_search
391
+
392
+ def initialize_vector_db():
393
+ """Initialize ChromaDB with OpenAI embeddings."""
394
+ client = chromadb.Client()
395
+ embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
396
+ api_key=os.getenv("OPENAI_API_KEY"),
397
+ model_name="text-embedding-3-small"
398
+ )
399
+ res = client.get_or_create_collection(
400
+ "travel_information",
401
+ embedding_function=embedding_fn
402
+ )
403
+ populate_vector_db(res, destinations_data)
404
+ return res
405
+
406
+ def query_vector_db(collection, destination, k=3):
407
+ """Query the vector database for existing travel information."""
408
+ try:
409
+ results = collection.query(
410
+ query_texts=[destination],
411
+ n_results=k
412
+ )
413
+ return results['documents'][0] if results['documents'] else []
414
+ except Exception:
415
+ return []
416
+
417
+ async def research_destination(destination, start_date, end_date):
418
+ """Gather all necessary travel information for a destination."""
419
+ # First, check the vector database
420
+ collection = initialize_vector_db()
421
+ existing_info = query_vector_db(collection, destination)
422
+
423
+ # Get real-time information from Tavily
424
+ tavily_data = {
425
+ "attractions": await get_attractions(destination),
426
+ "hotels": await get_hotels(destination),
427
+ "flights": await get_flights(destination),
428
+ "weather": await get_weather(destination, start_date, end_date)
429
+ }
430
+
431
+ return {
432
+ "vector_db_results": existing_info,
433
+ **tavily_data
434
+ }
435
+
436
+ async def create_travel_plan(destination, start_date, end_date, research_data):
437
+ """Generate a travel itinerary using the researched data."""
438
+ vector_db_context = "\n".join(research_data['vector_db_results']) if research_data['vector_db_results'] else "No pre-stored information available."
439
+
440
+ prompt = f"""
441
+ Create a structured travel itinerary for a trip to {destination} from {start_date} to {end_date}.
442
+
443
+ Pre-stored destination information:
444
+ {vector_db_context}
445
+
446
+ Current travel data:
447
+ - Attractions: {research_data['attractions']}
448
+ - Hotels: {research_data['hotels']}
449
+ - Flights: {research_data['flights']}
450
+ - Weather: {research_data['weather']}
451
+ """
452
+
453
+ response = client.chat.completions.create(
454
+ model="gpt-4.1",
455
+ messages=[
456
+ {"role": "system", "content": "You are an expert travel planner. Combine both historical and current information to create the best possible itinerary."},
457
+ {"role": "user", "content": prompt}
458
+ ]
459
+ ).choices[0].message.content
460
+
461
+ return response
462
+
463
+ async def generate_itinerary(destination, start_date, end_date):
464
+ """Main function to generate a travel itinerary."""
465
+ research_data = await research_destination(destination, start_date, end_date)
466
+ res = await create_travel_plan(destination, start_date, end_date, research_data)
467
+ return res
468
+
469
+
470
+ if __name__ == "__main__":
471
+ load_dotenv()
472
+ destination = input("Enter your travel destination: ")
473
+ start_date = input("Enter start date (YYYY-MM-DD): ")
474
+ end_date = input("Enter end date (YYYY-MM-DD): ")
475
+ itinerary = asyncio.run(generate_itinerary(destination, start_date, end_date))
476
+ print("\nGenerated Itinerary:\n", itinerary)
477
+ ```
478
+
479
+ ## Sample Query 1:
480
+ Can you add Judgment tracing to my file?
481
+
482
+ ## Example of Modified Code after Query 1:
483
+ ```
484
+ from uuid import uuid4
485
+ import openai
486
+ import os
487
+ import asyncio
488
+ from tavily import TavilyClient
489
+ from dotenv import load_dotenv
490
+ import chromadb
491
+ from chromadb.utils import embedding_functions
492
+
493
+ from judgeval.tracer import Tracer, wrap
494
+ from judgeval.scorers import AnswerRelevancyScorer, FaithfulnessScorer
495
+ from judgeval.data import Example
496
+
497
+ destinations_data = [
498
+ {
499
+ "destination": "Paris, France",
500
+ "information": """
501
+ Paris is the capital city of France and a global center for art, fashion, and culture.
502
+ Key Information:
503
+ - Best visited during spring (March-May) or fall (September-November)
504
+ - Famous landmarks: Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, Arc de Triomphe
505
+ - Known for: French cuisine, café culture, fashion, art galleries
506
+ - Local transportation: Metro system is extensive and efficient
507
+ - Popular neighborhoods: Le Marais, Montmartre, Latin Quarter
508
+ - Cultural tips: Basic French phrases are appreciated; many restaurants close between lunch and dinner
509
+ - Must-try experiences: Seine River cruise, visiting local bakeries, Luxembourg Gardens
510
+ """
511
+ },
512
+ {
513
+ "destination": "Tokyo, Japan",
514
+ "information": """
515
+ Tokyo is Japan's bustling capital, blending ultramodern and traditional elements.
516
+ Key Information:
517
+ - Best visited during spring (cherry blossoms) or fall (autumn colors)
518
+ - Famous areas: Shibuya, Shinjuku, Harajuku, Akihabara
519
+ - Known for: Technology, anime culture, sushi, efficient public transport
520
+ - Local transportation: Extensive train and subway network
521
+ - Cultural tips: Bow when greeting, remove shoes indoors, no tipping
522
+ - Must-try experiences: Robot Restaurant, teamLab Borderless, Tsukiji Outer Market
523
+ - Popular day trips: Mount Fuji, Kamakura, Nikko
524
+ """
525
+ },
526
+ {
527
+ "destination": "New York City, USA",
528
+ "information": """
529
+ New York City is a global metropolis known for its diversity, culture, and iconic skyline.
530
+ Key Information:
531
+ - Best visited during spring (April-June) or fall (September-November)
532
+ - Famous landmarks: Statue of Liberty, Times Square, Central Park, Empire State Building
533
+ - Known for: Broadway shows, diverse cuisine, shopping, museums
534
+ - Local transportation: Extensive subway system, yellow cabs, ride-sharing
535
+ - Popular areas: Manhattan, Brooklyn, Queens
536
+ - Cultural tips: Fast-paced environment, tipping expected (15-20%)
537
+ - Must-try experiences: Broadway show, High Line walk, food tours
538
+ """
539
+ },
540
+ {
541
+ "destination": "Barcelona, Spain",
542
+ "information": """
543
+ Barcelona is a vibrant city known for its art, architecture, and Mediterranean culture.
544
+ Key Information:
545
+ - Best visited during spring and fall for mild weather
546
+ - Famous landmarks: Sagrada Familia, Park Güell, Casa Batlló
547
+ - Known for: Gaudi architecture, tapas, beach culture, FC Barcelona
548
+ - Local transportation: Metro, buses, and walkable city center
549
+ - Popular areas: Gothic Quarter, Eixample, La Barceloneta
550
+ - Cultural tips: Late dinner times (after 8 PM), siesta tradition
551
+ - Must-try experiences: La Rambla walk, tapas crawl, local markets
552
+ """
553
+ },
554
+ {
555
+ "destination": "Bangkok, Thailand",
556
+ "information": """
557
+ Bangkok is Thailand's capital city, famous for its temples, street food, and vibrant culture.
558
+ Key Information:
559
+ - Best visited during November to February (cool and dry season)
560
+ - Famous sites: Grand Palace, Wat Phra Kaew, Wat Arun
561
+ - Known for: Street food, temples, markets, nightlife
562
+ - Local transportation: BTS Skytrain, MRT, tuk-tuks, river boats
563
+ - Popular areas: Sukhumvit, Old City, Chinatown
564
+ - Cultural tips: Dress modestly at temples, respect royal family
565
+ - Must-try experiences: Street food tours, river cruises, floating markets
566
+ """
567
+ }
568
+ ]
569
+
570
+ client = wrap(openai.Client(api_key=os.getenv("OPENAI_API_KEY")))
571
+ judgment = Tracer(api_key=os.getenv("JUDGMENT_API_KEY"), project_name="travel_agent_demo")
572
+
573
+ def populate_vector_db(collection, destinations_data):
574
+ """
575
+ Populate the vector DB with travel information.
576
+ destinations_data should be a list of dictionaries with 'destination' and 'information' keys
577
+ """
578
+ for data in destinations_data:
579
+ collection.add(
580
+ documents=[data['information']],
581
+ metadatas=[{"destination": data['destination']}],
582
+ ids=[f"destination_{data['destination'].lower().replace(' ', '_')}"]
583
+ )
584
+
585
+ @judgment.observe(span_type="search_tool")
586
+ def search_tavily(query):
587
+ """Fetch travel data using Tavily API."""
588
+ API_KEY = os.getenv("TAVILY_API_KEY")
589
+ client = TavilyClient(api_key=API_KEY)
590
+ results = client.search(query, max_results=3)
591
+ return results
592
+
593
+ @judgment.observe(span_type="tool")
594
+ async def get_attractions(destination):
595
+ """Search for top attractions in the destination."""
596
+ prompt = f"Best tourist attractions in {destination}"
597
+ attractions_search = search_tavily(prompt)
598
+ return attractions_search
599
+
600
+ @judgment.observe(span_type="tool")
601
+ async def get_hotels(destination):
602
+ """Search for hotels in the destination."""
603
+ prompt = f"Best hotels in {destination}"
604
+ hotels_search = search_tavily(prompt)
605
+ return hotels_search
606
+
607
+ @judgment.observe(span_type="tool")
608
+ async def get_flights(destination):
609
+ """Search for flights to the destination."""
610
+ prompt = f"Flights to {destination} from major cities"
611
+ flights_search = search_tavily(prompt)
612
+ example = Example(
613
+ input=prompt,
614
+ actual_output=str(flights_search["results"])
615
+ )
616
+ judgment.async_evaluate(
617
+ scorers=[AnswerRelevancyScorer(threshold=0.5)],
618
+ example=example,
619
+ model="gpt-4.1"
620
+ )
621
+ return flights_search
622
+
623
+ @judgment.observe(span_type="tool")
624
+ async def get_weather(destination, start_date, end_date):
625
+ """Search for weather information."""
626
+ prompt = f"Weather forecast for {destination} from {start_date} to {end_date}"
627
+ weather_search = search_tavily(prompt)
628
+ example = Example(
629
+ input=prompt,
630
+ actual_output=str(weather_search["results"])
631
+ )
632
+ judgment.async_evaluate(
633
+ scorers=[AnswerRelevancyScorer(threshold=0.5)],
634
+ example=example,
635
+ model="gpt-4.1"
636
+ )
637
+ return weather_search
638
+
639
+ def initialize_vector_db():
640
+ """Initialize ChromaDB with OpenAI embeddings."""
641
+ client = chromadb.Client()
642
+ embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
643
+ api_key=os.getenv("OPENAI_API_KEY"),
644
+ model_name="text-embedding-3-small"
645
+ )
646
+ res = client.get_or_create_collection(
647
+ "travel_information",
648
+ embedding_function=embedding_fn
649
+ )
650
+ populate_vector_db(res, destinations_data)
651
+ return res
652
+
653
+ @judgment.observe(span_type="retriever")
654
+ def query_vector_db(collection, destination, k=3):
655
+ """Query the vector database for existing travel information."""
656
+ try:
657
+ results = collection.query(
658
+ query_texts=[destination],
659
+ n_results=k
660
+ )
661
+ return results['documents'][0] if results['documents'] else []
662
+ except Exception:
663
+ return []
664
+
665
+ @judgment.observe(span_type="Research")
666
+ async def research_destination(destination, start_date, end_date):
667
+ """Gather all necessary travel information for a destination."""
668
+ # First, check the vector database
669
+ collection = initialize_vector_db()
670
+ existing_info = query_vector_db(collection, destination)
671
+
672
+ # Get real-time information from Tavily
673
+ tavily_data = {
674
+ "attractions": await get_attractions(destination),
675
+ "hotels": await get_hotels(destination),
676
+ "flights": await get_flights(destination),
677
+ "weather": await get_weather(destination, start_date, end_date)
678
+ }
679
+
680
+ return {
681
+ "vector_db_results": existing_info,
682
+ **tavily_data
683
+ }
684
+
685
+ @judgment.observe(span_type="function")
686
+ async def create_travel_plan(destination, start_date, end_date, research_data):
687
+ """Generate a travel itinerary using the researched data."""
688
+ vector_db_context = "\n".join(research_data['vector_db_results']) if research_data['vector_db_results'] else "No pre-stored information available."
689
+
690
+ prompt = f"""
691
+ Create a structured travel itinerary for a trip to {destination} from {start_date} to {end_date}.
692
+
693
+ Pre-stored destination information:
694
+ {vector_db_context}
695
+
696
+ Current travel data:
697
+ - Attractions: {research_data['attractions']}
698
+ - Hotels: {research_data['hotels']}
699
+ - Flights: {research_data['flights']}
700
+ - Weather: {research_data['weather']}
701
+ """
702
+
703
+ response = client.chat.completions.create(
704
+ model="gpt-4.1",
705
+ messages=[
706
+ {"role": "system", "content": "You are an expert travel planner. Combine both historical and current information to create the best possible itinerary."},
707
+ {"role": "user", "content": prompt}
708
+ ]
709
+ ).choices[0].message.content
710
+
711
+ example = Example(
712
+ input=prompt,
713
+ actual_output=str(response),
714
+ retrieval_context=[str(vector_db_context), str(research_data)]
715
+ )
716
+ judgment.async_evaluate(
717
+ scorers=[FaithfulnessScorer(threshold=0.5)],
718
+ example=example,
719
+ model="gpt-4.1"
720
+ )
721
+
722
+ return response
723
+
724
+ @judgment.observe(span_type="function")
725
+ async def generate_itinerary(destination, start_date, end_date):
726
+ """Main function to generate a travel itinerary."""
727
+ research_data = await research_destination(destination, start_date, end_date)
728
+ res = await create_travel_plan(destination, start_date, end_date, research_data)
729
+ return res
730
+
731
+
732
+ if __name__ == "__main__":
733
+ load_dotenv()
734
+ destination = input("Enter your travel destination: ")
735
+ start_date = input("Enter start date (YYYY-MM-DD): ")
736
+ end_date = input("Enter end date (YYYY-MM-DD): ")
737
+ itinerary = asyncio.run(generate_itinerary(destination, start_date, end_date))
738
+ print("\nGenerated Itinerary:\n", itinerary)
739
+ ```
740
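In `research_destination` above, the cached vector-DB context and the live Tavily lookups are combined with dict unpacking. A minimal stand-alone sketch of that merge (data values are illustrative):

```python
def merge_research(vector_db_results, tavily_data):
    # Combine cached context with freshly fetched data; the live fields
    # are spread in alongside the vector-DB hits.
    return {"vector_db_results": vector_db_results, **tavily_data}

merged = merge_research(["Paris travel notes"], {"weather": "mild", "hotels": []})
```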
+
741
+ ## Sample Agent 2
742
+ ```python
743
+ from langchain_openai import ChatOpenAI
744
+ import asyncio
745
+ import os
746
+
747
+ import chromadb
748
+ from chromadb.utils import embedding_functions
749
+
750
+ from vectordbdocs import financial_data
751
+
752
+ from typing import Optional
753
+ from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage, ChatMessage
754
+ from typing_extensions import TypedDict
755
+ from langgraph.graph import StateGraph
756
+
757
+ # Define our state type
758
+ class AgentState(TypedDict):
759
+ messages: list[BaseMessage]
760
+ category: Optional[str]
761
+ documents: Optional[str]
762
+
763
+ def populate_vector_db(collection, raw_data):
764
+ """
765
+ Populate the vector DB with financial information.
766
+ """
767
+ for data in raw_data:
768
+ collection.add(
769
+ documents=[data['information']],
770
+ metadatas=[{"category": data['category']}],
771
+ ids=[f"category_{data['category'].lower().replace(' ', '_')}_{os.urandom(4).hex()}"]
772
+ )
773
+
774
+ # Define a ChromaDB collection for document storage
775
+ client = chromadb.Client()
776
+ collection = client.get_or_create_collection(
777
+ name="financial_docs",
778
+ embedding_function=embedding_functions.OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))
779
+ )
780
+
781
+ populate_vector_db(collection, financial_data)
782
+
783
+ def pnl_retriever(state: AgentState) -> AgentState:
784
+ query = state["messages"][-1].content
785
+ results = collection.query(
786
+ query_texts=[query],
787
+ where={"category": "pnl"},
788
+ n_results=3
789
+ )
790
+ documents = []
791
+ for document in results["documents"]:
792
+ documents += document
793
+
794
+ return {"messages": state["messages"], "documents": documents}
795
+
796
+ def balance_sheet_retriever(state: AgentState) -> AgentState:
797
+ query = state["messages"][-1].content
798
+ results = collection.query(
799
+ query_texts=[query],
800
+ where={"category": "balance_sheets"},
801
+ n_results=3
802
+ )
803
+ documents = []
804
+ for document in results["documents"]:
805
+ documents += document
806
+
807
+ return {"messages": state["messages"], "documents": documents}
808
+
809
+ def stock_retriever(state: AgentState) -> AgentState:
810
+ query = state["messages"][-1].content
811
+ results = collection.query(
812
+ query_texts=[query],
813
+ where={"category": "stocks"},
814
+ n_results=3
815
+ )
816
+ documents = []
817
+ for document in results["documents"]:
818
+ documents += document
819
+
820
+ return {"messages": state["messages"], "documents": documents}
821
+
822
+ async def bad_classifier(state: AgentState) -> AgentState:
823
+ return {"messages": state["messages"], "category": "stocks"}
824
+
825
+ async def bad_classify(state: AgentState) -> AgentState:
826
+ category = await bad_classifier(state)
827
+
828
+ return {"messages": state["messages"], "category": category["category"]}
829
+
830
+ async def bad_sql_generator(state: AgentState) -> AgentState:
831
+ ACTUAL_OUTPUT = "SELECT * FROM pnl WHERE stock_symbol = 'GOOGL'"
832
+ return {"messages": state["messages"] + [ChatMessage(content=ACTUAL_OUTPUT, role="text2sql")]}
833
+
834
+ # Create the classifier node with a system prompt
835
+ async def classify(state: AgentState) -> AgentState:
836
+ messages = state["messages"]
837
+ input_msg = [
838
+ SystemMessage(content="""You are a financial query classifier. Your job is to classify user queries into one of three categories:
839
+ - 'pnl' for Profit and Loss related queries
840
+ - 'balance_sheets' for Balance Sheet related queries
841
+ - 'stocks' for Stock market related queries
842
+
843
+ Respond ONLY with the category name in lowercase, nothing else."""),
844
+ *messages
845
+ ]
846
+
847
+ response = ChatOpenAI(model="gpt-4.1", temperature=0).invoke(
848
+ input=input_msg
849
+ )
850
+
851
+ return {"messages": state["messages"], "category": response.content}
852
+
853
+ # Add router node to direct flow based on classification
854
+ def router(state: AgentState) -> str:
855
+ return state["category"]
856
+
857
+ async def generate_response(state: AgentState) -> AgentState:
858
+ messages = state["messages"]
859
+ documents = state.get("documents", "")
860
+
861
+ OUTPUT = """
862
+ SELECT
863
+ stock_symbol,
864
+ SUM(CASE WHEN transaction_type = 'buy' THEN quantity ELSE -quantity END) AS total_shares,
865
+ SUM(CASE WHEN transaction_type = 'buy' THEN quantity * price_per_share ELSE -quantity * price_per_share END) AS total_cost,
866
+ MAX(CASE WHEN transaction_type = 'buy' THEN price_per_share END) AS current_market_price
867
+ FROM
868
+ stock_transactions
869
+ WHERE
870
+ stock_symbol = 'META'
871
+ GROUP BY
872
+ stock_symbol;
873
+ """
874
+
875
+ return {"messages": messages + [ChatMessage(content=OUTPUT, role="text2sql")], "documents": documents}
876
+
877
+ async def main():
878
+ # Initialize the graph
879
+ graph_builder = StateGraph(AgentState)
880
+
881
+ # Add classifier node
882
+ # For failure test, pass in bad_classifier
883
+ graph_builder.add_node("classifier", classify)
884
+ # graph_builder.add_node("classifier", bad_classify)
885
+
886
+ # Add conditional edges based on classification
887
+ graph_builder.add_conditional_edges(
888
+ "classifier",
889
+ router,
890
+ {
891
+ "pnl": "pnl_retriever",
892
+ "balance_sheets": "balance_sheet_retriever",
893
+ "stocks": "stock_retriever"
894
+ }
895
+ )
896
+
897
+ # Add retriever nodes (placeholder functions for now)
898
+ graph_builder.add_node("pnl_retriever", pnl_retriever)
899
+ graph_builder.add_node("balance_sheet_retriever", balance_sheet_retriever)
900
+ graph_builder.add_node("stock_retriever", stock_retriever)
901
+
902
+ # Add edges from retrievers to response generator
903
+ graph_builder.add_node("response_generator", generate_response)
904
+ # graph_builder.add_node("response_generator", bad_sql_generator)
905
+ graph_builder.add_edge("pnl_retriever", "response_generator")
906
+ graph_builder.add_edge("balance_sheet_retriever", "response_generator")
907
+ graph_builder.add_edge("stock_retriever", "response_generator")
908
+
909
+ graph_builder.set_entry_point("classifier")
910
+ graph_builder.set_finish_point("response_generator")
911
+
912
+ # Compile the graph
913
+ graph = graph_builder.compile()
914
+
915
+ response = await graph.ainvoke({
916
+ "messages": [HumanMessage(content="Please calculate our PNL on Apple stock. Refer to table information from documents provided.")],
917
+ "category": None,
918
+ })
919
+
920
+ print(f"Response: {response['messages'][-1].content}")
921
+
922
+ if __name__ == "__main__":
923
+ asyncio.run(main())
924
+ ```
925
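In the graph above, `add_conditional_edges` resolves the router's return value through the category-to-node mapping. A plain-Python sketch of that dispatch, independent of LangGraph:

```python
routes = {
    "pnl": "pnl_retriever",
    "balance_sheets": "balance_sheet_retriever",
    "stocks": "stock_retriever",
}

def router(state):
    # Mirrors the router node: it simply reads the classifier's output.
    return state["category"]

def next_node(state):
    # Conceptually what add_conditional_edges does: look the router's
    # return value up in the mapping to pick the next graph node.
    return routes[router(state)]
```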
+
926
+ ## Sample Query 2:
927
+ Can you add Judgment tracing to my file?
928
+
929
+ ## Example of Modified Code after Query 2:
930
+ ```python
931
+ from langchain_openai import ChatOpenAI
932
+ import asyncio
933
+ import os
934
+
935
+ import chromadb
936
+ from chromadb.utils import embedding_functions
937
+
938
+ from vectordbdocs import financial_data
939
+
940
+ from typing import Optional
941
+ from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage, ChatMessage
942
+ from typing_extensions import TypedDict
943
+ from langgraph.graph import StateGraph
944
+
945
+ from judgeval.common.tracer import Tracer
946
+ from judgeval.integrations.langgraph import JudgevalCallbackHandler
947
+ from judgeval.scorers import AnswerCorrectnessScorer, FaithfulnessScorer
948
+ from judgeval.data import Example
949
+
950
+
951
+
952
+ judgment = Tracer(project_name="FINANCIAL_AGENT")
953
+
954
+ # Define our state type
955
+ class AgentState(TypedDict):
956
+ messages: list[BaseMessage]
957
+ category: Optional[str]
958
+ documents: Optional[str]
959
+
960
+ def populate_vector_db(collection, raw_data):
961
+ """
962
+ Populate the vector DB with financial information.
963
+ """
964
+ for data in raw_data:
965
+ collection.add(
966
+ documents=[data['information']],
967
+ metadatas=[{"category": data['category']}],
968
+ ids=[f"category_{data['category'].lower().replace(' ', '_')}_{os.urandom(4).hex()}"]
969
+ )
970
+
971
+ # Define a ChromaDB collection for document storage
972
+ client = chromadb.Client()
973
+ collection = client.get_or_create_collection(
974
+ name="financial_docs",
975
+ embedding_function=embedding_functions.OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))
976
+ )
977
+
978
+ populate_vector_db(collection, financial_data)
979
+
980
+ @judgment.observe(name="pnl_retriever", span_type="retriever")
981
+ def pnl_retriever(state: AgentState) -> AgentState:
982
+ query = state["messages"][-1].content
983
+ results = collection.query(
984
+ query_texts=[query],
985
+ where={"category": "pnl"},
986
+ n_results=3
987
+ )
988
+ documents = []
989
+ for document in results["documents"]:
990
+ documents += document
991
+
992
+ return {"messages": state["messages"], "documents": documents}
993
+
994
+ @judgment.observe(name="balance_sheet_retriever", span_type="retriever")
995
+ def balance_sheet_retriever(state: AgentState) -> AgentState:
996
+ query = state["messages"][-1].content
997
+ results = collection.query(
998
+ query_texts=[query],
999
+ where={"category": "balance_sheets"},
1000
+ n_results=3
1001
+ )
1002
+ documents = []
1003
+ for document in results["documents"]:
1004
+ documents += document
1005
+
1006
+ return {"messages": state["messages"], "documents": documents}
1007
+
1008
+ @judgment.observe(name="stock_retriever", span_type="retriever")
1009
+ def stock_retriever(state: AgentState) -> AgentState:
1010
+ query = state["messages"][-1].content
1011
+ results = collection.query(
1012
+ query_texts=[query],
1013
+ where={"category": "stocks"},
1014
+ n_results=3
1015
+ )
1016
+ documents = []
1017
+ for document in results["documents"]:
1018
+ documents += document
1019
+
1020
+ return {"messages": state["messages"], "documents": documents}
1021
+
1022
+ @judgment.observe(name="bad_classifier", span_type="llm")
1023
+ async def bad_classifier(state: AgentState) -> AgentState:
1024
+ return {"messages": state["messages"], "category": "stocks"}
1025
+
1026
+ @judgment.observe(name="bad_classify")
1027
+ async def bad_classify(state: AgentState) -> AgentState:
1028
+ category = await bad_classifier(state)
1029
+
1030
+ example = Example(
1031
+ input=state["messages"][-1].content,
1032
+ actual_output=category["category"],
1033
+ expected_output="pnl"
1034
+ )
1035
+ judgment.async_evaluate(
1036
+ scorers=[AnswerCorrectnessScorer(threshold=1)],
1037
+ example=example,
1038
+ model="gpt-4.1"
1039
+ )
1040
+
1041
+ return {"messages": state["messages"], "category": category["category"]}
1042
+
1043
+ @judgment.observe(name="bad_sql_generator", span_type="llm")
1044
+ async def bad_sql_generator(state: AgentState) -> AgentState:
1045
+ ACTUAL_OUTPUT = "SELECT * FROM pnl WHERE stock_symbol = 'GOOGL'"
1046
+
1047
+ example = Example(
1048
+ input=state["messages"][-1].content,
1049
+ actual_output=ACTUAL_OUTPUT,
1050
+ retrieval_context=state.get("documents", []),
1051
+ expected_output="""
1052
+ SELECT
1053
+ SUM(CASE
1054
+ WHEN transaction_type = 'sell' THEN (price_per_share - (SELECT price_per_share FROM stock_transactions WHERE stock_symbol = 'GOOGL' AND transaction_type = 'buy' LIMIT 1)) * quantity
1055
+ ELSE 0
1056
+ END) AS realized_pnl
1057
+ FROM
1058
+ stock_transactions
1059
+ WHERE
1060
+ stock_symbol = 'META';
1061
+ """
1062
+ )
1063
+ judgment.async_evaluate(
1064
+ scorers=[AnswerCorrectnessScorer(threshold=1), FaithfulnessScorer(threshold=1)],
1065
+ example=example,
1066
+ model="gpt-4.1"
1067
+ )
1068
+ return {"messages": state["messages"] + [ChatMessage(content=ACTUAL_OUTPUT, role="text2sql")]}
1069
+
1070
+ # Create the classifier node with a system prompt
1071
+ @judgment.observe(name="classify")
1072
+ async def classify(state: AgentState) -> AgentState:
1073
+ messages = state["messages"]
1074
+ input_msg = [
1075
+ SystemMessage(content="""You are a financial query classifier. Your job is to classify user queries into one of three categories:
1076
+ - 'pnl' for Profit and Loss related queries
1077
+ - 'balance_sheets' for Balance Sheet related queries
1078
+ - 'stocks' for Stock market related queries
1079
+
1080
+ Respond ONLY with the category name in lowercase, nothing else."""),
1081
+ *messages
1082
+ ]
1083
+
1084
+ response = ChatOpenAI(model="gpt-4.1", temperature=0).invoke(
1085
+ input=input_msg
1086
+ )
1087
+
1088
+ example = Example(
1089
+ input=str(input_msg),
1090
+ actual_output=response.content,
1091
+ expected_output="pnl"
1092
+ )
1093
+ judgment.async_evaluate(
1094
+ scorers=[AnswerCorrectnessScorer(threshold=1)],
1095
+ example=example,
1096
+ model="gpt-4.1"
1097
+ )
1098
+
1099
+ return {"messages": state["messages"], "category": response.content}
1100
+
1101
+ # Add router node to direct flow based on classification
1102
+ def router(state: AgentState) -> str:
1103
+ return state["category"]
1104
+
1105
+ @judgment.observe(name="generate_response")
1106
+ async def generate_response(state: AgentState) -> AgentState:
1107
+ messages = state["messages"]
1108
+ documents = state.get("documents", "")
1109
+
1110
+ OUTPUT = """
1111
+ SELECT
1112
+ stock_symbol,
1113
+ SUM(CASE WHEN transaction_type = 'buy' THEN quantity ELSE -quantity END) AS total_shares,
1114
+ SUM(CASE WHEN transaction_type = 'buy' THEN quantity * price_per_share ELSE -quantity * price_per_share END) AS total_cost,
1115
+ MAX(CASE WHEN transaction_type = 'buy' THEN price_per_share END) AS current_market_price
1116
+ FROM
1117
+ stock_transactions
1118
+ WHERE
1119
+ stock_symbol = 'META'
1120
+ GROUP BY
1121
+ stock_symbol;
1122
+ """
1123
+
1124
+ example = Example(
1125
+ input=messages[-1].content,
1126
+ actual_output=OUTPUT,
1127
+ retrieval_context=documents,
1128
+ expected_output="""
1129
+ SELECT
1130
+ stock_symbol,
1131
+ SUM(CASE WHEN transaction_type = 'buy' THEN quantity ELSE -quantity END) AS total_shares,
1132
+ SUM(CASE WHEN transaction_type = 'buy' THEN quantity * price_per_share ELSE -quantity * price_per_share END) AS total_cost,
1133
+ MAX(CASE WHEN transaction_type = 'buy' THEN price_per_share END) AS current_market_price
1134
+ FROM
1135
+ stock_transactions
1136
+ WHERE
1137
+ stock_symbol = 'META'
1138
+ GROUP BY
1139
+ stock_symbol;
1140
+ """
1141
+ )
1142
+ judgment.async_evaluate(
1143
+ scorers=[AnswerCorrectnessScorer(threshold=1), FaithfulnessScorer(threshold=1)],
1144
+ example=example,
1145
+ model="gpt-4.1"
1146
+ )
1147
+
1148
+ return {"messages": messages + [ChatMessage(content=OUTPUT, role="text2sql")], "documents": documents}
1149
+
1150
+ async def main():
1151
+ with judgment.trace(
1152
+ "run_1",
1153
+ project_name="FINANCIAL_AGENT",
1154
+ overwrite=True
1155
+ ) as trace:
1156
+
1157
+ # Initialize the graph
1158
+ graph_builder = StateGraph(AgentState)
1159
+
1160
+ # Add classifier node
1161
+ # For failure test, pass in bad_classifier
1162
+ graph_builder.add_node("classifier", classify)
1163
+ # graph_builder.add_node("classifier", bad_classify)
1164
+
1165
+ # Add conditional edges based on classification
1166
+ graph_builder.add_conditional_edges(
1167
+ "classifier",
1168
+ router,
1169
+ {
1170
+ "pnl": "pnl_retriever",
1171
+ "balance_sheets": "balance_sheet_retriever",
1172
+ "stocks": "stock_retriever"
1173
+ }
1174
+ )
1175
+
1176
+ # Add retriever nodes (placeholder functions for now)
1177
+ graph_builder.add_node("pnl_retriever", pnl_retriever)
1178
+ graph_builder.add_node("balance_sheet_retriever", balance_sheet_retriever)
1179
+ graph_builder.add_node("stock_retriever", stock_retriever)
1180
+
1181
+ # Add edges from retrievers to response generator
1182
+ graph_builder.add_node("response_generator", generate_response)
1183
+ # graph_builder.add_node("response_generator", bad_sql_generator)
1184
+ graph_builder.add_edge("pnl_retriever", "response_generator")
1185
+ graph_builder.add_edge("balance_sheet_retriever", "response_generator")
1186
+ graph_builder.add_edge("stock_retriever", "response_generator")
1187
+
1188
+ graph_builder.set_entry_point("classifier")
1189
+ graph_builder.set_finish_point("response_generator")
1190
+
1191
+ # Compile the graph
1192
+ graph = graph_builder.compile()
1193
+
1194
+ handler = JudgevalCallbackHandler(trace)
1195
+
1196
+ response = await graph.ainvoke({
1197
+ "messages": [HumanMessage(content="Please calculate our PNL on Apple stock. Refer to table information from documents provided.")],
1198
+ "category": None,
1199
+ }, config=dict(callbacks=[handler]))
1200
+ trace.save()
1201
+
1202
+ print(f"Response: {response['messages'][-1].content}")
1203
+
1204
+ if __name__ == "__main__":
1205
+ asyncio.run(main())
1206
+ ```
1207
+
1208
+
1209
+ # Official Judgment Documentation
1210
+
1211
+
1212
+ <div align="center">
1213
+
1214
+ <img src="assets/logo-light.svg#gh-light-mode-only" alt="Judgment Logo" width="400" />
1215
+ <img src="assets/logo-dark.svg#gh-dark-mode-only" alt="Judgment Logo" width="400" />
1216
+
1217
+ **Build monitoring & evaluation pipelines for complex agents**
1218
+
1219
+ <img src="assets/experiments_page.png" alt="Judgment Platform Experiments Page" width="800" />
1220
+
1221
+ <br>
1222
+
1223
+ ## [🌐 Landing Page](https://www.judgmentlabs.ai/) • Twitter/X • 💼 LinkedIn • [📚 Docs](https://docs.judgmentlabs.ai/introduction) • [🚀 Demos](https://www.youtube.com/@AlexShan-j3o) • 🎮 Discord
1224
+ </div>
1225
+
1226
+ ## Judgeval: open-source testing, monitoring, and optimization for AI agents
1227
+
1228
+ Judgeval offers robust tooling for evaluating and tracing LLM agent systems. It is dev-friendly and open-source (licensed under Apache 2.0).
1229
+
1230
+ Judgeval gets you started in five minutes, after which you'll be ready to use all of its features as your agent becomes more complex. Judgeval is natively connected to the Judgment Platform for free, and you can export your data and self-host at any time.
1231
+
1232
+ We support tracing agents built with LangGraph, OpenAI SDK, Anthropic, ... and allow custom eval integrations for any use case. Check out our quickstarts below or our setup guide to get started.
1233
+
1234
+ Judgeval is created and maintained by Judgment Labs.
1235
+
1236
+ ## 📋 Table of Contents
1237
+ * ✨ Features
1238
+ * 🔍 Tracing
1239
+ * 🧪 Evals
1240
+ * 📡 Monitoring
1241
+ * 📊 Datasets
1242
+ * 💡 Insights
1243
+ * 🛠️ Installation
1244
+ * 🏁 Get Started
1245
+ * 🏢 Self-Hosting
1246
+ * 📚 Cookbooks
1247
+ * ⭐ Star Us on GitHub
1248
+ * ❤️ Contributors
1249
+
1250
+ <!-- Created by https://github.com/ekalinin/github-markdown-toc -->
1251
+
1252
+
1253
+ ## ✨ Features
1254
+
1255
+ | | |
1256
+ |:---|:---:|
1257
+ | <h3>🔍 Tracing</h3>Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic): **tracking inputs/outputs, latency, and cost** at every step.<br><br>Online evals can be applied to traces to measure quality on production data in real-time.<br><br>Export trace data to the Judgment Platform or your own S3 buckets, {Parquet, JSON, YAML} files, or data warehouse.<br><br>**Useful for:**<br>• 🐛 Debugging agent runs <br>• 👤 Tracking user activity <br>• 🔬 Pinpointing performance bottlenecks| <p align="center"><img src="assets/trace_screenshot.png" alt="Tracing visualization" width="1200"/></p> |
1258
+ | <h3>🧪 Evals</h3>15+ research-backed metrics including tool call accuracy, hallucinations, instruction adherence, and retrieval context recall.<br><br>Build custom evaluators that connect with our metric-tracking infrastructure. <br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 Experimental prompt testing<br>• 🛡️ Online guardrails <br><br> | <p align="center"><img src="assets/experiments_page.png" alt="Evaluation metrics" width="800"/></p> |
1259
+ | <h3>📡 Monitoring</h3>Real-time performance tracking of your agents in production environments. **Track all your metrics in one place.**<br><br>Set up **Slack/email alerts** for critical metrics and receive notifications when thresholds are exceeded.<br><br> **Useful for:** <br>• 📉 Identifying degradation early <br>• 📈 Visualizing performance trends across versions and time | <p align="center"><img src="assets/monitoring_screenshot.png" alt="Monitoring Dashboard" width="1200"/></p> |
1260
+ | <h3>📊 Datasets</h3>Export trace data or import external testcases to datasets hosted on Judgment's Platform. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations. <br><br> **Useful for:**<br>• 🔄 Scaled analysis for A/B tests <br>• 🗃️ Filtered collections of agent runtime data| <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |
1261
+ | <h3>💡 Insights</h3>Cluster on your data to reveal common use cases and failure modes.<br><br>Trace failures to their exact source with Judgment's Osiris agent, which localizes errors to specific components for precise fixes.<br><br> **Useful for:**<br>• 🔮 Surfacing common inputs that lead to errors<br>• 🤖 Investigating agent/user behavior for optimization <br>| <p align="center"><img src="assets/dataset_clustering_screenshot_dm.png" alt="Insights dashboard" width="1200"/></p> |
1262
+
1263
+ ## 🛠️ Installation
1264
+
1265
+ Get started with Judgeval by installing our SDK using pip:
1266
+
1267
+ ```bash
1268
+ pip install judgeval
1269
+ ```
1270
+
1271
+ Ensure you have your `JUDGMENT_API_KEY` and `JUDGMENT_ORG_ID` environment variables set to connect to the Judgment platform.
1272
+
1273
+ **If you don't have keys, create an account on the platform!**
1274
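As a quick sanity check, you can verify that both variables are visible to your process before running anything. This helper is a hypothetical convenience, not part of the judgeval SDK:

```python
import os

def missing_judgment_vars(env=None):
    """Return the names of required Judgment variables that are not set."""
    env = os.environ if env is None else env
    required = ("JUDGMENT_API_KEY", "JUDGMENT_ORG_ID")
    # A variable counts as missing when absent or empty.
    return [name for name in required if not env.get(name)]
```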
+
1275
+ ## 🏁 Get Started
1276
+
1277
+ Here's how you can quickly start using Judgeval:
1278
+
1279
+ ### 🛰️ Tracing
1280
+
1281
+ Track your agent's execution with full observability in just a few lines of code.
1282
+ Create a file named `traces.py` with the following code:
1283
+
1284
+ ```python
1285
+ from judgeval.common.tracer import Tracer, wrap
1286
+ from openai import OpenAI
1287
+
1288
+ client = wrap(OpenAI())
1289
+ judgment = Tracer(project_name="my_project")
1290
+
1291
+ @judgment.observe(span_type="tool")
1292
+ def my_tool():
1293
+ return "What's the capital of the U.S.?"
1294
+
1295
+ @judgment.observe(span_type="function")
1296
+ def main():
1297
+ task_input = my_tool()
1298
+ res = client.chat.completions.create(
1299
+ model="gpt-4.1",
1300
+ messages=[{"role": "user", "content": f"{task_input}"}]
1301
+ )
1302
+ return res.choices[0].message.content
1303
+
1304
+ main()
1305
+ ```
1306
+
1307
+ See the documentation for a more detailed explanation.
1308
+
1309
+ ### 📝 Offline Evaluations
1310
+
1311
+ You can evaluate your agent's execution to measure quality metrics such as hallucination.
1312
+ Create a file named `evaluate.py` with the following code:
1313
+
1314
+ ```python evaluate.py
1315
+ from judgeval import JudgmentClient
1316
+ from judgeval.data import Example
1317
+ from judgeval.scorers import FaithfulnessScorer
1318
+
1319
+ client = JudgmentClient()
1320
+
1321
+ example = Example(
1322
+ input="What if these shoes don't fit?",
1323
+ actual_output="We offer a 30-day full refund at no extra cost.",
1324
+ retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
1325
+ )
1326
+
1327
+ scorer = FaithfulnessScorer(threshold=0.5)
1328
+ results = client.run_evaluation(
1329
+ examples=[example],
1330
+ scorers=[scorer],
1331
+ model="gpt-4.1",
1332
+ )
1333
+ print(results)
1334
+ ```
1335
+
1336
+ See the documentation for a more detailed explanation.
1337
+
1338
+ ### 📡 Online Evaluations
1339
+
1340
+ Attach performance monitoring to traces to measure the quality of your systems in production.
1341
+
1342
+ Using the same `traces.py` file we created earlier, modify the `main` function:
1343
+
1344
+ ```python
1345
+ from judgeval.common.tracer import Tracer, wrap
1346
+ from judgeval.scorers import AnswerRelevancyScorer
1347
+ from openai import OpenAI
1348
+
1349
+ client = wrap(OpenAI())
1350
+ judgment = Tracer(project_name="my_project")
1351
+
1352
+ @judgment.observe(span_type="tool")
1353
+ def my_tool():
1354
+ return "Hello world!"
1355
+
1356
+ @judgment.observe(span_type="function")
1357
+ def main():
1358
+ task_input = my_tool()
1359
+ res = client.chat.completions.create(
1360
+ model="gpt-4.1",
1361
+ messages=[{"role": "user", "content": f"{task_input}"}]
1362
+ ).choices[0].message.content
1363
+
1364
+ judgment.async_evaluate(
1365
+ scorers=[AnswerRelevancyScorer(threshold=0.5)],
1366
+ input=task_input,
1367
+ actual_output=res,
1368
+ model="gpt-4.1"
1369
+ )
1370
+ print("Online evaluation submitted.")
1371
+ return res
1372
+
1373
+ main()
1374
+ ```
1375
+
1376
+ See the documentation for a more detailed explanation.
1377
+
1378
+ ## 🏢 Self-Hosting
1379
+
1380
+ Run Judgment on your own infrastructure: we provide comprehensive self-hosting capabilities that give you full control over the backend and data plane that Judgeval interfaces with.
1381
+
1382
+ ### Key Features
1383
+ * Deploy Judgment on your own AWS account
1384
+ * Store data in your own Supabase instance
1385
+ * Access Judgment through your own custom domain
1386
+
1387
+ ### Getting Started
1388
+ 1. Check out our self-hosting documentation for detailed setup instructions, including how to access your self-hosted instance
1389
+ 2. Use the Judgment CLI to deploy your self-hosted environment
1390
+ 3. After your self-hosted instance is set up, make sure the `JUDGMENT_API_URL` environment variable is set to your self-hosted backend endpoint
1391
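Step 3 can also be done in code before importing the SDK; the endpoint below is a placeholder for your own deployment:

```python
import os

# Point judgeval at a self-hosted backend instead of the hosted platform.
# The URL is a placeholder; substitute your deployment's endpoint.
os.environ["JUDGMENT_API_URL"] = "https://judgment.example.com"
```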
+
1392
+ ## 📚 Cookbooks
1393
+
1394
+ Have your own? We're happy to feature it if you create a PR or message us on Discord.
1395
+
1396
+ You can find all of our cookbooks in our repo. Here are some highlights:
1397
+
1398
+ ### Sample Agents
1399
+
1400
+ #### 💰 LangGraph Financial QA Agent
1401
+ A LangGraph-based agent for financial queries, featuring RAG capabilities with a vector database for contextual data retrieval and evaluation of its reasoning and data accuracy.
1402
+
1403
+ #### ✈️ OpenAI Travel Agent
1404
+ A travel planning agent using OpenAI API calls, custom tool functions, and RAG with a vector database for up-to-date and contextual travel information. Evaluated for itinerary quality and information relevance.
1405
+
1406
+ ### Custom Evaluators
1407
+
1408
+ #### 🔍 PII Detection
1409
+ Detecting and evaluating Personally Identifiable Information (PII) leakage.
1410
+
1411
+ #### 📧 Cold Email Generation
1412
+
1413
+ Evaluates if a cold email generator properly utilizes all relevant information about the target recipient.
1414
+
1415
+ ## ⭐ Star Us on GitHub
1416
+
1417
+ If you find Judgeval useful, please consider giving us a star on GitHub! Your support helps us grow our community and continue improving the product.
1418
+
1419
+
1420
+ ## ❤️ Contributors
1421
+
1422
+ There are many ways to contribute to Judgeval:
1423
+
1424
+ - Submit [bug reports](https://github.com/JudgmentLabs/judgeval/issues) and [feature requests](https://github.com/JudgmentLabs/judgeval/issues)
1425
+ - Review the documentation and submit [Pull Requests](https://github.com/JudgmentLabs/judgeval/pulls) to improve it
1426
+ - Speak or write about Judgment and let us know!
1427
+
1428
+ <!-- Contributors collage -->
1429
+ [![Contributors](https://contributors-img.web.app/image?repo=JudgmentLabs/judgeval)](https://github.com/JudgmentLabs/judgeval/graphs/contributors)
1430
+
1431
+ ````
1432
+
1433
+ </details>
1434
+
1435
+ ## ⭐ Star Us on GitHub
1436
+
1437
+ If you find Judgeval useful, please consider giving us a star on GitHub! Your support helps us grow our community and continue improving the product.
1438
+
1439
+
1440
+ ## ❤️ Contributors
1441
+
1442
+ There are many ways to contribute to Judgeval:
1443
+
1444
+ - Submit [bug reports](https://github.com/JudgmentLabs/judgeval/issues) and [feature requests](https://github.com/JudgmentLabs/judgeval/issues)
1445
+ - Review the documentation and submit [Pull Requests](https://github.com/JudgmentLabs/judgeval/pulls) to improve it
1446
+ - Speak or write about Judgment and let us know!
1447
+
1448
+ <!-- Contributors collage -->
1449
+ [![Contributors](https://contributors-img.web.app/image?repo=JudgmentLabs/judgeval)](https://github.com/JudgmentLabs/judgeval/graphs/contributors)
1450
+