judgeval 0.0.38__py3-none-any.whl → 0.0.40__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -0,0 +1,1441 @@
1
+ Metadata-Version: 2.4
2
+ Name: judgeval
3
+ Version: 0.0.40
4
+ Summary: Judgeval Package
5
+ Project-URL: Homepage, https://github.com/JudgmentLabs/judgeval
6
+ Project-URL: Issues, https://github.com/JudgmentLabs/judgeval/issues
7
+ Author-email: Andrew Li <andrew@judgmentlabs.ai>, Alex Shan <alex@judgmentlabs.ai>, Joseph Camyre <joseph@judgmentlabs.ai>
8
+ License-Expression: Apache-2.0
9
+ License-File: LICENSE.md
10
+ Classifier: Operating System :: OS Independent
11
+ Classifier: Programming Language :: Python :: 3
12
+ Requires-Python: >=3.11
13
+ Requires-Dist: anthropic
14
+ Requires-Dist: boto3
15
+ Requires-Dist: google-genai
16
+ Requires-Dist: langchain-anthropic
17
+ Requires-Dist: langchain-core
18
+ Requires-Dist: langchain-huggingface
19
+ Requires-Dist: langchain-openai
20
+ Requires-Dist: litellm==1.61.15
21
+ Requires-Dist: nest-asyncio
22
+ Requires-Dist: openai
23
+ Requires-Dist: pandas
24
+ Requires-Dist: python-dotenv==1.0.1
25
+ Requires-Dist: requests
26
+ Requires-Dist: together
27
+ Description-Content-Type: text/markdown
28
+
29
+ <div align="center">
30
+
31
+ <img src="assets/new_lightmode.svg#gh-light-mode-only" alt="Judgment Logo" width="400" />
32
+ <img src="assets/new_darkmode.svg#gh-dark-mode-only" alt="Judgment Logo" width="400" />
33
+
34
+ **Build monitoring & evaluation pipelines for complex agents**
35
+
36
+ <img src="assets/experiments_pagev2.png" alt="Judgment Platform Experiments Page" width="800" />
37
+
38
+ <br>
39
+
40
+ ## [🌐 Landing Page](https://www.judgmentlabs.ai/) • [📚 Docs](https://judgment.mintlify.app/getting_started) • [🚀 Demos](https://www.youtube.com/@AlexShan-j3o)
41
+
42
+ [![X](https://img.shields.io/badge/-X/Twitter-000?logo=x&logoColor=white)](https://x.com/JudgmentLabs)
43
+ [![LinkedIn](https://custom-icon-badges.demolab.com/badge/LinkedIn%20-0A66C2?logo=linkedin-white&logoColor=fff)](https://www.linkedin.com/company/judgmentlabs)
44
+ [![Discord](https://img.shields.io/badge/-Discord-5865F2?logo=discord&logoColor=white)](https://discord.gg/FMxHkYTtFE)
45
+
46
+ </div>
47
+
48
+ ## Judgeval: open-source testing, monitoring, and optimization for AI agents
49
+
50
+ Judgeval offers robust tooling for evaluating and tracing LLM agent systems. It is dev-friendly and open-source (licensed under Apache 2.0).
51
+
52
+ Judgeval gets you started in five minutes, after which you'll be ready to use all of its features as your agent becomes more complex. Judgeval is natively connected to the [Judgment Platform](https://www.judgmentlabs.ai/) for free, and you can export your data and self-host at any time.
53
+
54
+ We support tracing for agents built with LangGraph, the OpenAI SDK, Anthropic, and more, and allow custom eval integrations for any use case. Check out our quickstarts below or our [setup guide](https://docs.judgmentlabs.ai/getting-started) to get started.
55
+
56
+ Judgeval is created and maintained by [Judgment Labs](https://judgmentlabs.ai/).
57
+
58
+ ## 📋 Table of Contents
59
+ * [✨ Features](#-features)
60
+ * [🔍 Tracing](#-tracing)
61
+ * [🧪 Evals](#-evals)
62
+ * [📡 Monitoring](#-monitoring)
63
+ * [📊 Datasets](#-datasets)
64
+ * [💡 Insights](#-insights)
65
+ * [🛠️ Installation](#️-installation)
66
+ * [🏁 Get Started](#-get-started)
67
+ * [🏢 Self-Hosting](#-self-hosting)
68
+ * [📚 Cookbooks](#-cookbooks)
69
+ * [💻 Development with Cursor](#-development-with-cursor)
70
+ * [⭐ Star Us on GitHub](#-star-us-on-github)
71
+ * [❤️ Contributors](#️-contributors)
72
+
73
+ <!-- Created by https://github.com/ekalinin/github-markdown-toc -->
74
+
75
+
76
+ ## ✨ Features
77
+
78
+ | | |
79
+ |:---|:---:|
80
+ | <h3>🔍 Tracing</h3>Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic): **tracking inputs/outputs, latency, and cost** at every step.<br><br>Online evals can be applied to traces to measure quality on production data in real-time.<br><br>Export trace data to the Judgment Platform or your own S3 buckets, {Parquet, JSON, YAML} files, or data warehouse.<br><br>**Useful for:**<br>• 🐛 Debugging agent runs <br>• 👤 Tracking user activity <br>• 🔬 Pinpointing performance bottlenecks| <p align="center"><img src="assets/trace_screenshot.png" alt="Tracing visualization" width="1200"/></p> |
81
+ | <h3>🧪 Evals</h3>15+ research-backed metrics including tool call accuracy, hallucinations, instruction adherence, and retrieval context recall.<br><br>Build custom evaluators that connect with our metric-tracking infrastructure. <br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 Experimental prompt testing<br>• 🛡️ Online guardrails <br><br> | <p align="center"><img src="assets/experiments_page.png" alt="Evaluation metrics" width="800"/></p> |
82
+ | <h3>📡 Monitoring</h3>Real-time performance tracking of your agents in production environments. **Track all your metrics in one place.**<br><br>Set up **Slack/email alerts** for critical metrics and receive notifications when thresholds are exceeded.<br><br> **Useful for:** <br>• 📉 Identifying degradation early <br>• 📈 Visualizing performance trends across versions and time | <p align="center"><img src="assets/monitoring_screenshot.png" alt="Monitoring Dashboard" width="1200"/></p> |
83
+ | <h3>📊 Datasets</h3>Export trace data or import external testcases to datasets hosted on Judgment's Platform. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations. <br><br> **Useful for:**<br>• 🔄 Scaled analysis for A/B tests <br>• 🗃️ Filtered collections of agent runtime data| <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |
84
+ | <h3>💡 Insights</h3>Cluster your data to reveal common use cases and failure modes.<br><br>Trace failures to their exact source with Judgment's Osiris agent, which localizes errors to specific components for precise fixes.<br><br> **Useful for:**<br>• 🔮 Surfacing common inputs that lead to errors<br>• 🤖 Investigating agent/user behavior for optimization <br>| <p align="center"><img src="assets/dataset_clustering_screenshot_dm.png" alt="Insights dashboard" width="1200"/></p> |
85
+
86
+ ## 🛠️ Installation
87
+
88
+ Get started with Judgeval by installing our SDK using pip:
89
+
90
+ ```bash
91
+ pip install judgeval
92
+ ```
93
+
94
+ Ensure you have your `JUDGMENT_API_KEY` and `JUDGMENT_ORG_ID` environment variables set to connect to the [Judgment platform](https://app.judgmentlabs.ai/).
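For example, you can set them in your shell before running your code (the values below are placeholders for your actual credentials):

```shell
# Placeholder credentials — substitute the values from your Judgment account
export JUDGMENT_API_KEY="your_api_key"
export JUDGMENT_ORG_ID="your_org_id"
```

Alternatively, since Judgeval depends on `python-dotenv`, you can keep these in a `.env` file and load them with `load_dotenv()`.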
95
+
96
+ **If you don't have keys, [create an account](https://app.judgmentlabs.ai/register) on the platform!**
97
+
98
+ ## 🏁 Get Started
99
+
100
+ Here's how you can quickly start using Judgeval:
101
+
102
+ ### 🛰️ Tracing
103
+
104
+ Track your agent's execution with full observability in just a few lines of code.
105
+ Create a file named `traces.py` with the following code:
106
+
107
+ ```python
108
+ from judgeval.common.tracer import Tracer, wrap
109
+ from openai import OpenAI
110
+
111
+ client = wrap(OpenAI())
112
+ judgment = Tracer(project_name="my_project")
113
+
114
+ @judgment.observe(span_type="tool")
115
+ def my_tool():
116
+ return "What's the capital of the U.S.?"
117
+
118
+ @judgment.observe(span_type="function")
119
+ def main():
120
+ task_input = my_tool()
121
+ res = client.chat.completions.create(
122
+ model="gpt-4.1",
123
+ messages=[{"role": "user", "content": f"{task_input}"}]
124
+ )
125
+ return res.choices[0].message.content
126
+
127
+ main()
128
+ ```
129
+
130
+ [Click here](https://docs.judgmentlabs.ai/getting-started#create-your-first-trace) for a more detailed explanation.
131
+
132
+ ### 📝 Offline Evaluations
133
+
134
+ You can evaluate your agent's execution to measure quality metrics such as hallucination.
135
+ Create a file named `evaluate.py` with the following code:
136
+
137
+ ```python
138
+ from judgeval import JudgmentClient
139
+ from judgeval.data import Example
140
+ from judgeval.scorers import FaithfulnessScorer
141
+
142
+ client = JudgmentClient()
143
+
144
+ example = Example(
145
+ input="What if these shoes don't fit?",
146
+ actual_output="We offer a 30-day full refund at no extra cost.",
147
+ retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
148
+ )
149
+
150
+ scorer = FaithfulnessScorer(threshold=0.5)
151
+ results = client.run_evaluation(
152
+ examples=[example],
153
+ scorers=[scorer],
154
+ model="gpt-4.1",
155
+ )
156
+ print(results)
157
+ ```
158
+
159
+ [Click here](https://docs.judgmentlabs.ai/getting-started#create-your-first-experiment) for a more detailed explanation.
160
+
161
+ ### 📡 Online Evaluations
162
+
163
+ Attach performance monitoring to traces to measure the quality of your systems in production.
164
+
165
+ Using the same `traces.py` file we created earlier, modify the `main` function:
166
+
167
+ ```python
168
+ from judgeval.common.tracer import Tracer, wrap
169
+ from judgeval.scorers import AnswerRelevancyScorer
170
+ from openai import OpenAI
171
+
172
+ client = wrap(OpenAI())
173
+ judgment = Tracer(project_name="my_project")
174
+
175
+ @judgment.observe(span_type="tool")
176
+ def my_tool():
177
+ return "Hello world!"
178
+
179
+ @judgment.observe(span_type="function")
180
+ def main():
181
+ task_input = my_tool()
182
+ res = client.chat.completions.create(
183
+ model="gpt-4.1",
184
+ messages=[{"role": "user", "content": f"{task_input}"}]
185
+ ).choices[0].message.content
186
+
187
+ judgment.async_evaluate(
188
+ scorers=[AnswerRelevancyScorer(threshold=0.5)],
189
+ input=task_input,
190
+ actual_output=res,
191
+ model="gpt-4.1"
192
+ )
193
+ print("Online evaluation submitted.")
194
+ return res
195
+
196
+ main()
197
+ ```
198
+
199
+ [Click here](https://docs.judgmentlabs.ai/getting-started#create-your-first-online-evaluation) for a more detailed explanation.
200
+
201
+ ## 🏢 Self-Hosting
202
+
203
+ Run Judgment on your own infrastructure: we provide comprehensive self-hosting capabilities that give you full control over the backend and data plane that Judgeval interfaces with.
204
+
205
+ ### Key Features
206
+ * Deploy Judgment on your own AWS account
207
+ * Store data in your own Supabase instance
208
+ * Access Judgment through your own custom domain
209
+
210
+ ### Getting Started
211
+ 1. Check out our [self-hosting documentation](https://docs.judgmentlabs.ai/self-hosting/get_started) for detailed setup instructions, including how to access your self-hosted instance
212
+ 2. Use the [Judgment CLI](https://github.com/JudgmentLabs/judgment-cli) to deploy your self-hosted environment
213
+ 3. After your self-hosted instance is set up, make sure the `JUDGMENT_API_URL` environment variable is set to your self-hosted backend endpoint
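For instance (the URL below is a placeholder for your own backend endpoint):

```shell
# Placeholder endpoint — replace with your actual self-hosted backend URL
export JUDGMENT_API_URL="https://judgment.your-company.example.com"
```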
214
+
215
+ ## 📚 Cookbooks
216
+
217
+ You can access our repo of cookbooks [here](https://github.com/JudgmentLabs/judgment-cookbook). Here are some highlights:
218
+
219
+ Have your own? We're happy to feature it if you create a PR or message us on [Discord](https://discord.gg/taAufyhf).
220
+
221
+ ### Sample Agents
222
+
223
+ #### 💰 [LangGraph Financial QA Agent](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/financial_agent/demo.py)
224
+ A LangGraph-based agent for financial queries that uses RAG with a vector database for contextual data retrieval, with evaluation of its reasoning and data accuracy.
225
+
226
+ #### ✈️ [OpenAI Travel Agent](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/openai_travel_agent/agent.py)
227
+ A travel planning agent using OpenAI API calls, custom tool functions, and RAG with a vector database for up-to-date and contextual travel information. Evaluated for itinerary quality and information relevance.
228
+
229
+ ### Custom Evaluators
230
+
231
+ #### 🔍 [PII Detection](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/classifier_scorer/pii_checker.py)
232
+ Detects and evaluates Personally Identifiable Information (PII) leakage.
233
+
234
+ #### 📧 [Cold Email Generation](https://github.com/JudgmentLabs/judgment-cookbook/blob/main/cookbooks/custom_scorers/cold_email_scorer.py)
235
+
236
+ Evaluates whether a cold email generator properly uses all relevant information about the target recipient.
237
+
238
+ ## 💻 Development with Cursor
239
+ When building agents and LLM workflows in Cursor, providing proper context to your coding assistant helps ensure seamless integration with Judgment. The rule file below supplies the essential context your coding assistant needs for a successful implementation.
240
+
241
+ To use this rule file, copy the text below and save it in a `.cursor/rules` directory in your project's root directory.
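A minimal sketch of that setup (the `judgment.mdc` filename is just an example):

```shell
# Create the rules directory at the project root
mkdir -p .cursor/rules
# Save the rule text into a file there (filename is an example)
cat > .cursor/rules/judgment.mdc <<'EOF'
# ...paste the rule file contents from below...
EOF
```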
242
+
243
+ <details>
244
+
245
+ <summary>Cursor Rule File</summary>
246
+
247
+ ````
248
+ ---
249
+ You are an expert in helping users integrate Judgment with their codebase. When you are helping someone integrate Judgment tracing or evaluations with their agents/workflows, refer to this file.
250
+ ---
251
+
252
+ # Common Questions You May Get from the User (and How to Handle These Cases):
253
+
254
+ ## Sample Agent 1:
255
+ ```
256
+ from uuid import uuid4
257
+ import openai
258
+ import os
259
+ import asyncio
260
+ from tavily import TavilyClient
261
+ from dotenv import load_dotenv
262
+ import chromadb
263
+ from chromadb.utils import embedding_functions
264
+
265
+ destinations_data = [
266
+ {
267
+ "destination": "Paris, France",
268
+ "information": """
269
+ Paris is the capital city of France and a global center for art, fashion, and culture.
270
+ Key Information:
271
+ - Best visited during spring (March-May) or fall (September-November)
272
+ - Famous landmarks: Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, Arc de Triomphe
273
+ - Known for: French cuisine, café culture, fashion, art galleries
274
+ - Local transportation: Metro system is extensive and efficient
275
+ - Popular neighborhoods: Le Marais, Montmartre, Latin Quarter
276
+ - Cultural tips: Basic French phrases are appreciated; many restaurants close between lunch and dinner
277
+ - Must-try experiences: Seine River cruise, visiting local bakeries, Luxembourg Gardens
278
+ """
279
+ },
280
+ {
281
+ "destination": "Tokyo, Japan",
282
+ "information": """
283
+ Tokyo is Japan's bustling capital, blending ultramodern and traditional elements.
284
+ Key Information:
285
+ - Best visited during spring (cherry blossoms) or fall (autumn colors)
286
+ - Famous areas: Shibuya, Shinjuku, Harajuku, Akihabara
287
+ - Known for: Technology, anime culture, sushi, efficient public transport
288
+ - Local transportation: Extensive train and subway network
289
+ - Cultural tips: Bow when greeting, remove shoes indoors, no tipping
290
+ - Must-try experiences: Robot Restaurant, teamLab Borderless, Tsukiji Outer Market
291
+ - Popular day trips: Mount Fuji, Kamakura, Nikko
292
+ """
293
+ },
294
+ {
295
+ "destination": "New York City, USA",
296
+ "information": """
297
+ New York City is a global metropolis known for its diversity, culture, and iconic skyline.
298
+ Key Information:
299
+ - Best visited during spring (April-June) or fall (September-November)
300
+ - Famous landmarks: Statue of Liberty, Times Square, Central Park, Empire State Building
301
+ - Known for: Broadway shows, diverse cuisine, shopping, museums
302
+ - Local transportation: Extensive subway system, yellow cabs, ride-sharing
303
+ - Popular areas: Manhattan, Brooklyn, Queens
304
+ - Cultural tips: Fast-paced environment, tipping expected (15-20%)
305
+ - Must-try experiences: Broadway show, High Line walk, food tours
306
+ """
307
+ },
308
+ {
309
+ "destination": "Barcelona, Spain",
310
+ "information": """
311
+ Barcelona is a vibrant city known for its art, architecture, and Mediterranean culture.
312
+ Key Information:
313
+ - Best visited during spring and fall for mild weather
314
+ - Famous landmarks: Sagrada Familia, Park Güell, Casa Batlló
315
+ - Known for: Gaudi architecture, tapas, beach culture, FC Barcelona
316
+ - Local transportation: Metro, buses, and walkable city center
317
+ - Popular areas: Gothic Quarter, Eixample, La Barceloneta
318
+ - Cultural tips: Late dinner times (after 8 PM), siesta tradition
319
+ - Must-try experiences: La Rambla walk, tapas crawl, local markets
320
+ """
321
+ },
322
+ {
323
+ "destination": "Bangkok, Thailand",
324
+ "information": """
325
+ Bangkok is Thailand's capital city, famous for its temples, street food, and vibrant culture.
326
+ Key Information:
327
+ - Best visited during November to February (cool and dry season)
328
+ - Famous sites: Grand Palace, Wat Phra Kaew, Wat Arun
329
+ - Known for: Street food, temples, markets, nightlife
330
+ - Local transportation: BTS Skytrain, MRT, tuk-tuks, river boats
331
+ - Popular areas: Sukhumvit, Old City, Chinatown
332
+ - Cultural tips: Dress modestly at temples, respect royal family
333
+ - Must-try experiences: Street food tours, river cruises, floating markets
334
+ """
335
+ }
336
+ ]
337
+
338
+ client = openai.Client(api_key=os.getenv("OPENAI_API_KEY"))
339
+
340
+ def populate_vector_db(collection, destinations_data):
341
+ """
342
+ Populate the vector DB with travel information.
343
+ destinations_data should be a list of dictionaries with 'destination' and 'information' keys
344
+ """
345
+ for data in destinations_data:
346
+ collection.add(
347
+ documents=[data['information']],
348
+ metadatas=[{"destination": data['destination']}],
349
+ ids=[f"destination_{data['destination'].lower().replace(' ', '_')}"]
350
+ )
351
+
352
+ def search_tavily(query):
353
+ """Fetch travel data using Tavily API."""
354
+ API_KEY = os.getenv("TAVILY_API_KEY")
355
+ client = TavilyClient(api_key=API_KEY)
356
+ results = client.search(query, max_results=3)  # Tavily's search takes max_results
357
+ return results
358
+
359
+ async def get_attractions(destination):
360
+ """Search for top attractions in the destination."""
361
+ prompt = f"Best tourist attractions in {destination}"
362
+ attractions_search = search_tavily(prompt)
363
+ return attractions_search
364
+
365
+ async def get_hotels(destination):
366
+ """Search for hotels in the destination."""
367
+ prompt = f"Best hotels in {destination}"
368
+ hotels_search = search_tavily(prompt)
369
+ return hotels_search
370
+
371
+ async def get_flights(destination):
372
+ """Search for flights to the destination."""
373
+ prompt = f"Flights to {destination} from major cities"
374
+ flights_search = search_tavily(prompt)
375
+ return flights_search
376
+
377
+ async def get_weather(destination, start_date, end_date):
378
+ """Search for weather information."""
379
+ prompt = f"Weather forecast for {destination} from {start_date} to {end_date}"
380
+ weather_search = search_tavily(prompt)
381
+ return weather_search
382
+
383
+ def initialize_vector_db():
384
+ """Initialize ChromaDB with OpenAI embeddings."""
385
+ client = chromadb.Client()
386
+ embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
387
+ api_key=os.getenv("OPENAI_API_KEY"),
388
+ model_name="text-embedding-3-small"
389
+ )
390
+ res = client.get_or_create_collection(
391
+ "travel_information",
392
+ embedding_function=embedding_fn
393
+ )
394
+ populate_vector_db(res, destinations_data)
395
+ return res
396
+
397
+ def query_vector_db(collection, destination, k=3):
398
+ """Query the vector database for existing travel information."""
399
+ try:
400
+ results = collection.query(
401
+ query_texts=[destination],
402
+ n_results=k
403
+ )
404
+ return results['documents'][0] if results['documents'] else []
405
+ except Exception:
406
+ return []
407
+
408
+ async def research_destination(destination, start_date, end_date):
409
+ """Gather all necessary travel information for a destination."""
410
+ # First, check the vector database
411
+ collection = initialize_vector_db()
412
+ existing_info = query_vector_db(collection, destination)
413
+
414
+ # Get real-time information from Tavily
415
+ tavily_data = {
416
+ "attractions": await get_attractions(destination),
417
+ "hotels": await get_hotels(destination),
418
+ "flights": await get_flights(destination),
419
+ "weather": await get_weather(destination, start_date, end_date)
420
+ }
421
+
422
+ return {
423
+ "vector_db_results": existing_info,
424
+ **tavily_data
425
+ }
426
+
427
+ async def create_travel_plan(destination, start_date, end_date, research_data):
428
+ """Generate a travel itinerary using the researched data."""
429
+ vector_db_context = "\n".join(research_data['vector_db_results']) if research_data['vector_db_results'] else "No pre-stored information available."
430
+
431
+ prompt = f"""
432
+ Create a structured travel itinerary for a trip to {destination} from {start_date} to {end_date}.
433
+
434
+ Pre-stored destination information:
435
+ {vector_db_context}
436
+
437
+ Current travel data:
438
+ - Attractions: {research_data['attractions']}
439
+ - Hotels: {research_data['hotels']}
440
+ - Flights: {research_data['flights']}
441
+ - Weather: {research_data['weather']}
442
+ """
443
+
444
+ response = client.chat.completions.create(
445
+ model="gpt-4.1",
446
+ messages=[
447
+ {"role": "system", "content": "You are an expert travel planner. Combine both historical and current information to create the best possible itinerary."},
448
+ {"role": "user", "content": prompt}
449
+ ]
450
+ ).choices[0].message.content
451
+
452
+ return response
453
+
454
+ async def generate_itinerary(destination, start_date, end_date):
455
+ """Main function to generate a travel itinerary."""
456
+ research_data = await research_destination(destination, start_date, end_date)
457
+ res = await create_travel_plan(destination, start_date, end_date, research_data)
458
+ return res
459
+
460
+
461
+ if __name__ == "__main__":
462
+ load_dotenv()
463
+ destination = input("Enter your travel destination: ")
464
+ start_date = input("Enter start date (YYYY-MM-DD): ")
465
+ end_date = input("Enter end date (YYYY-MM-DD): ")
466
+ itinerary = asyncio.run(generate_itinerary(destination, start_date, end_date))
467
+ print("\nGenerated Itinerary:\n", itinerary)
468
+ ```
469
+
470
+ ## Sample Query 1:
471
+ Can you add Judgment tracing to my file?
472
+
473
+ ## Example of Modified Code after Query 1:
474
+ ```
475
+ from uuid import uuid4
476
+ import openai
477
+ import os
478
+ import asyncio
479
+ from tavily import TavilyClient
480
+ from dotenv import load_dotenv
481
+ import chromadb
482
+ from chromadb.utils import embedding_functions
483
+
484
+ from judgeval.tracer import Tracer, wrap
485
+ from judgeval.scorers import AnswerRelevancyScorer, FaithfulnessScorer
486
+ from judgeval.data import Example
487
+
488
+ destinations_data = [
489
+ {
490
+ "destination": "Paris, France",
491
+ "information": """
492
+ Paris is the capital city of France and a global center for art, fashion, and culture.
493
+ Key Information:
494
+ - Best visited during spring (March-May) or fall (September-November)
495
+ - Famous landmarks: Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, Arc de Triomphe
496
+ - Known for: French cuisine, café culture, fashion, art galleries
497
+ - Local transportation: Metro system is extensive and efficient
498
+ - Popular neighborhoods: Le Marais, Montmartre, Latin Quarter
499
+ - Cultural tips: Basic French phrases are appreciated; many restaurants close between lunch and dinner
500
+ - Must-try experiences: Seine River cruise, visiting local bakeries, Luxembourg Gardens
501
+ """
502
+ },
503
+ {
504
+ "destination": "Tokyo, Japan",
505
+ "information": """
506
+ Tokyo is Japan's bustling capital, blending ultramodern and traditional elements.
507
+ Key Information:
508
+ - Best visited during spring (cherry blossoms) or fall (autumn colors)
509
+ - Famous areas: Shibuya, Shinjuku, Harajuku, Akihabara
510
+ - Known for: Technology, anime culture, sushi, efficient public transport
511
+ - Local transportation: Extensive train and subway network
512
+ - Cultural tips: Bow when greeting, remove shoes indoors, no tipping
513
+ - Must-try experiences: Robot Restaurant, teamLab Borderless, Tsukiji Outer Market
514
+ - Popular day trips: Mount Fuji, Kamakura, Nikko
515
+ """
516
+ },
517
+ {
518
+ "destination": "New York City, USA",
519
+ "information": """
520
+ New York City is a global metropolis known for its diversity, culture, and iconic skyline.
521
+ Key Information:
522
+ - Best visited during spring (April-June) or fall (September-November)
523
+ - Famous landmarks: Statue of Liberty, Times Square, Central Park, Empire State Building
524
+ - Known for: Broadway shows, diverse cuisine, shopping, museums
525
+ - Local transportation: Extensive subway system, yellow cabs, ride-sharing
526
+ - Popular areas: Manhattan, Brooklyn, Queens
527
+ - Cultural tips: Fast-paced environment, tipping expected (15-20%)
528
+ - Must-try experiences: Broadway show, High Line walk, food tours
529
+ """
530
+ },
531
+ {
532
+ "destination": "Barcelona, Spain",
533
+ "information": """
534
+ Barcelona is a vibrant city known for its art, architecture, and Mediterranean culture.
535
+ Key Information:
536
+ - Best visited during spring and fall for mild weather
537
+ - Famous landmarks: Sagrada Familia, Park Güell, Casa Batlló
538
+ - Known for: Gaudi architecture, tapas, beach culture, FC Barcelona
539
+ - Local transportation: Metro, buses, and walkable city center
540
+ - Popular areas: Gothic Quarter, Eixample, La Barceloneta
541
+ - Cultural tips: Late dinner times (after 8 PM), siesta tradition
542
+ - Must-try experiences: La Rambla walk, tapas crawl, local markets
543
+ """
544
+ },
545
+ {
546
+ "destination": "Bangkok, Thailand",
547
+ "information": """
548
+ Bangkok is Thailand's capital city, famous for its temples, street food, and vibrant culture.
549
+ Key Information:
550
+ - Best visited during November to February (cool and dry season)
551
+ - Famous sites: Grand Palace, Wat Phra Kaew, Wat Arun
552
+ - Known for: Street food, temples, markets, nightlife
553
+ - Local transportation: BTS Skytrain, MRT, tuk-tuks, river boats
554
+ - Popular areas: Sukhumvit, Old City, Chinatown
555
+ - Cultural tips: Dress modestly at temples, respect royal family
556
+ - Must-try experiences: Street food tours, river cruises, floating markets
557
+ """
558
+ }
559
+ ]
560
+
561
+ client = wrap(openai.Client(api_key=os.getenv("OPENAI_API_KEY")))
562
+ judgment = Tracer(api_key=os.getenv("JUDGMENT_API_KEY"), project_name="travel_agent_demo")
563
+
564
+ def populate_vector_db(collection, destinations_data):
565
+ """
566
+ Populate the vector DB with travel information.
567
+ destinations_data should be a list of dictionaries with 'destination' and 'information' keys
568
+ """
569
+ for data in destinations_data:
570
+ collection.add(
571
+ documents=[data['information']],
572
+ metadatas=[{"destination": data['destination']}],
573
+ ids=[f"destination_{data['destination'].lower().replace(' ', '_')}"]
574
+ )
575
+
576
+ @judgment.observe(span_type="search_tool")
577
+ def search_tavily(query):
578
+ """Fetch travel data using Tavily API."""
579
+ API_KEY = os.getenv("TAVILY_API_KEY")
580
+ client = TavilyClient(api_key=API_KEY)
581
+ results = client.search(query, max_results=3)  # Tavily's search takes max_results
582
+ return results
583
+
584
+ @judgment.observe(span_type="tool")
585
+ async def get_attractions(destination):
586
+ """Search for top attractions in the destination."""
587
+ prompt = f"Best tourist attractions in {destination}"
588
+ attractions_search = search_tavily(prompt)
589
+ return attractions_search
590
+
591
+ @judgment.observe(span_type="tool")
592
+ async def get_hotels(destination):
593
+ """Search for hotels in the destination."""
594
+ prompt = f"Best hotels in {destination}"
595
+ hotels_search = search_tavily(prompt)
596
+ return hotels_search
597
+
598
+ @judgment.observe(span_type="tool")
599
+ async def get_flights(destination):
600
+ """Search for flights to the destination."""
601
+ prompt = f"Flights to {destination} from major cities"
602
+ flights_search = search_tavily(prompt)
603
+ example = Example(
604
+ input=prompt,
605
+ actual_output=str(flights_search["results"])
606
+ )
607
+ judgment.async_evaluate(
608
+ scorers=[AnswerRelevancyScorer(threshold=0.5)],
609
+ example=example,
610
+ model="gpt-4.1"
611
+ )
612
+ return flights_search
613
+
614
+ @judgment.observe(span_type="tool")
615
+ async def get_weather(destination, start_date, end_date):
616
+ """Search for weather information."""
617
+ prompt = f"Weather forecast for {destination} from {start_date} to {end_date}"
618
+ weather_search = search_tavily(prompt)
619
+ example = Example(
620
+ input=prompt,
621
+ actual_output=str(weather_search["results"])
622
+ )
623
+ judgment.async_evaluate(
624
+ scorers=[AnswerRelevancyScorer(threshold=0.5)],
625
+ example=example,
626
+ model="gpt-4.1"
627
+ )
628
+ return weather_search
629
+
630
+ def initialize_vector_db():
631
+ """Initialize ChromaDB with OpenAI embeddings."""
632
+ client = chromadb.Client()
633
+ embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
634
+ api_key=os.getenv("OPENAI_API_KEY"),
635
+ model_name="text-embedding-3-small"
636
+ )
637
+ res = client.get_or_create_collection(
638
+ "travel_information",
639
+ embedding_function=embedding_fn
640
+ )
641
+ populate_vector_db(res, destinations_data)
642
+ return res
643
+
644
+ @judgment.observe(span_type="retriever")
645
+ def query_vector_db(collection, destination, k=3):
646
+ """Query the vector database for existing travel information."""
647
+ try:
648
+ results = collection.query(
649
+ query_texts=[destination],
650
+ n_results=k
651
+ )
652
+ return results['documents'][0] if results['documents'] else []
653
+ except Exception:
654
+ return []
655
+
656
+ @judgment.observe(span_type="Research")
657
+ async def research_destination(destination, start_date, end_date):
658
+ """Gather all necessary travel information for a destination."""
659
+ # First, check the vector database
660
+ collection = initialize_vector_db()
661
+ existing_info = query_vector_db(collection, destination)
662
+
663
+ # Get real-time information from Tavily
664
+ tavily_data = {
665
+ "attractions": await get_attractions(destination),
666
+ "hotels": await get_hotels(destination),
667
+ "flights": await get_flights(destination),
668
+ "weather": await get_weather(destination, start_date, end_date)
669
+ }
670
+
671
+ return {
672
+ "vector_db_results": existing_info,
673
+ **tavily_data
674
+ }
675
+
676
+ @judgment.observe(span_type="function")
677
+ async def create_travel_plan(destination, start_date, end_date, research_data):
678
+ """Generate a travel itinerary using the researched data."""
679
+ vector_db_context = "\n".join(research_data['vector_db_results']) if research_data['vector_db_results'] else "No pre-stored information available."
680
+
681
+ prompt = f"""
682
+ Create a structured travel itinerary for a trip to {destination} from {start_date} to {end_date}.
683
+
684
+ Pre-stored destination information:
685
+ {vector_db_context}
686
+
687
+ Current travel data:
688
+ - Attractions: {research_data['attractions']}
689
+ - Hotels: {research_data['hotels']}
690
+ - Flights: {research_data['flights']}
691
+ - Weather: {research_data['weather']}
692
+ """
693
+
694
+ response = client.chat.completions.create(
695
+ model="gpt-4.1",
696
+ messages=[
697
+ {"role": "system", "content": "You are an expert travel planner. Combine both historical and current information to create the best possible itinerary."},
698
+ {"role": "user", "content": prompt}
699
+ ]
700
+ ).choices[0].message.content
701
+
702
+ example = Example(
703
+ input=prompt,
704
+ actual_output=str(response),
705
+ retrieval_context=[str(vector_db_context), str(research_data)]
706
+ )
707
+ judgment.async_evaluate(
708
+ scorers=[FaithfulnessScorer(threshold=0.5)],
709
+ example=example,
710
+ model="gpt-4.1"
711
+ )
712
+
713
+ return response
714
+
715
+ @judgment.observe(span_type="function")
716
+ async def generate_itinerary(destination, start_date, end_date):
717
+ """Main function to generate a travel itinerary."""
718
+ research_data = await research_destination(destination, start_date, end_date)
719
+ res = await create_travel_plan(destination, start_date, end_date, research_data)
720
+ return res
721
+
722
+
723
+ if __name__ == "__main__":
724
+ load_dotenv()
725
+ destination = input("Enter your travel destination: ")
726
+ start_date = input("Enter start date (YYYY-MM-DD): ")
727
+ end_date = input("Enter end date (YYYY-MM-DD): ")
728
+ itinerary = asyncio.run(generate_itinerary(destination, start_date, end_date))
729
+ print("\nGenerated Itinerary:\n", itinerary)
730
+ ```
731
+
732
+ ## Sample Agent 2
733
+ ```python
734
+ from langchain_openai import ChatOpenAI
735
+ import asyncio
736
+ import os
737
+
738
+ import chromadb
739
+ from chromadb.utils import embedding_functions
740
+
741
+ from vectordbdocs import financial_data
742
+
743
+ from typing import Optional
744
+ from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage, ChatMessage
745
+ from typing_extensions import TypedDict
746
+ from langgraph.graph import StateGraph
747
+
748
+ # Define our state type
749
+ class AgentState(TypedDict):
750
+ messages: list[BaseMessage]
751
+ category: Optional[str]
752
+ documents: Optional[list[str]]
753
+
754
+ def populate_vector_db(collection, raw_data):
755
+ """
756
+ Populate the vector DB with financial information.
757
+ """
758
+ for data in raw_data:
759
+ collection.add(
760
+ documents=[data['information']],
761
+ metadatas=[{"category": data['category']}],
762
+ ids=[f"category_{data['category'].lower().replace(' ', '_')}_{os.urandom(4).hex()}"]
763
+ )
764
+
765
+ # Define a ChromaDB collection for document storage
766
+ client = chromadb.Client()
767
+ collection = client.get_or_create_collection(
768
+ name="financial_docs",
769
+ embedding_function=embedding_functions.OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))
770
+ )
771
+
772
+ populate_vector_db(collection, financial_data)
773
+
774
+ def pnl_retriever(state: AgentState) -> AgentState:
775
+ query = state["messages"][-1].content
776
+ results = collection.query(
777
+ query_texts=[query],
778
+ where={"category": "pnl"},
779
+ n_results=3
780
+ )
781
+ documents = []
782
+ for document in results["documents"]:
783
+ documents += document
784
+
785
+ return {"messages": state["messages"], "documents": documents}
786
+
787
+ def balance_sheet_retriever(state: AgentState) -> AgentState:
788
+ query = state["messages"][-1].content
789
+ results = collection.query(
790
+ query_texts=[query],
791
+ where={"category": "balance_sheets"},
792
+ n_results=3
793
+ )
794
+ documents = []
795
+ for document in results["documents"]:
796
+ documents += document
797
+
798
+ return {"messages": state["messages"], "documents": documents}
799
+
800
+ def stock_retriever(state: AgentState) -> AgentState:
801
+ query = state["messages"][-1].content
802
+ results = collection.query(
803
+ query_texts=[query],
804
+ where={"category": "stocks"},
805
+ n_results=3
806
+ )
807
+ documents = []
808
+ for document in results["documents"]:
809
+ documents += document
810
+
811
+ return {"messages": state["messages"], "documents": documents}
812
+
813
+ async def bad_classifier(state: AgentState) -> AgentState:
814
+ return {"messages": state["messages"], "category": "stocks"}
815
+
816
+ async def bad_classify(state: AgentState) -> AgentState:
817
+ category = await bad_classifier(state)
818
+
819
+ return {"messages": state["messages"], "category": category["category"]}
820
+
821
+ async def bad_sql_generator(state: AgentState) -> AgentState:
822
+ ACTUAL_OUTPUT = "SELECT * FROM pnl WHERE stock_symbol = 'GOOGL'"
823
+ return {"messages": state["messages"] + [ChatMessage(content=ACTUAL_OUTPUT, role="text2sql")]}
824
+
825
+ # Create the classifier node with a system prompt
826
+ async def classify(state: AgentState) -> AgentState:
827
+ messages = state["messages"]
828
+ input_msg = [
829
+ SystemMessage(content="""You are a financial query classifier. Your job is to classify user queries into one of three categories:
830
+ - 'pnl' for Profit and Loss related queries
831
+ - 'balance_sheets' for Balance Sheet related queries
832
+ - 'stocks' for Stock market related queries
833
+
834
+ Respond ONLY with the category name in lowercase, nothing else."""),
835
+ *messages
836
+ ]
837
+
838
+ response = ChatOpenAI(model="gpt-4.1", temperature=0).invoke(
839
+ input=input_msg
840
+ )
841
+
842
+ return {"messages": state["messages"], "category": response.content}
843
+
844
+ # Add router node to direct flow based on classification
845
+ def router(state: AgentState) -> str:
846
+ return state["category"]
847
+
848
+ async def generate_response(state: AgentState) -> AgentState:
849
+ messages = state["messages"]
850
+ documents = state.get("documents", [])
851
+
852
+ OUTPUT = """
853
+ SELECT
854
+ stock_symbol,
855
+ SUM(CASE WHEN transaction_type = 'buy' THEN quantity ELSE -quantity END) AS total_shares,
856
+ SUM(CASE WHEN transaction_type = 'buy' THEN quantity * price_per_share ELSE -quantity * price_per_share END) AS total_cost,
857
+ MAX(CASE WHEN transaction_type = 'buy' THEN price_per_share END) AS current_market_price
858
+ FROM
859
+ stock_transactions
860
+ WHERE
861
+ stock_symbol = 'META'
862
+ GROUP BY
863
+ stock_symbol;
864
+ """
865
+
866
+ return {"messages": messages + [ChatMessage(content=OUTPUT, role="text2sql")], "documents": documents}
867
+
868
+ async def main():
869
+ # Initialize the graph
870
+ graph_builder = StateGraph(AgentState)
871
+
872
+ # Add classifier node
873
+ # For failure test, pass in bad_classifier
874
+ graph_builder.add_node("classifier", classify)
875
+ # graph_builder.add_node("classifier", bad_classify)
876
+
877
+ # Add conditional edges based on classification
878
+ graph_builder.add_conditional_edges(
879
+ "classifier",
880
+ router,
881
+ {
882
+ "pnl": "pnl_retriever",
883
+ "balance_sheets": "balance_sheet_retriever",
884
+ "stocks": "stock_retriever"
885
+ }
886
+ )
887
+
888
+ # Add retriever nodes (placeholder functions for now)
889
+ graph_builder.add_node("pnl_retriever", pnl_retriever)
890
+ graph_builder.add_node("balance_sheet_retriever", balance_sheet_retriever)
891
+ graph_builder.add_node("stock_retriever", stock_retriever)
892
+
893
+ # Add edges from retrievers to response generator
894
+ graph_builder.add_node("response_generator", generate_response)
895
+ # graph_builder.add_node("response_generator", bad_sql_generator)
896
+ graph_builder.add_edge("pnl_retriever", "response_generator")
897
+ graph_builder.add_edge("balance_sheet_retriever", "response_generator")
898
+ graph_builder.add_edge("stock_retriever", "response_generator")
899
+
900
+ graph_builder.set_entry_point("classifier")
901
+ graph_builder.set_finish_point("response_generator")
902
+
903
+ # Compile the graph
904
+ graph = graph_builder.compile()
905
+
906
+ response = await graph.ainvoke({
907
+ "messages": [HumanMessage(content="Please calculate our PNL on Apple stock. Refer to table information from documents provided.")],
908
+ "category": None,
909
+ })
910
+
911
+ print(f"Response: {response['messages'][-1].content}")
912
+
913
+ if __name__ == "__main__":
914
+ asyncio.run(main())
915
+ ```
916
+
917
+ ## Sample Query 2:
918
+ Can you add Judgment tracing to my file?
919
+
920
+ ## Example of Modified Code after Query 2:
921
+ ```python
922
+ from langchain_openai import ChatOpenAI
923
+ import asyncio
924
+ import os
925
+
926
+ import chromadb
927
+ from chromadb.utils import embedding_functions
928
+
929
+ from vectordbdocs import financial_data
930
+
931
+ from typing import Optional
932
+ from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage, ChatMessage
933
+ from typing_extensions import TypedDict
934
+ from langgraph.graph import StateGraph
935
+
936
+ from judgeval.common.tracer import Tracer
937
+ from judgeval.integrations.langgraph import JudgevalCallbackHandler
938
+ from judgeval.scorers import AnswerCorrectnessScorer, FaithfulnessScorer
939
+ from judgeval.data import Example
940
+
941
+
942
+
943
+ judgment = Tracer(project_name="FINANCIAL_AGENT")
944
+
945
+ # Define our state type
946
+ class AgentState(TypedDict):
947
+ messages: list[BaseMessage]
948
+ category: Optional[str]
949
+ documents: Optional[list[str]]
950
+
951
+ def populate_vector_db(collection, raw_data):
952
+ """
953
+ Populate the vector DB with financial information.
954
+ """
955
+ for data in raw_data:
956
+ collection.add(
957
+ documents=[data['information']],
958
+ metadatas=[{"category": data['category']}],
959
+ ids=[f"category_{data['category'].lower().replace(' ', '_')}_{os.urandom(4).hex()}"]
960
+ )
961
+
962
+ # Define a ChromaDB collection for document storage
963
+ client = chromadb.Client()
964
+ collection = client.get_or_create_collection(
965
+ name="financial_docs",
966
+ embedding_function=embedding_functions.OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))
967
+ )
968
+
969
+ populate_vector_db(collection, financial_data)
970
+
971
+ @judgment.observe(name="pnl_retriever", span_type="retriever")
972
+ def pnl_retriever(state: AgentState) -> AgentState:
973
+ query = state["messages"][-1].content
974
+ results = collection.query(
975
+ query_texts=[query],
976
+ where={"category": "pnl"},
977
+ n_results=3
978
+ )
979
+ documents = []
980
+ for document in results["documents"]:
981
+ documents += document
982
+
983
+ return {"messages": state["messages"], "documents": documents}
984
+
985
+ @judgment.observe(name="balance_sheet_retriever", span_type="retriever")
986
+ def balance_sheet_retriever(state: AgentState) -> AgentState:
987
+ query = state["messages"][-1].content
988
+ results = collection.query(
989
+ query_texts=[query],
990
+ where={"category": "balance_sheets"},
991
+ n_results=3
992
+ )
993
+ documents = []
994
+ for document in results["documents"]:
995
+ documents += document
996
+
997
+ return {"messages": state["messages"], "documents": documents}
998
+
999
+ @judgment.observe(name="stock_retriever", span_type="retriever")
1000
+ def stock_retriever(state: AgentState) -> AgentState:
1001
+ query = state["messages"][-1].content
1002
+ results = collection.query(
1003
+ query_texts=[query],
1004
+ where={"category": "stocks"},
1005
+ n_results=3
1006
+ )
1007
+ documents = []
1008
+ for document in results["documents"]:
1009
+ documents += document
1010
+
1011
+ return {"messages": state["messages"], "documents": documents}
1012
+
1013
+ @judgment.observe(name="bad_classifier", span_type="llm")
1014
+ async def bad_classifier(state: AgentState) -> AgentState:
1015
+ return {"messages": state["messages"], "category": "stocks"}
1016
+
1017
+ @judgment.observe(name="bad_classify")
1018
+ async def bad_classify(state: AgentState) -> AgentState:
1019
+ category = await bad_classifier(state)
1020
+
1021
+ example = Example(
1022
+ input=state["messages"][-1].content,
1023
+ actual_output=category["category"],
1024
+ expected_output="pnl"
1025
+ )
1026
+ judgment.async_evaluate(
1027
+ scorers=[AnswerCorrectnessScorer(threshold=1)],
1028
+ example=example,
1029
+ model="gpt-4.1"
1030
+ )
1031
+
1032
+ return {"messages": state["messages"], "category": category["category"]}
1033
+
1034
+ @judgment.observe(name="bad_sql_generator", span_type="llm")
1035
+ async def bad_sql_generator(state: AgentState) -> AgentState:
1036
+ ACTUAL_OUTPUT = "SELECT * FROM pnl WHERE stock_symbol = 'GOOGL'"
1037
+
1038
+ example = Example(
1039
+ input=state["messages"][-1].content,
1040
+ actual_output=ACTUAL_OUTPUT,
1041
+ retrieval_context=state.get("documents", []),
1042
+ expected_output="""
1043
+ SELECT
1044
+ SUM(CASE
1045
+ WHEN transaction_type = 'sell' THEN (price_per_share - (SELECT price_per_share FROM stock_transactions WHERE stock_symbol = 'GOOGL' AND transaction_type = 'buy' LIMIT 1)) * quantity
1046
+ ELSE 0
1047
+ END) AS realized_pnl
1048
+ FROM
1049
+ stock_transactions
1050
+ WHERE
1051
+ stock_symbol = 'META';
1052
+ """
1053
+ )
1054
+ judgment.async_evaluate(
1055
+ scorers=[AnswerCorrectnessScorer(threshold=1), FaithfulnessScorer(threshold=1)],
1056
+ example=example,
1057
+ model="gpt-4.1"
1058
+ )
1059
+ return {"messages": state["messages"] + [ChatMessage(content=ACTUAL_OUTPUT, role="text2sql")]}
1060
+
1061
+ # Create the classifier node with a system prompt
1062
+ @judgment.observe(name="classify")
1063
+ async def classify(state: AgentState) -> AgentState:
1064
+ messages = state["messages"]
1065
+ input_msg = [
1066
+ SystemMessage(content="""You are a financial query classifier. Your job is to classify user queries into one of three categories:
1067
+ - 'pnl' for Profit and Loss related queries
1068
+ - 'balance_sheets' for Balance Sheet related queries
1069
+ - 'stocks' for Stock market related queries
1070
+
1071
+ Respond ONLY with the category name in lowercase, nothing else."""),
1072
+ *messages
1073
+ ]
1074
+
1075
+ response = ChatOpenAI(model="gpt-4.1", temperature=0).invoke(
1076
+ input=input_msg
1077
+ )
1078
+
1079
+ example = Example(
1080
+ input=str(input_msg),
1081
+ actual_output=response.content,
1082
+ expected_output="pnl"
1083
+ )
1084
+ judgment.async_evaluate(
1085
+ scorers=[AnswerCorrectnessScorer(threshold=1)],
1086
+ example=example,
1087
+ model="gpt-4.1"
1088
+ )
1089
+
1090
+ return {"messages": state["messages"], "category": response.content}
1091
+
1092
+ # Add router node to direct flow based on classification
1093
+ def router(state: AgentState) -> str:
1094
+ return state["category"]
1095
+
1096
+ @judgment.observe(name="generate_response")
1097
+ async def generate_response(state: AgentState) -> AgentState:
1098
+ messages = state["messages"]
1099
+ documents = state.get("documents", [])
1100
+
1101
+ OUTPUT = """
1102
+ SELECT
1103
+ stock_symbol,
1104
+ SUM(CASE WHEN transaction_type = 'buy' THEN quantity ELSE -quantity END) AS total_shares,
1105
+ SUM(CASE WHEN transaction_type = 'buy' THEN quantity * price_per_share ELSE -quantity * price_per_share END) AS total_cost,
1106
+ MAX(CASE WHEN transaction_type = 'buy' THEN price_per_share END) AS current_market_price
1107
+ FROM
1108
+ stock_transactions
1109
+ WHERE
1110
+ stock_symbol = 'META'
1111
+ GROUP BY
1112
+ stock_symbol;
1113
+ """
1114
+
1115
+ example = Example(
1116
+ input=messages[-1].content,
1117
+ actual_output=OUTPUT,
1118
+ retrieval_context=documents,
1119
+ expected_output="""
1120
+ SELECT
1121
+ stock_symbol,
1122
+ SUM(CASE WHEN transaction_type = 'buy' THEN quantity ELSE -quantity END) AS total_shares,
1123
+ SUM(CASE WHEN transaction_type = 'buy' THEN quantity * price_per_share ELSE -quantity * price_per_share END) AS total_cost,
1124
+ MAX(CASE WHEN transaction_type = 'buy' THEN price_per_share END) AS current_market_price
1125
+ FROM
1126
+ stock_transactions
1127
+ WHERE
1128
+ stock_symbol = 'META'
1129
+ GROUP BY
1130
+ stock_symbol;
1131
+ """
1132
+ )
1133
+ judgment.async_evaluate(
1134
+ scorers=[AnswerCorrectnessScorer(threshold=1), FaithfulnessScorer(threshold=1)],
1135
+ example=example,
1136
+ model="gpt-4.1"
1137
+ )
1138
+
1139
+ return {"messages": messages + [ChatMessage(content=OUTPUT, role="text2sql")], "documents": documents}
1140
+
1141
+ async def main():
1142
+ with judgment.trace(
1143
+ "run_1",
1144
+ project_name="FINANCIAL_AGENT",
1145
+ overwrite=True
1146
+ ) as trace:
1147
+
1148
+ # Initialize the graph
1149
+ graph_builder = StateGraph(AgentState)
1150
+
1151
+ # Add classifier node
1152
+ # For failure test, pass in bad_classifier
1153
+ graph_builder.add_node("classifier", classify)
1154
+ # graph_builder.add_node("classifier", bad_classify)
1155
+
1156
+ # Add conditional edges based on classification
1157
+ graph_builder.add_conditional_edges(
1158
+ "classifier",
1159
+ router,
1160
+ {
1161
+ "pnl": "pnl_retriever",
1162
+ "balance_sheets": "balance_sheet_retriever",
1163
+ "stocks": "stock_retriever"
1164
+ }
1165
+ )
1166
+
1167
+ # Add retriever nodes (placeholder functions for now)
1168
+ graph_builder.add_node("pnl_retriever", pnl_retriever)
1169
+ graph_builder.add_node("balance_sheet_retriever", balance_sheet_retriever)
1170
+ graph_builder.add_node("stock_retriever", stock_retriever)
1171
+
1172
+ # Add edges from retrievers to response generator
1173
+ graph_builder.add_node("response_generator", generate_response)
1174
+ # graph_builder.add_node("response_generator", bad_sql_generator)
1175
+ graph_builder.add_edge("pnl_retriever", "response_generator")
1176
+ graph_builder.add_edge("balance_sheet_retriever", "response_generator")
1177
+ graph_builder.add_edge("stock_retriever", "response_generator")
1178
+
1179
+ graph_builder.set_entry_point("classifier")
1180
+ graph_builder.set_finish_point("response_generator")
1181
+
1182
+ # Compile the graph
1183
+ graph = graph_builder.compile()
1184
+
1185
+ handler = JudgevalCallbackHandler(trace)
1186
+
1187
+ response = await graph.ainvoke({
1188
+ "messages": [HumanMessage(content="Please calculate our PNL on Apple stock. Refer to table information from documents provided.")],
1189
+ "category": None,
1190
+ }, config=dict(callbacks=[handler]))
1191
+ trace.save()
1192
+
1193
+ print(f"Response: {response['messages'][-1].content}")
1194
+
1195
+ if __name__ == "__main__":
1196
+ asyncio.run(main())
1197
+ ```
1198
+
1199
+
1200
+ # Official Judgment Documentation
1201
+
1202
+
1203
+ <div align="center">
1204
+
1205
+ <img src="assets/logo-light.svg#gh-light-mode-only" alt="Judgment Logo" width="400" />
1206
+ <img src="assets/logo-dark.svg#gh-dark-mode-only" alt="Judgment Logo" width="400" />
1207
+
1208
+ **Build monitoring & evaluation pipelines for complex agents**
1209
+
1210
+ <img src="assets/experiments_page.png" alt="Judgment Platform Experiments Page" width="800" />
1211
+
1212
+ <br>
1213
+
1214
+ ## @🌐 Landing Page • @Twitter/X • @💼 LinkedIn • @📚 Docs • @🚀 Demos • @🎮 Discord
1215
+ </div>
1216
+
1217
+ ## Judgeval: open-source testing, monitoring, and optimization for AI agents
1218
+
1219
+ Judgeval offers robust tooling for evaluating and tracing LLM agent systems. It is dev-friendly and open-source (licensed under Apache 2.0).
1220
+
1221
+ Judgeval gets you started in five minutes, after which you'll be ready to use all of its features as your agent becomes more complex. Judgeval is natively connected to the @Judgment Platform for free, and you can export your data and self-host at any time.
1222
+
1223
+ We support tracing agents built with LangGraph, OpenAI SDK, Anthropic, ... and allow custom eval integrations for any use case. Check out our quickstarts below or our @setup guide to get started.
1224
+
1225
+ Judgeval is created and maintained by @Judgment Labs.
1226
+
1227
+ ## 📋 Table of Contents
1228
+ * @✨ Features
1229
+ * @🔍 Tracing
1230
+ * @🧪 Evals
1231
+ * @📡 Monitoring
1232
+ * @📊 Datasets
1233
+ * @💡 Insights
1234
+ * @🛠️ Installation
1235
+ * @🏁 Get Started
1236
+ * @🏢 Self-Hosting
1237
+ * @📚 Cookbooks
1238
+ * @⭐ Star Us on GitHub
1239
+ * @❤️ Contributors
1240
+
1241
+ <!-- Created by https://github.com/ekalinin/github-markdown-toc -->
1242
+
1243
+
1244
+ ## ✨ Features
1245
+
1246
+ | | |
1247
+ |:---|:---:|
1248
+ | <h3>🔍 Tracing</h3>Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic): **tracking inputs/outputs, latency, and cost** at every step.<br><br>Online evals can be applied to traces to measure quality on production data in real-time.<br><br>Export trace data to the Judgment Platform or your own S3 buckets, {Parquet, JSON, YAML} files, or data warehouse.<br><br>**Useful for:**<br>• 🐛 Debugging agent runs <br>• 👤 Tracking user activity <br>• 🔬 Pinpointing performance bottlenecks| <p align="center"><img src="assets/trace_screenshot.png" alt="Tracing visualization" width="1200"/></p> |
1249
+ | <h3>🧪 Evals</h3>15+ research-backed metrics including tool call accuracy, hallucinations, instruction adherence, and retrieval context recall.<br><br>Build custom evaluators that connect with our metric-tracking infrastructure. <br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 Experimental prompt testing<br>• 🛡️ Online guardrails <br><br> | <p align="center"><img src="assets/experiments_page.png" alt="Evaluation metrics" width="800"/></p> |
1250
+ | <h3>📡 Monitoring</h3>Real-time performance tracking of your agents in production environments. **Track all your metrics in one place.**<br><br>Set up **Slack/email alerts** for critical metrics and receive notifications when thresholds are exceeded.<br><br> **Useful for:** <br>•📉 Identifying degradation early <br>•📈 Visualizing performance trends across versions and time | <p align="center"><img src="assets/monitoring_screenshot.png" alt="Monitoring Dashboard" width="1200"/></p> |
1251
+ | <h3>📊 Datasets</h3>Export trace data or import external test cases to datasets hosted on Judgment's Platform. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations. <br><br> **Useful for:**<br>• 🔄 Scaled analysis for A/B tests <br>• 🗃️ Filtered collections of agent runtime data| <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |
1252
+ | <h3>💡 Insights</h3>Cluster on your data to reveal common use cases and failure modes.<br><br>Trace failures to their exact source with Judgment's Osiris agent, which localizes errors to specific components for precise fixes.<br><br> **Useful for:**<br>•🔮 Surfacing common inputs that lead to error<br>•🤖 Investigating agent/user behavior for optimization <br>| <p align="center"><img src="assets/dataset_clustering_screenshot_dm.png" alt="Insights dashboard" width="1200"/></p> |
1253
+
1254
+ ## 🛠️ Installation
1255
+
1256
+ Get started with Judgeval by installing our SDK using pip:
1257
+
1258
+ ```bash
1259
+ pip install judgeval
1260
+ ```
1261
+
1262
+ Ensure you have your `JUDGMENT_API_KEY` and `JUDGMENT_ORG_ID` environment variables set to connect to the @Judgment platform.
1263
+
1264
+ **If you don't have keys, @create an account on the platform!**
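As a sketch, the two variables can be exported in your shell before running any Judgeval code (the values below are placeholders for the credentials from your Judgment account):

```bash
# Placeholder values — substitute the API key and organization ID from your account
export JUDGMENT_API_KEY="your_api_key"
export JUDGMENT_ORG_ID="your_org_id"
```

Alternatively, since `python-dotenv` is a dependency, the same values can live in a `.env` file loaded with `load_dotenv()`, as the sample agents above do.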
1265
+
1266
+ ## 🏁 Get Started
1267
+
1268
+ Here's how you can quickly start using Judgeval:
1269
+
1270
+ ### 🛰️ Tracing
1271
+
1272
+ Track your agent's execution with full observability using just a few lines of code.
1273
+ Create a file named `traces.py` with the following code:
1274
+
1275
+ ```python
1276
+ from judgeval.common.tracer import Tracer, wrap
1277
+ from openai import OpenAI
1278
+
1279
+ client = wrap(OpenAI())
1280
+ judgment = Tracer(project_name="my_project")
1281
+
1282
+ @judgment.observe(span_type="tool")
1283
+ def my_tool():
1284
+ return "What's the capital of the U.S.?"
1285
+
1286
+ @judgment.observe(span_type="function")
1287
+ def main():
1288
+ task_input = my_tool()
1289
+ res = client.chat.completions.create(
1290
+ model="gpt-4.1",
1291
+ messages=[{"role": "user", "content": f"{task_input}"}]
1292
+ )
1293
+ return res.choices[0].message.content
1294
+
1295
+ main()
1296
+ ```
1297
+
1298
+ @Click here for a more detailed explanation.
1299
+
1300
+ ### 📝 Offline Evaluations
1301
+
1302
+ You can evaluate your agent's execution to measure quality metrics such as hallucination.
1303
+ Create a file named `evaluate.py` with the following code:
1304
+
1305
+ ```python evaluate.py
1306
+ from judgeval import JudgmentClient
1307
+ from judgeval.data import Example
1308
+ from judgeval.scorers import FaithfulnessScorer
1309
+
1310
+ client = JudgmentClient()
1311
+
1312
+ example = Example(
1313
+ input="What if these shoes don't fit?",
1314
+ actual_output="We offer a 30-day full refund at no extra cost.",
1315
+ retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
1316
+ )
1317
+
1318
+ scorer = FaithfulnessScorer(threshold=0.5)
1319
+ results = client.run_evaluation(
1320
+ examples=[example],
1321
+ scorers=[scorer],
1322
+ model="gpt-4.1",
1323
+ )
1324
+ print(results)
1325
+ ```
1326
+
1327
+ @Click here for a more detailed explanation.
1328
+
1329
+ ### 📡 Online Evaluations
1330
+
1331
+ Attach performance monitoring to traces to measure the quality of your systems in production.
1332
+
1333
+ Using the same `traces.py` file we created earlier, modify the `main` function:
1334
+
1335
+ ```python
1336
+ from judgeval.common.tracer import Tracer, wrap
1337
+ from judgeval.scorers import AnswerRelevancyScorer
1338
+ from openai import OpenAI
1339
+
1340
+ client = wrap(OpenAI())
1341
+ judgment = Tracer(project_name="my_project")
1342
+
1343
+ @judgment.observe(span_type="tool")
1344
+ def my_tool():
1345
+ return "Hello world!"
1346
+
1347
+ @judgment.observe(span_type="function")
1348
+ def main():
1349
+ task_input = my_tool()
1350
+ res = client.chat.completions.create(
1351
+ model="gpt-4.1",
1352
+ messages=[{"role": "user", "content": f"{task_input}"}]
1353
+ ).choices[0].message.content
1354
+
1355
+ judgment.async_evaluate(
1356
+ scorers=[AnswerRelevancyScorer(threshold=0.5)],
1357
+ input=task_input,
1358
+ actual_output=res,
1359
+ model="gpt-4.1"
1360
+ )
1361
+ print("Online evaluation submitted.")
1362
+ return res
1363
+
1364
+ main()
1365
+ ```
1366
+
1367
+ @Click here for a more detailed explanation.
1368
+
1369
+ ## 🏢 Self-Hosting
1370
+
1371
+ Run Judgment on your own infrastructure: we provide comprehensive self-hosting capabilities that give you full control over the backend and data plane that Judgeval interfaces with.
1372
+
1373
+ ### Key Features
1374
+ * Deploy Judgment on your own AWS account
1375
+ * Store data in your own Supabase instance
1376
+ * Access Judgment through your own custom domain
1377
+
1378
+ ### Getting Started
1379
+ 1. Check out our @self-hosting documentation for detailed setup instructions and for how to access your self-hosted instance
1380
+ 2. Use the @Judgment CLI to deploy your self-hosted environment
1381
+ 3. After your self-hosted instance is set up, make sure the `JUDGMENT_API_URL` environment variable is set to your self-hosted backend endpoint
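As a sketch, a self-hosted setup might export the backend endpoint alongside the usual credentials (the URL and key values below are placeholders for your own deployment):

```bash
# Placeholder values — substitute your self-hosted backend URL and your own credentials
export JUDGMENT_API_URL="https://your-judgment-backend.example.com"
export JUDGMENT_API_KEY="your_api_key"
export JUDGMENT_ORG_ID="your_org_id"
```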
1382
+
1383
+ ## 📚 Cookbooks
1384
+
1385
+ Have your own? We're happy to feature it if you create a PR or message us on @Discord.
1386
+
1387
+ You can access our repo of cookbooks @here. Here are some highlights:
1388
+
1389
+ ### Sample Agents
1390
+
1391
+ #### 💰 @LangGraph Financial QA Agent
1392
+ A LangGraph-based agent for financial queries, featuring RAG capabilities with a vector database for contextual data retrieval and evaluation of its reasoning and data accuracy.
1393
+
1394
+ #### ✈️ @OpenAI Travel Agent
1395
+ A travel planning agent using OpenAI API calls, custom tool functions, and RAG with a vector database for up-to-date and contextual travel information. Evaluated for itinerary quality and information relevance.
1396
+
1397
+ ### Custom Evaluators
1398
+
1399
+ #### 🔍 @PII Detection
1400
+ Detecting and evaluating Personal Identifiable Information (PII) leakage.
1401
+
1402
+ #### 📧 @Cold Email Generation
1403
+
1404
+ Evaluates if a cold email generator properly utilizes all relevant information about the target recipient.
1405
+
1406
+ ## ⭐ Star Us on GitHub
1407
+
1408
+ If you find Judgeval useful, please consider giving us a star on GitHub! Your support helps us grow our community and continue improving the product.
1409
+
1410
+
1411
+ ## ❤️ Contributors
1412
+
1413
+ There are many ways to contribute to Judgeval:
1414
+
1415
+ - Submit @bug reports and @feature requests
1416
+ - Review the documentation and submit @Pull Requests to improve it
1417
+ - Speak or write about Judgment and let us know!
1418
+
1419
+ <!-- Contributors collage -->
1420
+ [![Contributors](https://contributors-img.web.app/image?repo=JudgmentLabs/judgeval)](https://github.com/JudgmentLabs/judgeval/graphs/contributors)
1421
+
1422
+ ````
1423
+
1424
+ </details>
1425
+
1426
+ ## ⭐ Star Us on GitHub
1427
+
1428
+ If you find Judgeval useful, please consider giving us a star on GitHub! Your support helps us grow our community and continue improving the product.
1429
+
1430
+
1431
+ ## ❤️ Contributors
1432
+
1433
+ There are many ways to contribute to Judgeval:
1434
+
1435
+ - Submit [bug reports](https://github.com/JudgmentLabs/judgeval/issues) and [feature requests](https://github.com/JudgmentLabs/judgeval/issues)
1436
+ - Review the documentation and submit [Pull Requests](https://github.com/JudgmentLabs/judgeval/pulls) to improve it
1437
+ - Speak or write about Judgment and let us know!
1438
+
1439
+ <!-- Contributors collage -->
1440
+ [![Contributors](https://contributors-img.web.app/image?repo=JudgmentLabs/judgeval)](https://github.com/JudgmentLabs/judgeval/graphs/contributors)
1441
+