judgeval 0.0.55__py3-none-any.whl → 0.2.0__py3-none-any.whl

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (43)
  1. judgeval/common/api/__init__.py +3 -0
  2. judgeval/common/api/api.py +352 -0
  3. judgeval/common/api/constants.py +165 -0
  4. judgeval/common/storage/__init__.py +6 -0
  5. judgeval/common/tracer/__init__.py +31 -0
  6. judgeval/common/tracer/constants.py +22 -0
  7. judgeval/common/tracer/core.py +1916 -0
  8. judgeval/common/tracer/otel_exporter.py +108 -0
  9. judgeval/common/tracer/otel_span_processor.py +234 -0
  10. judgeval/common/tracer/span_processor.py +37 -0
  11. judgeval/common/tracer/span_transformer.py +211 -0
  12. judgeval/common/tracer/trace_manager.py +92 -0
  13. judgeval/common/utils.py +2 -2
  14. judgeval/constants.py +3 -30
  15. judgeval/data/datasets/eval_dataset_client.py +29 -156
  16. judgeval/data/judgment_types.py +4 -12
  17. judgeval/data/result.py +1 -1
  18. judgeval/data/scorer_data.py +2 -2
  19. judgeval/data/scripts/openapi_transform.py +1 -1
  20. judgeval/data/trace.py +66 -1
  21. judgeval/data/trace_run.py +0 -3
  22. judgeval/evaluation_run.py +0 -2
  23. judgeval/integrations/langgraph.py +43 -164
  24. judgeval/judgment_client.py +17 -211
  25. judgeval/run_evaluation.py +216 -611
  26. judgeval/scorers/__init__.py +2 -6
  27. judgeval/scorers/base_scorer.py +4 -23
  28. judgeval/scorers/judgeval_scorers/api_scorers/__init__.py +3 -3
  29. judgeval/scorers/judgeval_scorers/api_scorers/prompt_scorer.py +215 -0
  30. judgeval/scorers/score.py +2 -1
  31. judgeval/scorers/utils.py +1 -13
  32. judgeval/utils/requests.py +21 -0
  33. judgeval-0.2.0.dist-info/METADATA +202 -0
  34. {judgeval-0.0.55.dist-info → judgeval-0.2.0.dist-info}/RECORD +37 -29
  35. judgeval/common/tracer.py +0 -3215
  36. judgeval/scorers/judgeval_scorers/api_scorers/classifier_scorer.py +0 -73
  37. judgeval/scorers/judgeval_scorers/classifiers/__init__.py +0 -3
  38. judgeval/scorers/judgeval_scorers/classifiers/text2sql/__init__.py +0 -3
  39. judgeval/scorers/judgeval_scorers/classifiers/text2sql/text2sql_scorer.py +0 -53
  40. judgeval-0.0.55.dist-info/METADATA +0 -1384
  41. /judgeval/common/{s3_storage.py → storage/s3_storage.py} +0 -0
  42. {judgeval-0.0.55.dist-info → judgeval-0.2.0.dist-info}/WHEEL +0 -0
  43. {judgeval-0.0.55.dist-info → judgeval-0.2.0.dist-info}/licenses/LICENSE.md +0 -0
@@ -1,1384 +0,0 @@
1
- Metadata-Version: 2.4
2
- Name: judgeval
3
- Version: 0.0.55
4
- Summary: Judgeval Package
5
- Project-URL: Homepage, https://github.com/JudgmentLabs/judgeval
6
- Project-URL: Issues, https://github.com/JudgmentLabs/judgeval/issues
7
- Author-email: Andrew Li <andrew@judgmentlabs.ai>, Alex Shan <alex@judgmentlabs.ai>, Joseph Camyre <joseph@judgmentlabs.ai>
8
- License-Expression: Apache-2.0
9
- License-File: LICENSE.md
10
- Classifier: Operating System :: OS Independent
11
- Classifier: Programming Language :: Python :: 3
12
- Requires-Python: >=3.11
13
- Requires-Dist: anthropic
14
- Requires-Dist: boto3
15
- Requires-Dist: datamodel-code-generator>=0.31.1
16
- Requires-Dist: google-genai
17
- Requires-Dist: langchain-anthropic
18
- Requires-Dist: langchain-core
19
- Requires-Dist: langchain-huggingface
20
- Requires-Dist: langchain-openai
21
- Requires-Dist: litellm>=1.61.15
22
- Requires-Dist: matplotlib>=3.10.3
23
- Requires-Dist: nest-asyncio
24
- Requires-Dist: openai
25
- Requires-Dist: pandas
26
- Requires-Dist: python-dotenv==1.0.1
27
- Requires-Dist: requests
28
- Requires-Dist: together
29
- Description-Content-Type: text/markdown
30
-
31
- <div align="center">
32
-
33
- <img src="assets/new_lightmode.svg#gh-light-mode-only" alt="Judgment Logo" width="400" />
34
- <img src="assets/new_darkmode.svg#gh-dark-mode-only" alt="Judgment Logo" width="400" />
35
-
36
- <br>
37
- <div style="font-size: 1.5em;">
38
- Enable self-learning agents with traces, evals, and environment data.
39
- </div>
40
-
41
- ## [Docs](https://docs.judgmentlabs.ai/) • [Judgment Cloud](https://app.judgmentlabs.ai/register) • [Self-Host](https://docs.judgmentlabs.ai/documentation/self-hosting/get_started)
42
-
43
- [Demo](https://www.youtube.com/watch?v=1S4LixpVbcc) • [Bug Reports](https://github.com/JudgmentLabs/judgeval/issues) • [Changelog](https://docs.judgmentlabs.ai/changelog/2025-04-21)
44
-
45
- We're hiring! Join us in our mission to enable self-learning agents by providing the data and signals needed for monitoring and post-training.
46
-
47
- [![X](https://img.shields.io/badge/-X/Twitter-000?logo=x&logoColor=white)](https://x.com/JudgmentLabs)
48
- [![LinkedIn](https://custom-icon-badges.demolab.com/badge/LinkedIn%20-0A66C2?logo=linkedin-white&logoColor=fff)](https://www.linkedin.com/company/judgmentlabs)
49
- [![Discord](https://img.shields.io/badge/-Discord-5865F2?logo=discord&logoColor=white)](https://discord.gg/tGVFf8UBUY)
50
-
51
- <img src="assets/product_shot.png" alt="Judgment Platform" width="800" />
52
-
53
- </div>
54
-
55
- Judgeval offers **open-source tooling** for tracing and evaluating autonomous, stateful agents. It **provides runtime data from agent-environment interactions** for continuous learning and self-improvement.
56
-
57
- ## 🎬 See Judgeval in Action
58
-
59
- **[Multi-Agent System](https://github.com/JudgmentLabs/judgment-cookbook/tree/main/cookbooks/agents/multi-agent) with complete observability:** (1) A multi-agent system spawns agents to research topics on the internet. (2) With just **3 lines of code**, Judgeval traces every input/output + environment response across all agent tool calls for debugging. (3) After the agents complete, (4) all interaction data can be exported to enable further environment-specific learning and optimization.
60
-
61
- <table style="width: 100%; max-width: 800px; table-layout: fixed;">
62
- <tr>
63
- <td align="center" style="padding: 8px; width: 50%;">
64
- <img src="assets/agent.gif" alt="Agent Demo" style="width: 100%; max-width: 350px; height: auto;" />
65
- <br><strong>🤖 Agents Running</strong>
66
- </td>
67
- <td align="center" style="padding: 8px; width: 50%;">
68
- <img src="assets/trace.gif" alt="Trace Demo" style="width: 100%; max-width: 350px; height: auto;" />
69
- <br><strong>📊 Real-time Tracing</strong>
70
- </td>
71
- </tr>
72
- <tr>
73
- <td align="center" style="padding: 8px; width: 50%;">
74
- <img src="assets/document.gif" alt="Agent Completed Demo" style="width: 100%; max-width: 350px; height: auto;" />
75
- <br><strong>✅ Agents Completed Running</strong>
76
- </td>
77
- <td align="center" style="padding: 8px; width: 50%;">
78
- <img src="assets/data.gif" alt="Data Export Demo" style="width: 100%; max-width: 350px; height: auto;" />
79
- <br><strong>📤 Exporting Agent Environment Data</strong>
80
- </td>
81
- </tr>
82
-
83
- </table>
84
-
85
- ## 📋 Table of Contents
86
- - [🛠️ Installation](#️-installation)
87
- - [🏁 Quickstarts](#-quickstarts)
88
- - [✨ Features](#-features)
89
- - [🏢 Self-Hosting](#-self-hosting)
90
- - [📚 Cookbooks](#-cookbooks)
91
- - [💻 Development with Cursor](#-development-with-cursor)
92
-
93
- ## 🛠️ Installation
94
-
95
- Get started with Judgeval by installing our SDK using pip:
96
-
97
- ```bash
98
- pip install judgeval
99
- ```
100
-
101
- Ensure you have your `JUDGMENT_API_KEY` and `JUDGMENT_ORG_ID` environment variables set to connect to the [Judgment Platform](https://app.judgmentlabs.ai/).
102
-
103
- ```bash
104
- export JUDGMENT_API_KEY=...
105
- export JUDGMENT_ORG_ID=...
106
- ```
107
-
108
- **If you don't have keys, [create an account](https://app.judgmentlabs.ai/register) on the platform!**
109
-
110
- ## 🏁 Quickstarts
111
-
112
- ### 🛰️ Tracing
113
-
114
- Create a file named `agent.py` with the following code:
115
-
116
- ```python
117
- from judgeval.tracer import Tracer, wrap
118
- from openai import OpenAI
119
-
120
- client = wrap(OpenAI()) # tracks all LLM calls
121
- judgment = Tracer(project_name="my_project")
122
-
123
- @judgment.observe(span_type="tool")
124
- def format_question(question: str) -> str:
125
- # dummy tool
126
- return f"Question : {question}"
127
-
128
- @judgment.observe(span_type="function")
129
- def run_agent(prompt: str) -> str:
130
- task = format_question(prompt)
131
- response = client.chat.completions.create(
132
- model="gpt-4.1",
133
- messages=[{"role": "user", "content": task}]
134
- )
135
- return response.choices[0].message.content
136
-
137
- run_agent("What is the capital of the United States?")
138
- ```
139
- You'll see your trace exported to the Judgment Platform:
140
-
141
- <p align="center"><img src="assets/trace_demo.png" alt="Judgment Platform Trace Example" width="800" /></p>
142
-
143
-
144
- [Click here](https://docs.judgmentlabs.ai/documentation/tracing/introduction) for a more detailed explanation.
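-
- Beyond tracing, you can attach an online evaluation inside any observed function. A minimal sketch (it mirrors the online-evaluation example embedded later in this file, using the same import path as the quickstart above; the scorer threshold and model name are illustrative):
-
- ```python
- from judgeval.tracer import Tracer, wrap
- from judgeval.scorers import AnswerRelevancyScorer
- from openai import OpenAI
-
- client = wrap(OpenAI())  # tracks all LLM calls
- judgment = Tracer(project_name="my_project")
-
- @judgment.observe(span_type="function")
- def answer(question: str) -> str:
-     res = client.chat.completions.create(
-         model="gpt-4.1",
-         messages=[{"role": "user", "content": question}],
-     ).choices[0].message.content
-
-     # Score the response asynchronously; results show up alongside the trace.
-     judgment.async_evaluate(
-         scorers=[AnswerRelevancyScorer(threshold=0.5)],
-         input=question,
-         actual_output=res,
-         model="gpt-4.1",
-     )
-     return res
- ```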
145
-
146
-
147
- <!-- Created by https://github.com/ekalinin/github-markdown-toc -->
148
-
149
-
150
- ## ✨ Features
151
-
152
- | | |
153
- |:---|:---:|
154
- | <h3>🔍 Tracing</h3>Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic). **Tracks inputs/outputs, agent tool calls, latency, cost, and custom metadata** at every step.<br><br>**Useful for:**<br>• 🐛 Debugging agent runs <br>• 📋 Collecting agent environment data <br>• 🔬 Pinpointing performance bottlenecks| <p align="center"><img src="assets/trace_screenshot.png" alt="Tracing visualization" width="1200"/></p> |
155
- | <h3>🧪 Evals</h3>Build custom evaluators on top of your agents. Judgeval supports LLM-as-a-judge, manual labeling, and code-based evaluators that connect with our metric-tracking infrastructure (a minimal example follows this table). <br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 A/B testing <br>• 🛡️ Online guardrails | <p align="center"><img src="assets/experiments_page.png" alt="Evaluation metrics" width="800"/></p> |
156
- | <h3>📡 Monitoring</h3>Get Slack alerts when your agent fails in production. Add custom hooks to address production regressions.<br><br> **Useful for:** <br>• 📉 Identifying degradation early <br>• 📈 Visualizing performance trends across agent versions and time | <p align="center"><img src="assets/error_analysis_dashboard.png" alt="Monitoring Dashboard" width="1200"/></p> |
157
- | <h3>📊 Datasets</h3>Export traces and test cases to datasets for scaled analysis and optimization. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations, enabling continuous learning from production interactions. <br><br> **Useful for:**<br>• 🗃️ Agent environment interaction data for optimization<br>• 🔄 Scaled analysis for A/B tests | <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |
158
-
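- As a concrete illustration of the Evals row above, here is a minimal offline evaluation (it mirrors the evaluation quickstart embedded later in this file; the model name and threshold are illustrative):
-
- ```python
- from judgeval import JudgmentClient
- from judgeval.data import Example
- from judgeval.scorers import FaithfulnessScorer
-
- client = JudgmentClient()
-
- # A single test case: the agent's answer plus the retrieval context it should stay faithful to
- example = Example(
-     input="What if these shoes don't fit?",
-     actual_output="We offer a 30-day full refund at no extra cost.",
-     retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
- )
-
- results = client.run_evaluation(
-     examples=[example],
-     scorers=[FaithfulnessScorer(threshold=0.5)],
-     model="gpt-4.1",
- )
- print(results)
- ```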
159
- ## 🏢 Self-Hosting
160
-
161
- Run Judgment on your own infrastructure: we provide comprehensive self-hosting capabilities that give you full control over the backend and data plane that Judgeval interfaces with.
162
-
163
- ### Key Features
164
- * Deploy Judgment on your own AWS account
165
- * Store data in your own Supabase instance
166
- * Access Judgment through your own custom domain
167
-
168
- ### Getting Started
169
- 1. Check out our [self-hosting documentation](https://docs.judgmentlabs.ai/documentation/self-hosting/get_started) for detailed setup instructions, including how to access your self-hosted instance
170
- 2. Use the [Judgment CLI](https://docs.judgmentlabs.ai/documentation/developer-tools/judgment-cli/installation) to deploy your self-hosted environment
171
- 3. After your self-hosted instance is set up, make sure the `JUDGMENT_API_URL` environment variable is set to your self-hosted backend endpoint (see the sketch after this list)
172
-
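- For instance, a minimal sketch (the URL below is a placeholder for your own deployment, and it assumes the SDK reads these variables when a client is constructed):
-
- ```python
- import os
-
- # Placeholder endpoint -- substitute your self-hosted backend URL
- os.environ["JUDGMENT_API_URL"] = "https://judgment.your-domain.example"
- os.environ["JUDGMENT_API_KEY"] = "..."
- os.environ["JUDGMENT_ORG_ID"] = "..."
- ```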
173
- ## 📚 Cookbooks
174
-
175
- Have your own? We're happy to feature it if you create a PR or message us on [Discord](https://discord.gg/tGVFf8UBUY).
176
-
177
- You can access our repo of cookbooks [here](https://github.com/JudgmentLabs/judgment-cookbook).
178
-
179
- ## 💻 Development with Cursor
180
- When building agents and LLM workflows in Cursor, providing proper context to your coding assistant helps ensure seamless integration with Judgment. This rule file supplies the essential context your coding assistant needs for successful implementation.
181
-
182
- To use this rule file, copy the text below and save it in a ".cursor/rules" directory at your project's root.
183
-
184
- <details>
185
-
186
- <summary>Cursor Rule File</summary>
187
-
188
- ````
189
- ---
190
- You are an expert in helping users integrate Judgment with their codebase. When you are helping someone integrate Judgment tracing or evaluations with their agents/workflows, refer to this file.
191
- ---
192
-
193
- # Common Questions You May Get from the User (and How to Handle These Cases):
194
-
195
- ## Sample Agent 1:
196
- ```
197
- from uuid import uuid4
198
- import openai
199
- import os
200
- import asyncio
201
- from tavily import TavilyClient
202
- from dotenv import load_dotenv
203
- import chromadb
204
- from chromadb.utils import embedding_functions
205
-
206
- destinations_data = [
207
- {
208
- "destination": "Paris, France",
209
- "information": """
210
- Paris is the capital city of France and a global center for art, fashion, and culture.
211
- Key Information:
212
- - Best visited during spring (March-May) or fall (September-November)
213
- - Famous landmarks: Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, Arc de Triomphe
214
- - Known for: French cuisine, café culture, fashion, art galleries
215
- - Local transportation: Metro system is extensive and efficient
216
- - Popular neighborhoods: Le Marais, Montmartre, Latin Quarter
217
- - Cultural tips: Basic French phrases are appreciated; many restaurants close between lunch and dinner
218
- - Must-try experiences: Seine River cruise, visiting local bakeries, Luxembourg Gardens
219
- """
220
- },
221
- {
222
- "destination": "Tokyo, Japan",
223
- "information": """
224
- Tokyo is Japan's bustling capital, blending ultramodern and traditional elements.
225
- Key Information:
226
- - Best visited during spring (cherry blossoms) or fall (autumn colors)
227
- - Famous areas: Shibuya, Shinjuku, Harajuku, Akihabara
228
- - Known for: Technology, anime culture, sushi, efficient public transport
229
- - Local transportation: Extensive train and subway network
230
- - Cultural tips: Bow when greeting, remove shoes indoors, no tipping
231
- - Must-try experiences: Robot Restaurant, teamLab Borderless, Tsukiji Outer Market
232
- - Popular day trips: Mount Fuji, Kamakura, Nikko
233
- """
234
- },
235
- {
236
- "destination": "New York City, USA",
237
- "information": """
238
- New York City is a global metropolis known for its diversity, culture, and iconic skyline.
239
- Key Information:
240
- - Best visited during spring (April-June) or fall (September-November)
241
- - Famous landmarks: Statue of Liberty, Times Square, Central Park, Empire State Building
242
- - Known for: Broadway shows, diverse cuisine, shopping, museums
243
- - Local transportation: Extensive subway system, yellow cabs, ride-sharing
244
- - Popular areas: Manhattan, Brooklyn, Queens
245
- - Cultural tips: Fast-paced environment, tipping expected (15-20%)
246
- - Must-try experiences: Broadway show, High Line walk, food tours
247
- """
248
- },
249
- {
250
- "destination": "Barcelona, Spain",
251
- "information": """
252
- Barcelona is a vibrant city known for its art, architecture, and Mediterranean culture.
253
- Key Information:
254
- - Best visited during spring and fall for mild weather
255
- - Famous landmarks: Sagrada Familia, Park Güell, Casa Batlló
256
- - Known for: Gaudi architecture, tapas, beach culture, FC Barcelona
257
- - Local transportation: Metro, buses, and walkable city center
258
- - Popular areas: Gothic Quarter, Eixample, La Barceloneta
259
- - Cultural tips: Late dinner times (after 8 PM), siesta tradition
260
- - Must-try experiences: La Rambla walk, tapas crawl, local markets
261
- """
262
- },
263
- {
264
- "destination": "Bangkok, Thailand",
265
- "information": """
266
- Bangkok is Thailand's capital city, famous for its temples, street food, and vibrant culture.
267
- Key Information:
268
- - Best visited during November to February (cool and dry season)
269
- - Famous sites: Grand Palace, Wat Phra Kaew, Wat Arun
270
- - Known for: Street food, temples, markets, nightlife
271
- - Local transportation: BTS Skytrain, MRT, tuk-tuks, river boats
272
- - Popular areas: Sukhumvit, Old City, Chinatown
273
- - Cultural tips: Dress modestly at temples, respect royal family
274
- - Must-try experiences: Street food tours, river cruises, floating markets
275
- """
276
- }
277
- ]
278
-
279
- client = openai.Client(api_key=os.getenv("OPENAI_API_KEY"))
280
-
281
- def populate_vector_db(collection, destinations_data):
282
- """
283
- Populate the vector DB with travel information.
284
- destinations_data should be a list of dictionaries with 'destination' and 'information' keys
285
- """
286
- for data in destinations_data:
287
- collection.add(
288
- documents=[data['information']],
289
- metadatas=[{"destination": data['destination']}],
290
- ids=[f"destination_{data['destination'].lower().replace(' ', '_')}"]
291
- )
292
-
293
- def search_tavily(query):
294
- """Fetch travel data using Tavily API."""
295
- API_KEY = os.getenv("TAVILY_API_KEY")
296
- client = TavilyClient(api_key=API_KEY)
297
- results = client.search(query, num_results=3)
298
- return results
299
-
300
- async def get_attractions(destination):
301
- """Search for top attractions in the destination."""
302
- prompt = f"Best tourist attractions in {destination}"
303
- attractions_search = search_tavily(prompt)
304
- return attractions_search
305
-
306
- async def get_hotels(destination):
307
- """Search for hotels in the destination."""
308
- prompt = f"Best hotels in {destination}"
309
- hotels_search = search_tavily(prompt)
310
- return hotels_search
311
-
312
- async def get_flights(destination):
313
- """Search for flights to the destination."""
314
- prompt = f"Flights to {destination} from major cities"
315
- flights_search = search_tavily(prompt)
316
- return flights_search
317
-
318
- async def get_weather(destination, start_date, end_date):
319
- """Search for weather information."""
320
- prompt = f"Weather forecast for {destination} from {start_date} to {end_date}"
321
- weather_search = search_tavily(prompt)
322
- return weather_search
323
-
324
- def initialize_vector_db():
325
- """Initialize ChromaDB with OpenAI embeddings."""
326
- client = chromadb.Client()
327
- embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
328
- api_key=os.getenv("OPENAI_API_KEY"),
329
- model_name="text-embedding-3-small"
330
- )
331
- res = client.get_or_create_collection(
332
- "travel_information",
333
- embedding_function=embedding_fn
334
- )
335
- populate_vector_db(res, destinations_data)
336
- return res
337
-
338
- def query_vector_db(collection, destination, k=3):
339
- """Query the vector database for existing travel information."""
340
- try:
341
- results = collection.query(
342
- query_texts=[destination],
343
- n_results=k
344
- )
345
- return results['documents'][0] if results['documents'] else []
346
- except Exception:
347
- return []
348
-
349
- async def research_destination(destination, start_date, end_date):
350
- """Gather all necessary travel information for a destination."""
351
- # First, check the vector database
352
- collection = initialize_vector_db()
353
- existing_info = query_vector_db(collection, destination)
354
-
355
- # Get real-time information from Tavily
356
- tavily_data = {
357
- "attractions": await get_attractions(destination),
358
- "hotels": await get_hotels(destination),
359
- "flights": await get_flights(destination),
360
- "weather": await get_weather(destination, start_date, end_date)
361
- }
362
-
363
- return {
364
- "vector_db_results": existing_info,
365
- **tavily_data
366
- }
367
-
368
- async def create_travel_plan(destination, start_date, end_date, research_data):
369
- """Generate a travel itinerary using the researched data."""
370
- vector_db_context = "\n".join(research_data['vector_db_results']) if research_data['vector_db_results'] else "No pre-stored information available."
371
-
372
- prompt = f"""
373
- Create a structured travel itinerary for a trip to {destination} from {start_date} to {end_date}.
374
-
375
- Pre-stored destination information:
376
- {vector_db_context}
377
-
378
- Current travel data:
379
- - Attractions: {research_data['attractions']}
380
- - Hotels: {research_data['hotels']}
381
- - Flights: {research_data['flights']}
382
- - Weather: {research_data['weather']}
383
- """
384
-
385
- response = client.chat.completions.create(
386
- model="gpt-4.1",
387
- messages=[
388
- {"role": "system", "content": "You are an expert travel planner. Combine both historical and current information to create the best possible itinerary."},
389
- {"role": "user", "content": prompt}
390
- ]
391
- ).choices[0].message.content
392
-
393
- return response
394
-
395
- async def generate_itinerary(destination, start_date, end_date):
396
- """Main function to generate a travel itinerary."""
397
- research_data = await research_destination(destination, start_date, end_date)
398
- res = await create_travel_plan(destination, start_date, end_date, research_data)
399
- return res
400
-
401
-
402
- if __name__ == "__main__":
403
- load_dotenv()
404
- destination = input("Enter your travel destination: ")
405
- start_date = input("Enter start date (YYYY-MM-DD): ")
406
- end_date = input("Enter end date (YYYY-MM-DD): ")
407
- itinerary = asyncio.run(generate_itinerary(destination, start_date, end_date))
408
- print("\nGenerated Itinerary:\n", itinerary)
409
- ```
410
-
411
- ## Sample Query 1:
412
- Can you add Judgment tracing to my file?
413
-
414
- ## Example of Modified Code after Query 1:
415
- ```
416
- from uuid import uuid4
417
- import openai
418
- import os
419
- import asyncio
420
- from tavily import TavilyClient
421
- from dotenv import load_dotenv
422
- import chromadb
423
- from chromadb.utils import embedding_functions
424
-
425
- from judgeval.tracer import Tracer, wrap
426
- from judgeval.scorers import AnswerRelevancyScorer, FaithfulnessScorer
427
- from judgeval.data import Example
428
-
429
- destinations_data = [
430
- {
431
- "destination": "Paris, France",
432
- "information": """
433
- Paris is the capital city of France and a global center for art, fashion, and culture.
434
- Key Information:
435
- - Best visited during spring (March-May) or fall (September-November)
436
- - Famous landmarks: Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, Arc de Triomphe
437
- - Known for: French cuisine, café culture, fashion, art galleries
438
- - Local transportation: Metro system is extensive and efficient
439
- - Popular neighborhoods: Le Marais, Montmartre, Latin Quarter
440
- - Cultural tips: Basic French phrases are appreciated; many restaurants close between lunch and dinner
441
- - Must-try experiences: Seine River cruise, visiting local bakeries, Luxembourg Gardens
442
- """
443
- },
444
- {
445
- "destination": "Tokyo, Japan",
446
- "information": """
447
- Tokyo is Japan's bustling capital, blending ultramodern and traditional elements.
448
- Key Information:
449
- - Best visited during spring (cherry blossoms) or fall (autumn colors)
450
- - Famous areas: Shibuya, Shinjuku, Harajuku, Akihabara
451
- - Known for: Technology, anime culture, sushi, efficient public transport
452
- - Local transportation: Extensive train and subway network
453
- - Cultural tips: Bow when greeting, remove shoes indoors, no tipping
454
- - Must-try experiences: Robot Restaurant, teamLab Borderless, Tsukiji Outer Market
455
- - Popular day trips: Mount Fuji, Kamakura, Nikko
456
- """
457
- },
458
- {
459
- "destination": "New York City, USA",
460
- "information": """
461
- New York City is a global metropolis known for its diversity, culture, and iconic skyline.
462
- Key Information:
463
- - Best visited during spring (April-June) or fall (September-November)
464
- - Famous landmarks: Statue of Liberty, Times Square, Central Park, Empire State Building
465
- - Known for: Broadway shows, diverse cuisine, shopping, museums
466
- - Local transportation: Extensive subway system, yellow cabs, ride-sharing
467
- - Popular areas: Manhattan, Brooklyn, Queens
468
- - Cultural tips: Fast-paced environment, tipping expected (15-20%)
469
- - Must-try experiences: Broadway show, High Line walk, food tours
470
- """
471
- },
472
- {
473
- "destination": "Barcelona, Spain",
474
- "information": """
475
- Barcelona is a vibrant city known for its art, architecture, and Mediterranean culture.
476
- Key Information:
477
- - Best visited during spring and fall for mild weather
478
- - Famous landmarks: Sagrada Familia, Park Güell, Casa Batlló
479
- - Known for: Gaudi architecture, tapas, beach culture, FC Barcelona
480
- - Local transportation: Metro, buses, and walkable city center
481
- - Popular areas: Gothic Quarter, Eixample, La Barceloneta
482
- - Cultural tips: Late dinner times (after 8 PM), siesta tradition
483
- - Must-try experiences: La Rambla walk, tapas crawl, local markets
484
- """
485
- },
486
- {
487
- "destination": "Bangkok, Thailand",
488
- "information": """
489
- Bangkok is Thailand's capital city, famous for its temples, street food, and vibrant culture.
490
- Key Information:
491
- - Best visited during November to February (cool and dry season)
492
- - Famous sites: Grand Palace, Wat Phra Kaew, Wat Arun
493
- - Known for: Street food, temples, markets, nightlife
494
- - Local transportation: BTS Skytrain, MRT, tuk-tuks, river boats
495
- - Popular areas: Sukhumvit, Old City, Chinatown
496
- - Cultural tips: Dress modestly at temples, respect royal family
497
- - Must-try experiences: Street food tours, river cruises, floating markets
498
- """
499
- }
500
- ]
501
-
502
- client = wrap(openai.Client(api_key=os.getenv("OPENAI_API_KEY")))
503
- judgment = Tracer(api_key=os.getenv("JUDGMENT_API_KEY"), project_name="travel_agent_demo")
504
-
505
- def populate_vector_db(collection, destinations_data):
506
- """
507
- Populate the vector DB with travel information.
508
- destinations_data should be a list of dictionaries with 'destination' and 'information' keys
509
- """
510
- for data in destinations_data:
511
- collection.add(
512
- documents=[data['information']],
513
- metadatas=[{"destination": data['destination']}],
514
- ids=[f"destination_{data['destination'].lower().replace(' ', '_')}"]
515
- )
516
-
517
- @judgment.observe(span_type="search_tool")
518
- def search_tavily(query):
519
- """Fetch travel data using Tavily API."""
520
- API_KEY = os.getenv("TAVILY_API_KEY")
521
- client = TavilyClient(api_key=API_KEY)
522
- results = client.search(query, num_results=3)
523
- return results
524
-
525
- @judgment.observe(span_type="tool")
526
- async def get_attractions(destination):
527
- """Search for top attractions in the destination."""
528
- prompt = f"Best tourist attractions in {destination}"
529
- attractions_search = search_tavily(prompt)
530
- return attractions_search
531
-
532
- @judgment.observe(span_type="tool")
533
- async def get_hotels(destination):
534
- """Search for hotels in the destination."""
535
- prompt = f"Best hotels in {destination}"
536
- hotels_search = search_tavily(prompt)
537
- return hotels_search
538
-
539
- @judgment.observe(span_type="tool")
540
- async def get_flights(destination):
541
- """Search for flights to the destination."""
542
- prompt = f"Flights to {destination} from major cities"
543
- flights_search = search_tavily(prompt)
544
- example = Example(
545
- input=prompt,
546
- actual_output=str(flights_search["results"])
547
- )
548
- judgment.async_evaluate(
549
- scorers=[AnswerRelevancyScorer(threshold=0.5)],
550
- example=example,
551
- model="gpt-4.1"
552
- )
553
- return flights_search
554
-
555
- @judgment.observe(span_type="tool")
556
- async def get_weather(destination, start_date, end_date):
557
- """Search for weather information."""
558
- prompt = f"Weather forecast for {destination} from {start_date} to {end_date}"
559
- weather_search = search_tavily(prompt)
560
- example = Example(
561
- input=prompt,
562
- actual_output=str(weather_search["results"])
563
- )
564
- judgment.async_evaluate(
565
- scorers=[AnswerRelevancyScorer(threshold=0.5)],
566
- example=example,
567
- model="gpt-4.1"
568
- )
569
- return weather_search
570
-
571
- def initialize_vector_db():
572
- """Initialize ChromaDB with OpenAI embeddings."""
573
- client = chromadb.Client()
574
- embedding_fn = embedding_functions.OpenAIEmbeddingFunction(
575
- api_key=os.getenv("OPENAI_API_KEY"),
576
- model_name="text-embedding-3-small"
577
- )
578
- res = client.get_or_create_collection(
579
- "travel_information",
580
- embedding_function=embedding_fn
581
- )
582
- populate_vector_db(res, destinations_data)
583
- return res
584
-
585
- @judgment.observe(span_type="retriever")
586
- def query_vector_db(collection, destination, k=3):
587
- """Query the vector database for existing travel information."""
588
- try:
589
- results = collection.query(
590
- query_texts=[destination],
591
- n_results=k
592
- )
593
- return results['documents'][0] if results['documents'] else []
594
- except Exception:
595
- return []
596
-
597
- @judgment.observe(span_type="Research")
598
- async def research_destination(destination, start_date, end_date):
599
- """Gather all necessary travel information for a destination."""
600
- # First, check the vector database
601
- collection = initialize_vector_db()
602
- existing_info = query_vector_db(collection, destination)
603
-
604
- # Get real-time information from Tavily
605
- tavily_data = {
606
- "attractions": await get_attractions(destination),
607
- "hotels": await get_hotels(destination),
608
- "flights": await get_flights(destination),
609
- "weather": await get_weather(destination, start_date, end_date)
610
- }
611
-
612
- return {
613
- "vector_db_results": existing_info,
614
- **tavily_data
615
- }
616
-
617
- @judgment.observe(span_type="function")
618
- async def create_travel_plan(destination, start_date, end_date, research_data):
619
- """Generate a travel itinerary using the researched data."""
620
- vector_db_context = "\n".join(research_data['vector_db_results']) if research_data['vector_db_results'] else "No pre-stored information available."
621
-
622
- prompt = f"""
623
- Create a structured travel itinerary for a trip to {destination} from {start_date} to {end_date}.
624
-
625
- Pre-stored destination information:
626
- {vector_db_context}
627
-
628
- Current travel data:
629
- - Attractions: {research_data['attractions']}
630
- - Hotels: {research_data['hotels']}
631
- - Flights: {research_data['flights']}
632
- - Weather: {research_data['weather']}
633
- """
634
-
635
- response = client.chat.completions.create(
636
- model="gpt-4.1",
637
- messages=[
638
- {"role": "system", "content": "You are an expert travel planner. Combine both historical and current information to create the best possible itinerary."},
639
- {"role": "user", "content": prompt}
640
- ]
641
- ).choices[0].message.content
642
-
643
- example = Example(
644
- input=prompt,
645
- actual_output=str(response),
646
- retrieval_context=[str(vector_db_context), str(research_data)]
647
- )
648
- judgment.async_evaluate(
649
- scorers=[FaithfulnessScorer(threshold=0.5)],
650
- example=example,
651
- model="gpt-4.1"
652
- )
653
-
654
- return response
655
-
656
- @judgment.observe(span_type="function")
657
- async def generate_itinerary(destination, start_date, end_date):
658
- """Main function to generate a travel itinerary."""
659
- research_data = await research_destination(destination, start_date, end_date)
660
- res = await create_travel_plan(destination, start_date, end_date, research_data)
661
- return res
662
-
663
-
664
- if __name__ == "__main__":
665
- load_dotenv()
666
- destination = input("Enter your travel destination: ")
667
- start_date = input("Enter start date (YYYY-MM-DD): ")
668
- end_date = input("Enter end date (YYYY-MM-DD): ")
669
- itinerary = asyncio.run(generate_itinerary(destination, start_date, end_date))
670
- print("\nGenerated Itinerary:\n", itinerary)
671
- ```
672
-
673
- ## Sample Agent 2
674
- ```
675
- from langchain_openai import ChatOpenAI
676
- import asyncio
677
- import os
678
-
679
- import chromadb
680
- from chromadb.utils import embedding_functions
681
-
682
- from vectordbdocs import financial_data
683
-
684
- from typing import Optional
685
- from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage, ChatMessage
686
- from typing_extensions import TypedDict
687
- from langgraph.graph import StateGraph
688
-
689
- # Define our state type
690
- class AgentState(TypedDict):
691
- messages: list[BaseMessage]
692
- category: Optional[str]
693
- documents: Optional[str]
694
-
695
- def populate_vector_db(collection, raw_data):
696
- """
697
- Populate the vector DB with financial information.
698
- """
699
- for data in raw_data:
700
- collection.add(
701
- documents=[data['information']],
702
- metadatas=[{"category": data['category']}],
703
- ids=[f"category_{data['category'].lower().replace(' ', '_')}_{os.urandom(4).hex()}"]
704
- )
705
-
706
- # Define a ChromaDB collection for document storage
707
- client = chromadb.Client()
708
- collection = client.get_or_create_collection(
709
- name="financial_docs",
710
- embedding_function=embedding_functions.OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))
711
- )
712
-
713
- populate_vector_db(collection, financial_data)
714
-
715
- def pnl_retriever(state: AgentState) -> AgentState:
716
- query = state["messages"][-1].content
717
- results = collection.query(
718
- query_texts=[query],
719
- where={"category": "pnl"},
720
- n_results=3
721
- )
722
- documents = []
723
- for document in results["documents"]:
724
- documents += document
725
-
726
- return {"messages": state["messages"], "documents": documents}
727
-
728
- def balance_sheet_retriever(state: AgentState) -> AgentState:
729
- query = state["messages"][-1].content
730
- results = collection.query(
731
- query_texts=[query],
732
- where={"category": "balance_sheets"},
733
- n_results=3
734
- )
735
- documents = []
736
- for document in results["documents"]:
737
- documents += document
738
-
739
- return {"messages": state["messages"], "documents": documents}
740
-
741
- def stock_retriever(state: AgentState) -> AgentState:
742
- query = state["messages"][-1].content
743
- results = collection.query(
744
- query_texts=[query],
745
- where={"category": "stocks"},
746
- n_results=3
747
- )
748
- documents = []
749
- for document in results["documents"]:
750
- documents += document
751
-
752
- return {"messages": state["messages"], "documents": documents}
753
-
754
- async def bad_classifier(state: AgentState) -> AgentState:
755
- return {"messages": state["messages"], "category": "stocks"}
756
-
757
- async def bad_classify(state: AgentState) -> AgentState:
758
- category = await bad_classifier(state)
759
-
760
- return {"messages": state["messages"], "category": category["category"]}
761
-
762
- async def bad_sql_generator(state: AgentState) -> AgentState:
763
- ACTUAL_OUTPUT = "SELECT * FROM pnl WHERE stock_symbol = 'GOOGL'"
764
- return {"messages": state["messages"] + [ChatMessage(content=ACTUAL_OUTPUT, role="text2sql")]}
765
-
766
- # Create the classifier node with a system prompt
767
- async def classify(state: AgentState) -> AgentState:
768
- messages = state["messages"]
769
- input_msg = [
770
- SystemMessage(content="""You are a financial query classifier. Your job is to classify user queries into one of three categories:
771
- - 'pnl' for Profit and Loss related queries
772
- - 'balance_sheets' for Balance Sheet related queries
773
- - 'stocks' for Stock market related queries
774
-
775
- Respond ONLY with the category name in lowercase, nothing else."""),
776
- *messages
777
- ]
778
-
779
- response = ChatOpenAI(model="gpt-4.1", temperature=0).invoke(
780
- input=input_msg
781
- )
782
-
783
- return {"messages": state["messages"], "category": response.content}
784
-
785
- # Add router node to direct flow based on classification
786
- def router(state: AgentState) -> str:
787
- return state["category"]
788
-
789
- async def generate_response(state: AgentState) -> AgentState:
790
- messages = state["messages"]
791
- documents = state.get("documents", "")
792
-
793
- OUTPUT = """
794
- SELECT
795
- stock_symbol,
796
- SUM(CASE WHEN transaction_type = 'buy' THEN quantity ELSE -quantity END) AS total_shares,
797
- SUM(CASE WHEN transaction_type = 'buy' THEN quantity * price_per_share ELSE -quantity * price_per_share END) AS total_cost,
798
- MAX(CASE WHEN transaction_type = 'buy' THEN price_per_share END) AS current_market_price
799
- FROM
800
- stock_transactions
801
- WHERE
802
- stock_symbol = 'META'
803
- GROUP BY
804
- stock_symbol;
805
- """
806
-
807
- return {"messages": messages + [ChatMessage(content=OUTPUT, role="text2sql")], "documents": documents}
808
-
809
- async def main():
810
- # Initialize the graph
811
- graph_builder = StateGraph(AgentState)
812
-
813
- # Add classifier node
814
- # For failure test, pass in bad_classifier
815
- graph_builder.add_node("classifier", classify)
816
- # graph_builder.add_node("classifier", bad_classify)
817
-
818
- # Add conditional edges based on classification
819
- graph_builder.add_conditional_edges(
820
- "classifier",
821
- router,
822
- {
823
- "pnl": "pnl_retriever",
824
- "balance_sheets": "balance_sheet_retriever",
825
- "stocks": "stock_retriever"
826
- }
827
- )
828
-
829
- # Add retriever nodes (placeholder functions for now)
830
- graph_builder.add_node("pnl_retriever", pnl_retriever)
831
- graph_builder.add_node("balance_sheet_retriever", balance_sheet_retriever)
832
- graph_builder.add_node("stock_retriever", stock_retriever)
833
-
834
- # Add edges from retrievers to response generator
835
- graph_builder.add_node("response_generator", generate_response)
836
- # graph_builder.add_node("response_generator", bad_sql_generator)
837
- graph_builder.add_edge("pnl_retriever", "response_generator")
838
- graph_builder.add_edge("balance_sheet_retriever", "response_generator")
839
- graph_builder.add_edge("stock_retriever", "response_generator")
840
-
841
- graph_builder.set_entry_point("classifier")
842
- graph_builder.set_finish_point("response_generator")
843
-
844
- # Compile the graph
845
- graph = graph_builder.compile()
846
-
847
- response = await graph.ainvoke({
848
- "messages": [HumanMessage(content="Please calculate our PNL on Apple stock. Refer to table information from documents provided.")],
849
- "category": None,
850
- })
851
-
852
- print(f"Response: {response['messages'][-1].content}")
853
-
854
- if __name__ == "__main__":
855
- asyncio.run(main())
856
- ```
857
-
858
- ## Sample Query 2:
859
- Can you add Judgment tracing to my file?
860
-
861
- ## Example of Modified Code after Query 2:
862
- ```
863
- from langchain_openai import ChatOpenAI
864
- import asyncio
865
- import os
866
-
867
- import chromadb
868
- from chromadb.utils import embedding_functions
869
-
870
- from vectordbdocs import financial_data
871
-
872
- from typing import Optional
873
- from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage, ChatMessage
874
- from typing_extensions import TypedDict
875
- from langgraph.graph import StateGraph
876
-
877
- from judgeval.common.tracer import Tracer
878
- from judgeval.integrations.langgraph import JudgevalCallbackHandler
879
- from judgeval.scorers import AnswerCorrectnessScorer, FaithfulnessScorer
880
- from judgeval.data import Example
881
-
882
-
883
-
884
- judgment = Tracer(project_name="FINANCIAL_AGENT")
885
-
886
- # Define our state type
887
- class AgentState(TypedDict):
888
- messages: list[BaseMessage]
889
- category: Optional[str]
890
- documents: Optional[str]
891
-
892
- def populate_vector_db(collection, raw_data):
893
- """
894
- Populate the vector DB with financial information.
895
- """
896
- for data in raw_data:
897
- collection.add(
898
- documents=[data['information']],
899
- metadatas=[{"category": data['category']}],
900
- ids=[f"category_{data['category'].lower().replace(' ', '_')}_{os.urandom(4).hex()}"]
901
- )
902
-
903
- # Define a ChromaDB collection for document storage
904
- client = chromadb.Client()
905
- collection = client.get_or_create_collection(
906
- name="financial_docs",
907
- embedding_function=embedding_functions.OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"))
908
- )
909
-
910
- populate_vector_db(collection, financial_data)
911
-
912
- @judgment.observe(name="pnl_retriever", span_type="retriever")
913
- def pnl_retriever(state: AgentState) -> AgentState:
914
- query = state["messages"][-1].content
915
- results = collection.query(
916
- query_texts=[query],
917
- where={"category": "pnl"},
918
- n_results=3
919
- )
920
- documents = []
921
- for document in results["documents"]:
922
- documents += document
923
-
924
- return {"messages": state["messages"], "documents": documents}
925
-
926
- @judgment.observe(name="balance_sheet_retriever", span_type="retriever")
927
- def balance_sheet_retriever(state: AgentState) -> AgentState:
928
- query = state["messages"][-1].content
929
- results = collection.query(
930
- query_texts=[query],
931
- where={"category": "balance_sheets"},
932
- n_results=3
933
- )
934
- documents = []
935
- for document in results["documents"]:
936
- documents += document
937
-
938
- return {"messages": state["messages"], "documents": documents}
939
-
940
- @judgment.observe(name="stock_retriever", span_type="retriever")
941
- def stock_retriever(state: AgentState) -> AgentState:
942
- query = state["messages"][-1].content
943
- results = collection.query(
944
- query_texts=[query],
945
- where={"category": "stocks"},
946
- n_results=3
947
- )
948
- documents = []
949
- for document in results["documents"]:
950
- documents += document
951
-
952
- return {"messages": state["messages"], "documents": documents}
953
-
954
- @judgment.observe(name="bad_classifier", span_type="llm")
955
- async def bad_classifier(state: AgentState) -> AgentState:
956
- return {"messages": state["messages"], "category": "stocks"}
957
-
958
- @judgment.observe(name="bad_classify")
959
- async def bad_classify(state: AgentState) -> AgentState:
960
- category = await bad_classifier(state)
961
-
962
- example = Example(
963
- input=state["messages"][-1].content,
964
- actual_output=category["category"],
965
- expected_output="pnl"
966
- )
967
- judgment.async_evaluate(
968
- scorers=[AnswerCorrectnessScorer(threshold=1)],
969
- example=example,
970
- model="gpt-4.1"
971
- )
972
-
973
- return {"messages": state["messages"], "category": category["category"]}
974
-
975
- @judgment.observe(name="bad_sql_generator", span_type="llm")
976
- async def bad_sql_generator(state: AgentState) -> AgentState:
977
- ACTUAL_OUTPUT = "SELECT * FROM pnl WHERE stock_symbol = 'GOOGL'"
978
-
979
- example = Example(
980
- input=state["messages"][-1].content,
981
- actual_output=ACTUAL_OUTPUT,
982
- retrieval_context=state.get("documents", []),
983
- expected_output="""
984
- SELECT
985
- SUM(CASE
986
- WHEN transaction_type = 'sell' THEN (price_per_share - (SELECT price_per_share FROM stock_transactions WHERE stock_symbol = 'GOOGL' AND transaction_type = 'buy' LIMIT 1)) * quantity
987
- ELSE 0
988
- END) AS realized_pnl
989
- FROM
990
- stock_transactions
991
- WHERE
992
- stock_symbol = 'META';
993
- """
994
- )
995
- judgment.async_evaluate(
996
- scorers=[AnswerCorrectnessScorer(threshold=1), FaithfulnessScorer(threshold=1)],
997
- example=example,
998
- model="gpt-4.1"
999
- )
1000
- return {"messages": state["messages"] + [ChatMessage(content=ACTUAL_OUTPUT, role="text2sql")]}
1001
-
1002
- # Create the classifier node with a system prompt
1003
- @judgment.observe(name="classify")
1004
- async def classify(state: AgentState) -> AgentState:
1005
- messages = state["messages"]
1006
- input_msg = [
1007
- SystemMessage(content="""You are a financial query classifier. Your job is to classify user queries into one of three categories:
1008
- - 'pnl' for Profit and Loss related queries
1009
- - 'balance_sheets' for Balance Sheet related queries
1010
- - 'stocks' for Stock market related queries
1011
-
1012
- Respond ONLY with the category name in lowercase, nothing else."""),
1013
- *messages
1014
- ]
1015
-
1016
- response = ChatOpenAI(model="gpt-4.1", temperature=0).invoke(
1017
- input=input_msg
1018
- )
1019
-
1020
- example = Example(
1021
- input=str(input_msg),
1022
- actual_output=response.content,
1023
- expected_output="pnl"
1024
- )
1025
- judgment.async_evaluate(
1026
- scorers=[AnswerCorrectnessScorer(threshold=1)],
1027
- example=example,
1028
- model="gpt-4.1"
1029
- )
1030
-
1031
- return {"messages": state["messages"], "category": response.content}
1032
-
1033
- # Add router node to direct flow based on classification
1034
- def router(state: AgentState) -> str:
1035
- return state["category"]
1036
-
1037
- @judgment.observe(name="generate_response")
1038
- async def generate_response(state: AgentState) -> AgentState:
1039
- messages = state["messages"]
1040
- documents = state.get("documents", "")
1041
-
1042
- OUTPUT = """
1043
- SELECT
1044
- stock_symbol,
1045
- SUM(CASE WHEN transaction_type = 'buy' THEN quantity ELSE -quantity END) AS total_shares,
1046
- SUM(CASE WHEN transaction_type = 'buy' THEN quantity * price_per_share ELSE -quantity * price_per_share END) AS total_cost,
1047
- MAX(CASE WHEN transaction_type = 'buy' THEN price_per_share END) AS current_market_price
1048
- FROM
1049
- stock_transactions
1050
- WHERE
1051
- stock_symbol = 'META'
1052
- GROUP BY
1053
- stock_symbol;
1054
- """
1055
-
1056
- example = Example(
1057
- input=messages[-1].content,
1058
- actual_output=OUTPUT,
1059
- retrieval_context=documents,
1060
- expected_output="""
1061
- SELECT
1062
- stock_symbol,
1063
- SUM(CASE WHEN transaction_type = 'buy' THEN quantity ELSE -quantity END) AS total_shares,
1064
- SUM(CASE WHEN transaction_type = 'buy' THEN quantity * price_per_share ELSE -quantity * price_per_share END) AS total_cost,
1065
- MAX(CASE WHEN transaction_type = 'buy' THEN price_per_share END) AS current_market_price
1066
- FROM
1067
- stock_transactions
1068
- WHERE
1069
- stock_symbol = 'META'
1070
- GROUP BY
1071
- stock_symbol;
1072
- """
1073
- )
1074
- judgment.async_evaluate(
1075
- scorers=[AnswerCorrectnessScorer(threshold=1), FaithfulnessScorer(threshold=1)],
1076
- example=example,
1077
- model="gpt-4.1"
1078
- )
1079
-
1080
- return {"messages": messages + [ChatMessage(content=OUTPUT, role="text2sql")], "documents": documents}
1081
-
1082
- async def main():
1083
- with judgment.trace(
1084
- "run_1",
1085
- project_name="FINANCIAL_AGENT",
1086
- overwrite=True
1087
- ) as trace:
1088
-
1089
- # Initialize the graph
1090
- graph_builder = StateGraph(AgentState)
1091
-
1092
- # Add classifier node
1093
- # For failure test, pass in bad_classifier
1094
- graph_builder.add_node("classifier", classify)
1095
- # graph_builder.add_node("classifier", bad_classify)
1096
-
1097
- # Add conditional edges based on classification
1098
- graph_builder.add_conditional_edges(
1099
- "classifier",
1100
- router,
1101
- {
1102
- "pnl": "pnl_retriever",
1103
- "balance_sheets": "balance_sheet_retriever",
1104
- "stocks": "stock_retriever"
1105
- }
1106
- )
1107
-
1108
- # Add retriever nodes (placeholder functions for now)
1109
- graph_builder.add_node("pnl_retriever", pnl_retriever)
1110
- graph_builder.add_node("balance_sheet_retriever", balance_sheet_retriever)
1111
- graph_builder.add_node("stock_retriever", stock_retriever)
1112
-
1113
- # Add edges from retrievers to response generator
1114
- graph_builder.add_node("response_generator", generate_response)
1115
- # graph_builder.add_node("response_generator", bad_sql_generator)
1116
- graph_builder.add_edge("pnl_retriever", "response_generator")
1117
- graph_builder.add_edge("balance_sheet_retriever", "response_generator")
1118
- graph_builder.add_edge("stock_retriever", "response_generator")
1119
-
1120
- graph_builder.set_entry_point("classifier")
1121
- graph_builder.set_finish_point("response_generator")
1122
-
1123
- # Compile the graph
1124
- graph = graph_builder.compile()
1125
-
1126
- handler = JudgevalCallbackHandler(trace)
1127
-
1128
- response = await graph.ainvoke({
1129
- "messages": [HumanMessage(content="Please calculate our PNL on Apple stock. Refer to table information from documents provided.")],
1130
- "category": None,
1131
- }, config=dict(callbacks=[handler]))
1132
- trace.save()
1133
-
1134
- print(f"Response: {response['messages'][-1].content}")
1135
-
1136
- if __name__ == "__main__":
1137
- asyncio.run(main())
1138
- ```
1139
-
1140
-
1141
- # Official Judgment Documentation
1142
-
1143
-
1144
- <div align="center">
1145
-
1146
- <img src="assets/logo-light.svg#gh-light-mode-only" alt="Judgment Logo" width="400" />
1147
- <img src="assets/logo-dark.svg#gh-dark-mode-only" alt="Judgment Logo" width="400" />
1148
-
1149
- **Build monitoring & evaluation pipelines for complex agents**
1150
-
1151
- <img src="assets/experiments_page.png" alt="Judgment Platform Experiments Page" width="800" />
1152
-
1153
- <br>
1154
-
1155
- ## @🌐 Landing Page • @Twitter/X • @💼 LinkedIn • @📚 Docs • @🚀 Demos • @🎮 Discord
1156
- </div>
1157
-
1158
- ## Judgeval: open-source testing, monitoring, and optimization for AI agents
1159
-
1160
- Judgeval offers robust tooling for evaluating and tracing LLM agent systems. It is dev-friendly and open-source (licensed under Apache 2.0).
1161
-
1162
- Judgeval gets you started in five minutes, after which you'll be ready to use all of its features as your agent becomes more complex. Judgeval is natively connected to the @Judgment Platform for free, and you can export your data and self-host at any time.
1163
-
1164
- We support tracing agents built with LangGraph, OpenAI SDK, Anthropic, ... and allow custom eval integrations for any use case. Check out our quickstarts below or our @setup guide to get started.
1165
-
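- For LangGraph specifically, a rough sketch of attaching Judgeval to a graph run (condensed from the financial-agent sample earlier in this rule file; the graph and its initial state are assumed to exist already):
-
- ```python
- from judgeval.common.tracer import Tracer
- from judgeval.integrations.langgraph import JudgevalCallbackHandler
-
- judgment = Tracer(project_name="FINANCIAL_AGENT")  # project name is illustrative
-
- async def run_graph(graph, initial_state):
-     # Open a trace, hand its callback handler to LangGraph, then save the trace.
-     with judgment.trace("run_1", project_name="FINANCIAL_AGENT", overwrite=True) as trace:
-         handler = JudgevalCallbackHandler(trace)
-         result = await graph.ainvoke(initial_state, config=dict(callbacks=[handler]))
-         trace.save()
-     return result
- ```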
1166
- Judgeval is created and maintained by @Judgment Labs.
1167
-
1168
- ## 📋 Table of Contents
1169
- * @✨ Features
1170
- * @🔍 Tracing
1171
- * @🧪 Evals
1172
- * @📡 Monitoring
1173
- * @📊 Datasets
1174
- * @💡 Insights
1175
- * @🛠️ Installation
1176
- * @🏁 Get Started
1177
- * @🏢 Self-Hosting
1178
- * @📚 Cookbooks
1179
- * @⭐ Star Us on GitHub
1180
- * @❤️ Contributors
1181
-
1182
- <!-- Created by https://github.com/ekalinin/github-markdown-toc -->
1183
-
1184
-
1185
- ## ✨ Features
1186
-
1187
- | | |
1188
- |:---|:---:|
1189
- | <h3>🔍 Tracing</h3>Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic): **tracking inputs/outputs, agent tool calls, latency, and cost** at every step.<br><br>Online evals can be applied to traces to measure quality on production data in real-time.<br><br>**Useful for:**<br>• 🐛 Debugging agent runs <br>• 📋 Collecting agent environment data <br>• 🔬 Pinpointing performance bottlenecks| <p align="center"><img src="assets/trace_screenshot.png" alt="Tracing visualization" width="1200"/></p> |
1190
- | <h3>🧪 Evals</h3>15+ research-backed metrics including tool call accuracy, hallucinations, instruction adherence, and retrieval context recall.<br><br>Build custom evaluators that connect with our metric-tracking infrastructure. <br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 Experimental prompt testing<br>• 🛡️ Online guardrails <br><br> | <p align="center"><img src="assets/experiments_page.png" alt="Evaluation metrics" width="800"/></p> |
1191
- | <h3>📡 Monitoring</h3>Real-time performance tracking of your agents in production environments. **Track all your metrics in one place.**<br><br>Set up **Slack/email alerts** for critical metrics and receive notifications when thresholds are exceeded.<br><br> **Useful for:** <br>•📉 Identifying degradation early <br>•📈 Visualizing performance trends across versions and time | <p align="center"><img src="assets/monitoring_screenshot.png" alt="Monitoring Dashboard" width="1200"/></p> |
1192
- | <h3>📊 Datasets</h3>Export trace data or import external testcases to datasets for scaled unit testing and structured experiments. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations. <br><br> **Useful for:**<br>• 🗃️ Filtered agent runtime data for fine tuning<br>• 🔄 Scaled analysis for A/B tests | <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |
1193
- | <h3>💡 Insights</h3>Cluster on your data to reveal common use cases and failure modes.<br><br>Trace failures to their exact source with Judgment's Osiris agent, which localizes errors to specific components for precise fixes.<br><br> **Useful for:**<br>•🔮 Surfacing common inputs that lead to error<br>•🤖 Investigating agent/user behavior for optimization <br>| <p align="center"><img src="assets/dataset_clustering_screenshot_dm.png" alt="Insights dashboard" width="1200"/></p> |
1194
-
1195
- ## 🛠️ Installation
1196
-
1197
- Get started with Judgeval by installing our SDK using pip:
1198
-
1199
- ```bash
1200
- pip install judgeval
1201
- ```
1202
-
1203
- Ensure you have your `JUDGMENT_API_KEY` and `JUDGMENT_ORG_ID` environment variables set to connect to the @Judgment platform.
1204
-
1205
- **If you don't have keys, @create an account on the platform!**
1206
-
1207
- ## 🏁 Get Started
1208
-
1209
- Here's how you can quickly start using Judgeval:
1210
-
1211
- ### 🛰️ Tracing
1212
-
1213
- Track your agent execution with full observability in just a few lines of code.
1214
- Create a file named `traces.py` with the following code:
1215
-
1216
- ```python
1217
- from judgeval.tracer import Tracer, wrap
1218
- from openai import OpenAI
1219
-
1220
- client = wrap(OpenAI()) # tracks all LLM calls
1221
- judgment = Tracer(project_name="my_project")
1222
-
1223
- @judgment.observe(span_type="tool")
1224
- def format_question(question: str) -> str:
1225
- # dummy tool
1226
- return f"Question : {question}"
1227
-
1228
- @judgment.observe(span_type="function")
1229
- def run_agent(prompt: str) -> str:
1230
- task = format_question(prompt)
1231
- response = client.chat.completions.create(
1232
- model="gpt-4.1",
1233
- messages=[{"role": "user", "content": task}]
1234
- )
1235
- return response.choices[0].message.content
1236
-
1237
- run_agent("What is the capital of the United States?")
1238
- ```
1239
-
1240
- @Click here for a more detailed explanation.
1241
-
1242
- ### 📝 Offline Evaluations
1243
-
1244
- You can evaluate your agent's execution to measure quality metrics such as hallucination.
1245
- Create a file named `evaluate.py` with the following code:
1246
-
1247
- ```python evaluate.py
1248
- from judgeval import JudgmentClient
1249
- from judgeval.data import Example
1250
- from judgeval.scorers import FaithfulnessScorer
1251
-
1252
- client = JudgmentClient()
1253
-
1254
- example = Example(
1255
- input="What if these shoes don't fit?",
1256
- actual_output="We offer a 30-day full refund at no extra cost.",
1257
- retrieval_context=["All customers are eligible for a 30 day full refund at no extra cost."],
1258
- )
1259
-
1260
- scorer = FaithfulnessScorer(threshold=0.5)
1261
- results = client.run_evaluation(
1262
- examples=[example],
1263
- scorers=[scorer],
1264
- model="gpt-4.1",
1265
- )
1266
- print(results)
1267
- ```
1268
-
1269
- @Click here for a more detailed explanation.
1270
-
1271
- ### 📡 Online Evaluations
1272
-
1273
- Attach performance monitoring to your traces to measure the quality of your systems in production.
1274
-
1275
- Using the same `traces.py` file we created earlier, modify the `main` function:
1276
-
1277
- ```python
1278
- from judgeval.common.tracer import Tracer, wrap
1279
- from judgeval.scorers import AnswerRelevancyScorer
1280
- from openai import OpenAI
1281
-
1282
- client = wrap(OpenAI())
1283
- judgment = Tracer(project_name="my_project")
1284
-
1285
- @judgment.observe(span_type="tool")
1286
- def my_tool():
1287
- return "Hello world!"
1288
-
1289
- @judgment.observe(span_type="function")
1290
- def main():
1291
- task_input = my_tool()
1292
- res = client.chat.completions.create(
1293
- model="gpt-4.1",
1294
- messages=[{"role": "user", "content": f"{task_input}"}]
1295
- ).choices[0].message.content
1296
-
1297
- judgment.async_evaluate(
1298
- scorers=[AnswerRelevancyScorer(threshold=0.5)],
1299
- input=task_input,
1300
- actual_output=res,
1301
- model="gpt-4.1"
1302
- )
1303
- print("Online evaluation submitted.")
1304
- return res
1305
-
1306
- main()
1307
- ```
1308
-
1309
- @Click here for a more detailed explanation.
1310
-
1311
- ## 🏢 Self-Hosting
1312
-
1313
- Run Judgment on your own infrastructure: we provide comprehensive self-hosting capabilities that give you full control over the backend and data plane that Judgeval interfaces with.
1314
-
1315
- ### Key Features
1316
- * Deploy Judgment on your own AWS account
1317
- * Store data in your own Supabase instance
1318
- * Access Judgment through your own custom domain
1319
-
1320
- ### Getting Started
1321
- 1. Check out our @self-hosting documentation for detailed setup instructions, including how to access your self-hosted instance
1322
- 2. Use the @Judgment CLI to deploy your self-hosted environment
1323
- 3. After your self-hosted instance is set up, make sure the `JUDGMENT_API_URL` environment variable is set to your self-hosted backend endpoint
1324
-
1325
- ## 📚 Cookbooks
1326
-
1327
- Have your own? We're happy to feature it if you create a PR or message us on @Discord.
1328
-
1329
- You can access our repo of cookbooks @here. Here are some highlights:
1330
-
1331
- ### Sample Agents
1332
-
1333
- #### 💰 @LangGraph Financial QA Agent
1334
- A LangGraph-based agent for financial queries, featuring RAG capabilities with a vector database for contextual data retrieval and evaluation of its reasoning and data accuracy.
1335
-
1336
- #### ✈️ @OpenAI Travel Agent
1337
- A travel planning agent using OpenAI API calls, custom tool functions, and RAG with a vector database for up-to-date and contextual travel information. Evaluated for itinerary quality and information relevance.
1338
-
1339
- ### Custom Evaluators
1340
-
1341
- #### 🔍 @PII Detection
1342
- Detecting and evaluating Personal Identifiable Information (PII) leakage.
1343
-
1344
- #### 📧 @Cold Email Generation
1345
-
1346
- Evaluates if a cold email generator properly utilizes all relevant information about the target recipient.
1347
-
1348
- ## ⭐ Star Us on GitHub
1349
-
1350
- If you find Judgeval useful, please consider giving us a star on GitHub! Your support helps us grow our community and continue improving the product.
1351
-
1352
-
1353
- ## ❤️ Contributors
1354
-
1355
- There are many ways to contribute to Judgeval:
1356
-
1357
- - Submit @bug reports and @feature requests
1358
- - Review the documentation and submit @Pull Requests to improve it
1359
- Speak or write about Judgment and let us know!
1360
-
1361
- <!-- Contributors collage -->
1362
- @![Contributors](https://github.com/JudgmentLabs/judgeval/graphs/contributors)
1363
-
1364
- ````
1365
- </details>
1366
-
1367
- ## ⭐ Star Us on GitHub
1368
-
1369
- If you find Judgeval useful, please consider giving us a star on GitHub! Your support helps us grow our community and continue improving the repository.
1370
-
1371
- ## ❤️ Contributors
1372
-
1373
- There are many ways to contribute to Judgeval:
1374
-
1375
- - Submit [bug reports](https://github.com/JudgmentLabs/judgeval/issues) and [feature requests](https://github.com/JudgmentLabs/judgeval/issues)
1376
- - Review the documentation and submit [Pull Requests](https://github.com/JudgmentLabs/judgeval/pulls) to improve it
1377
- Speak or write about Judgment and let us know!
1378
-
1379
- <!-- Contributors collage -->
1380
- [![Contributors](https://contributors-img.web.app/image?repo=JudgmentLabs/judgeval)](https://github.com/JudgmentLabs/judgeval/graphs/contributors)
1381
-
1382
- ---
1383
-
1384
- Judgeval is created and maintained by [Judgment Labs](https://judgmentlabs.ai/).