judgeval 0.0.11__tar.gz → 0.0.13__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {judgeval-0.0.11 → judgeval-0.0.13}/PKG-INFO +1 -1
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/api_reference/trace.mdx +11 -4
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/getting_started.mdx +43 -20
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/monitoring/tracing.mdx +87 -62
- {judgeval-0.0.11 → judgeval-0.0.13}/pyproject.toml +1 -1
- judgeval-0.0.13/src/demo/cookbooks/new_bot/basic_bot.py +106 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/cookbooks/openai_travel_agent/agent.py +27 -11
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/cookbooks/openai_travel_agent/tools.py +1 -1
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/customer_use/cstone/faithfulness_testing.py +4 -4
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/common/tracer.py +87 -28
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/constants.py +2 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/data/datasets/dataset.py +2 -1
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/data/datasets/eval_dataset_client.py +106 -9
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/data/example.py +13 -5
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/judgment_client.py +29 -6
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/run_evaluation.py +16 -5
- {judgeval-0.0.11 → judgeval-0.0.13}/.github/workflows/ci.yaml +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/.gitignore +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/LICENSE.md +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/Pipfile +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/Pipfile.lock +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/README.md +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/README.md +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/api_reference/judgment_client.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/development.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/essentials/code.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/essentials/images.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/essentials/markdown.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/essentials/navigation.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/essentials/reusable-snippets.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/essentials/settings.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/data_datasets.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/data_examples.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/introduction.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/judges.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/scorers/answer_correctness.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/scorers/answer_relevancy.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/scorers/classifier_scorer.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/scorers/contextual_precision.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/scorers/contextual_recall.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/scorers/contextual_relevancy.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/scorers/custom_scorers.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/scorers/faithfulness.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/scorers/hallucination.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/scorers/introduction.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/scorers/json_correctness.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/scorers/summarization.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/scorers/tool_correctness.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/evaluation/unit_testing.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/favicon.svg +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/images/basic_trace_example.png +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/images/checks-passed.png +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/images/create_aggressive_scorer.png +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/images/create_scorer.png +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/images/evaluation_diagram.png +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/images/hero-dark.svg +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/images/hero-light.svg +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/images/trace_screenshot.png +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/introduction.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/judgment/introduction.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/logo/dark.svg +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/logo/light.svg +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/mint.json +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/monitoring/introduction.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/monitoring/production_insights.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/notebooks/create_dataset.ipynb +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/notebooks/create_scorer.ipynb +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/notebooks/demo.ipynb +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/notebooks/prompt_scorer.ipynb +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/notebooks/quickstart.ipynb +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/quickstart.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/docs/snippets/snippet-intro.mdx +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/pytest.ini +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/cookbooks/ci_testing/ci_testing.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/cookbooks/ci_testing/travel_response.txt +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/cookbooks/custom_scorers/competitor_mentions.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/cookbooks/custom_scorers/text2sql.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/cookbooks/langchain_basic_rag/basic_agentic_rag.ipynb +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/cookbooks/langchain_basic_rag/tesla_q3.pdf +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/cookbooks/langchain_sales/example_product_price_id_mapping.json +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/cookbooks/langchain_sales/sales_agent_with_context.ipynb +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/cookbooks/langchain_sales/sample_product_catalog.txt +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/cookbooks/openai_travel_agent/populate_db.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/customer_use/cstone/basic_test.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/customer_use/cstone/cstone_data.csv +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/customer_use/cstone/data.csv +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/customer_use/cstone/galen_data.csv +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/customer_use/cstone/playground.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/demo/customer_use/cstone/results.csv +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/clients.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/common/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/common/exceptions.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/common/logger.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/common/utils.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/data/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/data/api_example.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/data/datasets/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/data/datasets/ground_truth.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/data/datasets/utils.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/data/result.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/data/scorer_data.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/evaluation_run.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/judges/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/judges/base_judge.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/judges/litellm_judge.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/judges/mixture_of_judges.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/judges/together_judge.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/judges/utils.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/api_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/base_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/exceptions.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/api_scorers/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_correctness.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_relevancy.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_precision.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_recall.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_relevancy.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/api_scorers/faithfulness.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/api_scorers/hallucination.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/api_scorers/json_correctness.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/api_scorers/summarization.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/api_scorers/tool_correctness.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/classifiers/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/classifiers/text2sql/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/classifiers/text2sql/text2sql_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/answer_correctness_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_correctness/prompts.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/answer_relevancy_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/answer_relevancy/prompts.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/contextual_precision_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_precision/prompts.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/contextual_recall_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_recall/prompts.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/contextual_relevancy_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/contextual_relevancy/prompts.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/faithfulness_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/faithfulness/prompts.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/hallucination_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/hallucination/prompts.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/json_correctness/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/json_correctness/json_correctness_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/prompts.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/summarization/summarization_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/tool_correctness/__init__.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/judgeval_scorers/local_implementations/tool_correctness/tool_correctness_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/prompt_scorer.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/score.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/scorers/utils.py +0 -0
- {judgeval-0.0.11 → judgeval-0.0.13}/src/judgeval/tracer/__init__.py +0 -0
--- a/docs/api_reference/trace.mdx
+++ b/docs/api_reference/trace.mdx
@@ -12,27 +12,34 @@ The `Tracer` class is used to trace the execution of your LLM system.
 ```python
 from judgeval.common.tracer import Tracer
 
-tracer = Tracer()
+tracer = Tracer(project_name="my_project")
 ```
 
 <Note>
-The `Tracer` class is a singleton, so you only need to initialize it once in your application.
+The `Tracer` class is a singleton, so you only need to initialize it once in your application.
+The `project_name` enables you to group traces by workflow, keeping all your evaluations and
+observability tooling in one place.
 </Note>
 
-##
+## Explicitly exporting traces
 
 When using the `.trace()` context manager, you can control how your traces are exported to the Judgment platform by
 providing the `project_name` argument. This allows you to group traces by workflow, keeping all your evaluations and
 observability tooling in one place.
 
 ```python
-with tracer.trace(
+with tracer.trace(
+    name="my_workflow",
+    project_name="my_project",
+    overwrite=True
+) as trace:
     ...
 ```
 
 `.trace()` has the following args:
 - `name`: The name of the trace. Can be made unique to each workflow run by using a timestamp or other unique identifier.
 - `project_name`: The name of the project to use for the trace. Used to group traces by workflow.
+- `overwrite`: Whether to overwrite the trace with the same `name` if it already exists.
 
 The `trace()` context manager yields a `TraceClient` object.
 
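For context, a minimal sketch of the 0.0.13 `.trace()` call using only the args documented in this hunk; the timestamped name follows the docs' own suggestion for keeping each run unique:

```python
from datetime import datetime
from judgeval.common.tracer import Tracer

tracer = Tracer(project_name="my_project")

# A timestamp makes each workflow run's trace name unique, per the `name` arg above.
with tracer.trace(
    name=f"my_workflow-{datetime.now().isoformat()}",
    project_name="my_project",
    overwrite=True,  # replace any existing trace with the same name
) as trace:
    ...  # @observe-decorated functions and wrapped LLM calls run here
```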
--- a/docs/getting_started.mdx
+++ b/docs/getting_started.mdx
@@ -32,7 +32,7 @@ large-scale evaluations. [Contact us](mailto:contact@judgmentlabs.ai) if you're
 sensitive data that has to reside in your private VPCs.
 </Note>
 
-# Create
+# Create Your First Evaluation
 
 ```python sample_eval.py
 from judgeval import JudgmentClient
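The `sample_eval.py` snippet is cut off in this rendering; a hypothetical sketch of the pattern, assembled from calls that appear elsewhere in this diff (`run_evaluation` and `FaithfulnessScorer` from `faithfulness_testing.py`, `Example` from the data module). The `Example` field names and import path are assumptions:

```python
from judgeval import JudgmentClient
from judgeval.data import Example  # assumed path (src/judgeval/data/example.py)
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()
example = Example(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    retrieval_context=["Paris is the capital and most populous city of France."],
)
# kwargs mirror src/demo/customer_use/cstone/faithfulness_testing.py below
results = client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=1.0)],
    model="gpt-4o",
    project_name="my_project",
)
```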
@@ -68,6 +68,48 @@ is a scorer that checks if the output is hallucinated relative to the retrieved
 To learn more about using the Judgment Client to run evaluations, click [here](/api_reference/judgment_client).
 </Tip>
 
+# Create Your First Trace
+
+Beyond experimentation, `judgeval` supports monitoring your LLM systems in **production**.
+Using our `tracing` module, you can **track your LLM system outputs from end to end**, allowing you to visualize the flow of your LLM system.
+Additionally, you can **enable evaluations to run in real-time** using Judgment's state-of-the-art judge models.
+
+```python trace_example.py
+from judgeval.common.tracer import Tracer, wrap
+from openai import OpenAI
+
+client = wrap(OpenAI())
+judgment = Tracer(project_name="my_project")
+
+@judgment.observe(span_type="tool")
+def my_tool():
+    return "Hello world!"
+
+@judgment.observe(span_type="function")
+def main():
+    res = client.chat.completions.create(
+        model="gpt-4o",
+        messages=[{"role": "user", "content": f"{my_tool()}"}]
+    )
+    return res.choices[0].message.content
+```
+
+
+<div style={{display: 'flex', justifyContent: 'center'}}>
+![Alt text](/images/basic_trace_example.png)
+</div>
+
+There are many benefits of monitoring your LLM systems in production with `judgeval`, including:
+- Detecting hallucinations and other quality issues **before they reach your customers**
+- Automatically creating experimental datasets from your **real-world production cases** for future improvement/optimization
+- Track and create alerts on **any metric** (e.g. latency, cost, hallucination, etc.)
+
+<Tip>
+To learn more about `judgeval`'s tracing module, click [here](/tracing/introduction).
+</Tip>
+
+
+
 # Create Your First Scorer
 `judgeval` offers three kinds of LLM scorers for your evaluation needs: ready-made, classifier scorers, and custom scorers.
 
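To exercise the `trace_example.py` snippet above end to end, something like the following should suffice, assuming `OPENAI_API_KEY` and `JUDGMENT_API_KEY` are set in the environment:

```python
if __name__ == "__main__":
    # main() is the @observe-decorated entrypoint from trace_example.py above;
    # the resulting trace is exported to the "my_project" project.
    print(main())
```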
@@ -264,22 +306,3 @@ A `Project` keeps track of `Evaluation Run`s in your project. Each `Evaluation R
 You can try different models (e.g. `gpt-4o`, `claude-3-5-sonnet`, etc.) and prompt templates in each `Evaluation Run` to find the
 optimal setup for your LLM system.
 </Tip>
-
-## Monitoring LLM Systems in Production
-
-Beyond experimenting and measuring historical performance, `judgeval` supports monitoring your LLM systems in **production**.
-Using our `tracing` module, you can **track your LLM system outputs from end to end**, allowing you to visualize the flow of your LLM system.
-Additionally, you can **enable evaluations to run in real-time** using Judgment's state-of-the-art judge models.
-
-<div style={{display: 'flex', justifyContent: 'center'}}>
-![Alt text](/images/trace_screenshot.png)
-</div>
-
-There are many benefits of monitoring your LLM systems in production with `judgeval`, including:
-- Detecting hallucinations and other quality issues **before they reach your customers**
-- Automatically creating experimental datasets from your **real-world production cases** for future improvement/optimization
-- Track and create alerts on **any metric** (e.g. latency, cost, hallucination, etc.)
-
-<Tip>
-To learn more about `judgeval`'s tracing module, click [here](/tracing/introduction).
-</Tip>
--- a/docs/monitoring/tracing.mdx
+++ b/docs/monitoring/tracing.mdx
@@ -18,24 +18,25 @@ Using tracing, you can:
 
 ## Tracing Your Workflow ##
 
-Setting up tracing with `judgeval` takes
+Setting up tracing with `judgeval` takes two simple steps:
 
-### 1. Initialize a tracer with your API key
+### 1. Initialize a tracer with your API key and project name
 
 ```python
 from judgeval.common.tracer import Tracer
 
-judgment = Tracer() # loads from JUDGMENT_API_KEY env var
+judgment = Tracer(project_name="my_project") # loads from JUDGMENT_API_KEY env var
 ```
 
 <Note>
-The [Judgment tracer](/api_reference/trace) is a singleton object that should be shared across your application.
+The [Judgment tracer](/api_reference/trace) is a singleton object that should be shared across your application.
+Your project name will be used to organize your traces in one place on the Judgment platform.
 </Note>
 
 
 ### 2. Wrap your workflow components
 
-`judgeval` provides
+`judgeval` provides wrapping mechanisms for your workflow components:
 
 #### `wrap()` ####
 The `wrap()` function goes over your LLM client (e.g. OpenAI, Anthropic, etc.) and captures metadata surrounding your LLM calls, such as:
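The comment above notes that the key loads from the `JUDGMENT_API_KEY` env var; the demo scripts later in this diff (`basic_bot.py`, `agent.py`) also pass it explicitly, which reads as:

```python
import os
from judgeval.common.tracer import Tracer

# Explicit form used in the demo code below; equivalent to relying on
# the JUDGMENT_API_KEY environment variable.
judgment = Tracer(api_key=os.getenv("JUDGMENT_API_KEY"), project_name="my_project")
```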
@@ -44,6 +45,14 @@ The `wrap()` function goes over your LLM client (e.g. OpenAI, Anthropic, etc.) a
 - Prompt/Completion
 - Model name
 
+Here's an example of using `wrap()` on an OpenAI client:
+```python
+from openai import OpenAI
+from judgeval.common.tracer import wrap
+
+client = wrap(OpenAI())
+```
+
 #### `@observe` ####
 The `@observe` decorator wraps your functions/tools and captures metadata surrounding your function calls, such as:
 - Latency
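The wrapped client is used exactly like the unwrapped one; a minimal sketch of a call whose prompt/completion and model name get captured:

```python
# Calls on the wrapped client are unchanged; judgeval records the
# prompt/completion pair and model name listed above.
res = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(res.choices[0].message.content)
```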
@@ -63,30 +72,20 @@ def my_tool():
 ```
 
 <Note>
-
-on
+`span_type` is a string that you can use to categorize and organize your trace spans.
+Span types are displayed on the trace UI to easily navigate a visualization of your workflow.
+Common span types include `tool`, `function`, `retriever`, `database`, `web search`, etc.
 </Note>
 
-#### `context manager` ####
-
-In your main function (e.g. the one that executes the primary workflow logic), you can use the `with judgment.trace()` context manager to trace the entire workflow.
-
-The context manager can **save/print the state of the trace at any point in the workflow**.
-This is useful for debugging or exporting any state of your workflow to run an evaluation from!
-
-<Tip>
-The `with judgment.trace()` context manager detects any `@observe` decorated functions or wrapped LLM calls within the context and automatically captures their metadata.
-</Tip>
-
 
 #### Putting it all Together
-Here's a complete example of using
+Here's a complete example of using judgeval's tracing mechanisms:
 ```python
 from judgeval.common.tracer import Tracer, wrap
 from openai import OpenAI
 
 openai_client = wrap(OpenAI())
-judgment = Tracer() # loads from JUDGMENT_API_KEY env var
+judgment = Tracer(project_name="my_project") # loads from JUDGMENT_API_KEY env var
 
 @judgment.observe(span_type="tool")
 def my_tool():
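A sketch of `span_type` in practice, using one of the common types named in the Note above; the retriever body is hypothetical:

```python
@judgment.observe(span_type="retriever")
def fetch_context(query: str) -> list[str]:
    # Hypothetical retrieval step; the "retriever" span type groups it
    # under that label in the trace UI.
    return [f"Document relevant to: {query}"]
```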
@@ -101,28 +100,10 @@ def my_llm_call():
     )
     return res.choices[0].message.content
 
+@judgment.observe(span_type="function")
 def main():
-
-
-        project_name="my_project"
-    ) as trace:
-        res = my_llm_call()
-        trace.save()
-        trace.print()
-        return res
-```
-
-The printed trace appears as follows on the terminal:
-```
-→ main_workflow (trace: main_workflow)
-  → my_llm_call (trace: my_llm_call)
-    Input: {'args': [], 'kwargs': {}}
-    → my_tool (trace: my_tool)
-      Input: {'args': [], 'kwargs': {}}
-      Output: Hello world!
-    ← my_tool (0.000s)
-  Output: Hello! How can I assist you today?
-← my_llm_call (0.789s)
+    res = my_llm_call()
+    return res
 ```
 
 And the trace will appear on the Judgment platform as follows:
@@ -142,32 +123,27 @@ To execute an asynchronous evaluation, you can use the `trace.async_evaluate()`
 
 ```python
 from judgeval.common.tracer import Tracer
-from judgeval.scorers import
+from judgeval.scorers import AnswerRelevancyScorer
 
-judgment = Tracer()
+judgment = Tracer(project_name="my_project")
 
+@judgment.observe(span_type="function")
 def main():
-
-
-
-    )
-
-
-
-
-
-
-        actual_output=res,
-        retrieval_context=[retrieved_info],
-        model="gpt-4o-mini",
-    )
-    return res
+    query = "What is the capital of France?"
+    res = ... # Your workflow logic
+
+    judgment.get_current_trace().async_evaluate(
+        scorers=[AnswerRelevancyScorer(threshold=0.5)],
+        input="",
+        actual_output=res,
+        model="gpt-4o",
+    )
+    return res
 ```
 
 <Tip>
-
-
-for more information.
+Your async evaluations will be logged to the Judgment platform as part of the original trace and
+a new evaluation will be created on the Judgment platform.
 </Tip>
 
 ## Example: OpenAI Travel Agent
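A variant with retrieval context, mirroring how the travel agent demo later in this diff calls `async_evaluate` (function and values hypothetical):

```python
from judgeval.scorers import FaithfulnessScorer

@judgment.observe(span_type="function")
def answer(question: str, retrieved_docs: list[str]) -> str:
    res = ...  # your generation logic (hypothetical)
    # Mirrors create_travel_plan() in agent.py below: faithfulness is scored
    # against the retrieval context attached to this span's trace.
    judgment.get_current_trace().async_evaluate(
        scorers=[FaithfulnessScorer(threshold=0.5)],
        input=question,
        actual_output=res,
        retrieval_context=retrieved_docs,
        model="gpt-4",
    )
    return res
```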
@@ -183,4 +159,53 @@ In this video, we'll walk through all of the topics covered in this guide by tra
 allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share"
 referrerpolicy="strict-origin-when-cross-origin"
 allowfullscreen
-></iframe>
+></iframe>
+
+
+## Advanced: Customizing Traces Using the Context Manager ##
+
+If you need to customize your tracing context, you can use the `with judgment.trace()` context manager.
+
+The context manager can **save/print the state of the trace at any point in the workflow**.
+This is useful for debugging or exporting any state of your workflow to run an evaluation from!
+
+<Tip>
+The `with judgment.trace()` context manager detects any `@observe` decorated functions or wrapped LLM calls within the context and automatically captures their metadata.
+</Tip>
+
+Here's an example of using the context manager to trace a workflow:
+```python
+from judgeval.common.tracer import Tracer, wrap
+from openai import OpenAI
+
+judgment = Tracer(project_name="my_project")
+client = wrap(OpenAI())
+
+@judgment.observe(span_type="tool")
+def my_tool():
+    return "Hello world!"
+
+def main():
+    with judgment.trace(name="my_workflow") as trace:
+        res = client.chat.completions.create(
+            model="gpt-4o",
+            messages=[{"role": "user", "content": f"{my_tool()}"}]
+        )
+
+        trace.print() # prints the state of the trace to console
+        trace.save() # saves the current state of the trace to the Judgment platform
+
+        return res.choices[0].message.content
+```
+
+<Warning>
+The `with judgment.trace()` context manager should only be used if you need to customize the context
+over which you're tracing. In most cases, you should trace using the `@observe` decorator.
+</Warning>
+
+
+
+
+
+
--- /dev/null
+++ b/src/demo/cookbooks/new_bot/basic_bot.py
@@ -0,0 +1,106 @@
+import os
+import asyncio
+from typing import Dict, List
+from openai import OpenAI
+from uuid import uuid4
+from dotenv import load_dotenv
+
+from judgeval.tracer import Tracer, wrap
+from judgeval.scorers import AnswerRelevancyScorer, FaithfulnessScorer
+
+# Initialize clients
+load_dotenv()
+judgment = Tracer(api_key=os.getenv("JUDGMENT_API_KEY"), project_name="restaurant_bot")
+client = wrap(OpenAI())
+
+@judgment.observe(span_type="Research")
+async def search_restaurants(cuisine: str, location: str = "nearby") -> List[Dict]:
+    """Search for restaurants matching the cuisine type."""
+    # Simulate API call to restaurant database
+    prompt = f"Find 3 popular {cuisine} restaurants {location}. Return ONLY a JSON array of objects with 'name', 'rating', and 'price_range' fields. No other text."
+
+    response = client.chat.completions.create(
+        model="gpt-4",
+        messages=[
+            {"role": "system", "content": """You are a restaurant search expert.
+            Return ONLY valid JSON arrays containing restaurant objects.
+            Example format: [{"name": "Restaurant Name", "rating": 4.5, "price_range": "$$"}]
+            Do not include any other text or explanations."""},
+            {"role": "user", "content": prompt}
+        ]
+    )
+
+    try:
+        import json
+        return json.loads(response.choices[0].message.content)
+    except json.JSONDecodeError as e:
+        print(f"Error parsing JSON response: {response.choices[0].message.content}")
+        return [{"name": "Error fetching restaurants", "rating": 0, "price_range": "N/A"}]
+
+@judgment.observe(span_type="Research")
+async def get_menu_highlights(restaurant_name: str) -> List[str]:
+    """Get popular menu items for a restaurant."""
+    prompt = f"What are 3 must-try dishes at {restaurant_name}?"
+
+    response = client.chat.completions.create(
+        model="gpt-4",
+        messages=[
+            {"role": "system", "content": "You are a food critic. List only the dish names."},
+            {"role": "user", "content": prompt}
+        ]
+    )
+
+    judgment.get_current_trace().async_evaluate(
+        scorers=[AnswerRelevancyScorer(threshold=0.5)],
+        input=prompt,
+        actual_output=response.choices[0].message.content,
+        model="gpt-4",
+    )
+
+    return response.choices[0].message.content.split("\n")
+
+@judgment.observe(span_type="function")
+async def generate_recommendation(cuisine: str, restaurants: List[Dict], menu_items: Dict[str, List[str]]) -> str:
+    """Generate a natural language recommendation."""
+    context = f"""
+    Cuisine: {cuisine}
+    Restaurants: {restaurants}
+    Popular Items: {menu_items}
+    """
+
+    response = client.chat.completions.create(
+        model="gpt-4",
+        messages=[
+            {"role": "system", "content": "You are a helpful food recommendation bot. Provide a natural recommendation based on the data."},
+            {"role": "user", "content": context}
+        ]
+    )
+    return response.choices[0].message.content
+
+@judgment.observe(span_type="Research")
+async def get_food_recommendations(cuisine: str) -> str:
+    """Main function to get restaurant recommendations."""
+    # Search for restaurants
+    restaurants = await search_restaurants(cuisine)
+
+    # Get menu highlights for each restaurant
+    menu_items = {}
+    for restaurant in restaurants:
+        menu_items[restaurant['name']] = await get_menu_highlights(restaurant['name'])
+
+    # Generate final recommendation
+    recommendation = await generate_recommendation(cuisine, restaurants, menu_items)
+    judgment.get_current_trace().async_evaluate(
+        scorers=[AnswerRelevancyScorer(threshold=0.5), FaithfulnessScorer(threshold=1.0)],
+        input=f"Create a recommendation for a restaurant and dishes based on the desired cuisine: {cuisine}",
+        actual_output=recommendation,
+        retrieval_context=[str(restaurants), str(menu_items)],
+        model="gpt-4",
+    )
+    return recommendation
+
+if __name__ == "__main__":
+    cuisine = input("What kind of food would you like to eat? ")
+    recommendation = asyncio.run(get_food_recommendations(cuisine))
+    print("\nHere are my recommendations:\n")
+    print(recommendation)
--- a/src/demo/cookbooks/openai_travel_agent/agent.py
+++ b/src/demo/cookbooks/openai_travel_agent/agent.py
@@ -10,9 +10,11 @@ from chromadb.utils import embedding_functions
 from judgeval.common.tracer import Tracer, wrap
 from demo.cookbooks.openai_travel_agent.populate_db import destinations_data
 from demo.cookbooks.openai_travel_agent.tools import search_tavily
+from judgeval.scorers import AnswerRelevancyScorer, FaithfulnessScorer
+
 
 client = wrap(openai.Client(api_key=os.getenv("OPENAI_API_KEY")))
-judgment = Tracer()
+judgment = Tracer(api_key=os.getenv("JUDGMENT_API_KEY"), project_name="travel_agent_demo")
 
 def populate_vector_db(collection, destinations_data):
     """
@@ -45,6 +47,12 @@ async def get_flights(destination):
     """Search for flights to the destination."""
     prompt = f"Flights to {destination} from major cities"
     flights_search = search_tavily(prompt)
+    judgment.get_current_trace().async_evaluate(
+        scorers=[AnswerRelevancyScorer(threshold=0.5)],
+        input=prompt,
+        actual_output=flights_search,
+        model="gpt-4",
+    )
     return flights_search
 
 @judgment.observe(span_type="tool")
@@ -52,6 +60,12 @@ async def get_weather(destination, start_date, end_date):
     """Search for weather information."""
     prompt = f"Weather forecast for {destination} from {start_date} to {end_date}"
     weather_search = search_tavily(prompt)
+    judgment.get_current_trace().async_evaluate(
+        scorers=[AnswerRelevancyScorer(threshold=0.5)],
+        input=prompt,
+        actual_output=weather_search,
+        model="gpt-4",
+    )
     return weather_search
 
 def initialize_vector_db():
@@ -125,21 +139,23 @@ async def create_travel_plan(destination, start_date, end_date, research_data):
         {"role": "user", "content": prompt}
         ]
     ).choices[0].message.content
+
+    judgment.get_current_trace().async_evaluate(
+        scorers=[FaithfulnessScorer(threshold=0.5)],
+        input=prompt,
+        actual_output=response,
+        retrieval_context=[str(vector_db_context), str(research_data)],
+        model="gpt-4",
+    )
 
     return response
 
-
+@judgment.observe(span_type="Main Function", overwrite=True)
 async def generate_itinerary(destination, start_date, end_date):
     """Main function to generate a travel itinerary."""
-
-
-
-        project_name="travel_agent_demo"
-    ) as trace:
-        research_data = await research_destination(destination, start_date, end_date)
-        res = await create_travel_plan(destination, start_date, end_date, research_data)
-        trace.save()
-        return res
+    research_data = await research_destination(destination, start_date, end_date)
+    res = await create_travel_plan(destination, start_date, end_date, research_data)
+    return res
 
 
 if __name__ == "__main__":
--- a/src/demo/customer_use/cstone/faithfulness_testing.py
+++ b/src/demo/customer_use/cstone/faithfulness_testing.py
@@ -53,10 +53,10 @@ def run_judgment_evaluation(examples: List[Example]):
     scorer = FaithfulnessScorer(threshold=1.0)
 
     output = client.run_evaluation(
-        model="osiris-
+        model="osiris-large",
         examples=examples,
         scorers=[scorer],
-        eval_run_name="cstone-basic-test-osiris-
+        eval_run_name="cstone-basic-test-osiris-large-1",
         project_name="cstone_faithfulness_testing",
         override=True,
     )
@@ -66,7 +66,7 @@ def run_judgment_evaluation(examples: List[Example]):
         score = result.scorers_data[0].score
         scores.append(score)
 
-    return [score <
+    return [score < 1 for score in scores]
 
 def run_patronus_evaluation(examples: List[Example]):
     """
@@ -94,7 +94,7 @@ def run_patronus_evaluation(examples: List[Example]):
 
     print(f"patronus scores: {scores}")
 
-    return [score < 0.
+    return [score < 0.9 for score in scores]
 
 def evaluate_predictions(predictions):
     """Calculate metrics comparing predictions to gold labels"""