judgeval 0.0.35__tar.gz → 0.0.36__tar.gz
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- {judgeval-0.0.35 → judgeval-0.0.36}/.github/workflows/ci.yaml +3 -3
- {judgeval-0.0.35 → judgeval-0.0.36}/PKG-INFO +1 -2
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/api_reference/judgment_client.mdx +46 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/api_reference/trace.mdx +8 -41
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/clustering/clustering.mdx +6 -2
- judgeval-0.0.36/docs/compliance/certifications.mdx +47 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/data_datasets.mdx +68 -0
- judgeval-0.0.36/docs/evaluation/experiment_comparisons.mdx +143 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/getting_started.mdx +44 -11
- judgeval-0.0.36/docs/images/experiment-comparison-page-2.png +0 -0
- judgeval-0.0.36/docs/images/experiment-page-comparison.png +0 -0
- judgeval-0.0.36/docs/images/experiment-popout-comparison.png +0 -0
- judgeval-0.0.36/docs/images/experiments-page-comparison-2.png +0 -0
- judgeval-0.0.36/docs/images/experiments-page-comparison.png +0 -0
- judgeval-0.0.36/docs/images/export-dataset.png +0 -0
- judgeval-0.0.36/docs/images/synth_data_button.png +0 -0
- judgeval-0.0.36/docs/images/synth_data_window.png +0 -0
- judgeval-0.0.36/docs/integration/langgraph.mdx +126 -0
- judgeval-0.0.36/docs/judgment_cli/installation.mdx +91 -0
- judgeval-0.0.36/docs/judgment_cli/self-hosting.mdx +190 -0
- judgeval-0.0.36/docs/judgment_cli/supabase-org-id.png +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/mint.json +35 -3
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/monitoring/tracing.mdx +71 -24
- judgeval-0.0.36/docs/monitoring/tracing_s3.mdx +60 -0
- judgeval-0.0.35/docs/mcp_server/mcp_server.mdx → judgeval-0.0.36/docs/optimization/osiris_agent.mdx +29 -17
- judgeval-0.0.36/docs/self_hosting/get_started.mdx +73 -0
- judgeval-0.0.36/docs/synthetic_data/synthetic_data.mdx +66 -0
- judgeval-0.0.36/docusaurus/my-website/sidebars.ts +1 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/pyproject.toml +2 -3
- judgeval-0.0.36/src/demo/demo.py +50 -0
- judgeval-0.0.36/src/demo/hehe.py +19 -0
- judgeval-0.0.36/src/demo/human_in_the_loop/human_in_the_loop.py +195 -0
- judgeval-0.0.36/src/demo/human_in_the_loop/test.yaml +17 -0
- judgeval-0.0.36/src/demo/langgraph_demo.py +269 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/demo/travel_agent.py +20 -10
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/common/tracer.py +352 -118
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/constants.py +3 -2
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/data/datasets/dataset.py +3 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/data/datasets/eval_dataset_client.py +63 -3
- judgeval-0.0.36/src/judgeval/integrations/langgraph.py +1999 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/judgment_client.py +8 -2
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/run_evaluation.py +67 -18
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/score.py +1 -0
- judgeval-0.0.36/src/output.txt +276 -0
- judgeval-0.0.36/src/test.py +249 -0
- judgeval-0.0.35/docs/integration/langgraph.mdx +0 -53
- judgeval-0.0.35/src/demo/demo.py +0 -54
- judgeval-0.0.35/src/judgeval/integrations/langgraph.py +0 -337
- judgeval-0.0.35/src/test.py +0 -143
- {judgeval-0.0.35 → judgeval-0.0.36}/.github/pull_request_template.md +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/.gitignore +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/LICENSE.md +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/Pipfile +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/Pipfile.lock +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/README.md +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/README.md +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/alerts/notifications.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/alerts/platform_notifications.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/alerts/rules.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/changelog/2025-04-21.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/development.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/essentials/code.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/essentials/images.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/essentials/markdown.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/essentials/navigation.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/essentials/reusable-snippets.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/essentials/settings.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/data_examples.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/data_sequences.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/introduction.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/judges.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/agent/derailment.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/classifier_scorer.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/custom_scorers.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/default/answer_correctness.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/default/answer_relevancy.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/default/comparison.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/default/contextual_precision.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/default/contextual_recall.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/default/contextual_relevancy.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/default/execution_order.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/default/faithfulness.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/default/groundedness.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/default/json_correctness.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/default/summarization.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/scorers/introduction.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/unit_testing.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/favicon.svg +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/annotation_queue_ui.png +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/basic_trace_example.png +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/checks-passed.png +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/cluster.png +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/cluster_button.png +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/create_aggressive_scorer.png +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/create_scorer.png +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/dashboard_annotation_queue_button.png +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/evaluation_diagram.png +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/hero-dark.svg +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/hero-light.svg +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/notifications_page.png +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/online_eval_fault.png +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/reports_modal.png +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/images/trace_ss.png +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/introduction.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/logo/dark.svg +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/logo/light.svg +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/monitoring/annotations.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/monitoring/introduction.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/monitoring/production_insights.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/notebooks/create_dataset.ipynb +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/notebooks/create_scorer.ipynb +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/notebooks/demo.ipynb +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/notebooks/prompt_scorer.ipynb +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/notebooks/quickstart.ipynb +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/quickstart.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/docs/snippets/snippet-intro.mdx +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/pytest.ini +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/demo/custom_scorer/main.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/demo/custom_scorer/scorer.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/demo/dataset.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/demo/demo2.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/demo/new_bot/basic_bot.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/demo/simple_trace.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/demo/simplified_tracing/example_complex_async.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/demo/streaming_anthropic_demo.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/demo/streaming_openai_demo.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/demo/test.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/__init__.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/clients.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/common/__init__.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/common/exceptions.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/common/logger.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/common/s3_storage.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/common/utils.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/data/__init__.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/data/custom_example.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/data/datasets/__init__.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/data/example.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/data/result.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/data/scorer_data.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/data/sequence.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/data/sequence_run.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/evaluation_run.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/judges/__init__.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/judges/base_judge.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/judges/litellm_judge.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/judges/mixture_of_judges.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/judges/together_judge.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/judges/utils.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/rules.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/__init__.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/api_scorer.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/exceptions.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorer.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/__init__.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/__init__.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_correctness.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_relevancy.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/comparison.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_precision.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_recall.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_relevancy.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/derailment_scorer.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/execution_order.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/faithfulness.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/groundedness.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/hallucination.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/instruction_adherence.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/json_correctness.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/api_scorers/summarization.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/classifiers/__init__.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/classifiers/text2sql/__init__.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/judgeval_scorers/classifiers/text2sql/text2sql_scorer.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/prompt_scorer.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/scorers/utils.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/tracer/__init__.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/utils/alerts.py +0 -0
- {judgeval-0.0.35 → judgeval-0.0.36}/src/judgeval/version_check.py +0 -0

{judgeval-0.0.35 → judgeval-0.0.36}/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: judgeval
-Version: 0.0.35
+Version: 0.0.36
 Summary: Judgeval Package
 Project-URL: Homepage, https://github.com/JudgmentLabs/judgeval
 Project-URL: Issues, https://github.com/JudgmentLabs/judgeval/issues
@@ -11,7 +11,6 @@ Classifier: Operating System :: OS Independent
 Classifier: Programming Language :: Python :: 3
 Requires-Python: >=3.11
 Requires-Dist: anthropic
-Requires-Dist: boto3==1.38.3
 Requires-Dist: fastapi
 Requires-Dist: google-genai
 Requires-Dist: langchain

{judgeval-0.0.35 → judgeval-0.0.36}/docs/api_reference/judgment_client.mdx
@@ -31,6 +31,8 @@ const client = JudgmentClient.getInstance();
 
 ## Running an Evaluation
 
+### Example Level
+
 The `client.run_evaluation` (Python) or `client.evaluate` (Typescript) method is the primary method for executing evaluations.
 
 <CodeGroup>
@@ -99,3 +101,47 @@ In Judgment, **projects** are used to organize workflows, while **evaluation run
 are used to group versions of a workflow for comparative analysis of evaluations.
 As a result, you can think of projects as folders, and evaluation runs as sub-folders that contain evaluation results.
 </Tip>
+
+
+### Sequence Level
+
+The `client.run_sequence_evaluation` (Python) or `client.evaluateSequence` (Typescript) method is the primary method for executing sequence evaluations.
+
+<CodeGroup>
+```Python Python
+from judgeval import JudgmentClient
+from judgeval.data import Example, Sequence
+from judgeval.scorers import DerailmentScorer
+
+client = JudgmentClient()
+
+airlines_example = Example(
+    input="Which airlines fly to Paris?",
+    actual_output="Air France, Delta, and American Airlines offer direct flights."
+)
+airline_followup = Example(
+    input="Which airline is the best for a family of 4?",
+    actual_output="Delta is the best airline for a family of 4."
+)
+weather_example = Example(
+    input="What is the weather like in Texas?",
+    actual_output="It's sunny with a high of 75°F in Texas."
+)
+airline_sequence = Sequence(
+    name="Flight Details",
+    items=[airlines_example, airline_followup, weather_example]
+)
+
+results = client.run_sequence_evaluation(
+    sequences=[airline_sequence],
+    scorers=[DerailmentScorer(threshold=0.5)],
+    model="gpt-4o",
+    log_results=True,
+    override=True,
+)
+```
+</CodeGroup>
+
+The `run_sequence_evaluation` (Python) / `evaluateSequence` (Typescript) method accepts the same arguments as the `run_evaluation` (Python) / `evaluate` (Typescript) method, with the following changes to the arguments:
+
+- `sequences`: A list/array of [Sequence](/evaluation/data_examples) objects to evaluate (instead of 'examples')

{judgeval-0.0.35 → judgeval-0.0.36}/docs/api_reference/trace.mdx
@@ -92,34 +92,24 @@ The `TraceClient` object manages the context of a single trace context (or workf
 
 ## Tracing functions (`@observe` / `observe()`)
 
-
-
+Each intermediate function or coroutine you want to trace is wrapped with the `@judgment.observe()` decorator (Python) or the `tracer.observe()` higher-order function (Typescript).
 **If you use multiple decorators in Python**, the `@judgment.observe()` decorator should be the innermost decorator to preserve functionality.
 
-Here's an example using
+Here's an example using `observe`:
 
 <CodeGroup>
 ```Python Python
-#
-
-
-# Only need to observe the top-level function
-@judgment.observe(span_type="function")
-def main_workflow(query: str):
-    # All function calls inside will be automatically traced
-    result = my_tool(query)
-    return process_result(result)
+# Assume judgment = Tracer(...) exists
+from langchain.tools import tool # Example other decorator
 
-
+@tool
+@judgment.observe(span_type="tool")
 def my_tool(query: str):
+    # ... tool logic ...
     print(f"Tool executed with query: {query}")
    return "Tool result"
 
-#
-def process_result(result: str):
-    return f"Processed: {result}"
-
-# Calling main_workflow("some query") will trace the entire call stack
+# Calling my_tool("some query") will now be traced.
 ```
 ```Typescript Typescript
 // Assume tracer = Tracer.getInstance(...) exists
@@ -144,29 +134,6 @@ const observedMyTool = tracer.observe({ spanType: "tool" })(myTool);
 ```
 </CodeGroup>
 
-You can also disable deep tracing if you prefer manual control:
-
-<CodeGroup>
-```Python Python
-# Disable deep tracing
-judgment = Tracer(project_name="my_project")
-
-# Even with deep tracing globally enabled (default)
-# You can disable it for specific functions so judgment would not trace any functions this selective_function calls
-@judgment.observe(span_type="function", deep_tracing=False)
-def selective_function():
-    helper_function() # Won't be traced automatically
-    return "Done"
-```
-```Typescript Typescript
-// Disable deep tracing
-const judgment = Tracer.getInstance({
-  projectName: "my_project",
-  deepTracing: false // Disable automatic deep tracing
-});
-```
-</CodeGroup>
-
 The `span_type` / `spanType` argument can be used to categorize the type of span for observability purposes and will be displayed
 on the Judgment platform:
 

{judgeval-0.0.35 → judgeval-0.0.36}/docs/clustering/clustering.mdx
@@ -14,7 +14,9 @@ Clustering visualization helps you:
 - Explore data points to understand cluster characteristics
 - Compare results across different evaluation sets, traces, or datasets
 
-
+<Frame>
+  <img src="/images/cluster.png" alt="Clustering visualization example" />
+</Frame>
 
 ## Accessing Clustering Visualization
 
@@ -31,7 +33,9 @@ You can access clustering visualization in three different contexts:
 1. **Select a project**: Choose your project from the projects page of the platform website.
 2. **Choose data source**: From the project page, we can click into experiments, monitoring traces, or datasets and choose the option to cluster from within the respective pages. The visualization will display data from the specified source.
 
-
+<Frame>
+  <img src="/images/cluster_button.png" alt="Clustering visualization button location" />
+</Frame>
 
 ### Interacting with the Visualization
 

judgeval-0.0.36/docs/compliance/certifications.mdx
@@ -0,0 +1,47 @@
+---
+title: Security & Compliance
+---
+
+At Judgment Labs, we take security and compliance seriously. We maintain rigorous standards to protect our customers' data and ensure the highest level of service reliability.
+
+## SOC 2 Compliance
+
+### Type 1 Certification
+We have successfully completed our SOC 2 Type 1 audit, demonstrating our commitment to security, availability, and confidentiality. This certification verifies that our security controls are appropriately designed and implemented.
+
+<Note>
+View our [SOC 2 Type 1 Report](https://app.delve.co/judgment-labs)
+</Note>
+
+### Type 2 Certification (In Progress)
+We are currently undergoing our SOC 2 Type 2 audit, which will validate the operational effectiveness of our security controls over time. This comprehensive audit examines our systems and processes over an extended period to ensure consistent adherence to security protocols.
+
+<Note>
+The SOC 2 Type 2 audit is expected to be completed in the coming months. Once completed, the report will be available through our [Delve compliance portal](https://app.delve.co/judgment-labs).
+</Note>
+
+## HIPAA Compliance
+
+We maintain HIPAA compliance to ensure the security and privacy of protected health information (PHI). Our infrastructure and processes are designed to meet HIPAA's strict requirements for:
+- Data encryption
+- Access controls
+- Audit logging
+- Data backup and recovery
+- Security incident handling
+
+<Tip>
+Access our [HIPAA Compliance Report](https://app.delve.co/judgment-labs) through our compliance portal. If you're working with healthcare data, please contact our team at contact@judgmentlabs.ai to discuss your specific compliance needs.
+</Tip>
+
+## Our Commitment
+
+Our security and compliance certifications reflect our commitment to:
+- Protecting customer data
+- Maintaining system availability
+- Ensuring process integrity
+- Preserving confidentiality
+- Following industry best practices
+
+For detailed information about our security practices or compliance certifications, please:
+1. Visit our [Compliance Portal](https://app.delve.co/judgment-labs)
+2. Contact our security team at contact@judgmentlabs.ai

{judgeval-0.0.35 → judgeval-0.0.36}/docs/evaluation/data_datasets.mdx
@@ -279,6 +279,74 @@ const results = await client.evaluate({
 ```
 </CodeGroup>
 
+## Exporting Datasets
+
+You can export your datasets from the Judgment Platform UI for backup purposes or sharing with team members.
+
+### Export from Platform UI
+
+1. Navigate to your project in the [Judgment Platform](https://app.judgmentlabs.ai)
+2. Select the dataset you want to export
+3. Click the "Download Dataset" button in the top right
+4. The dataset will be downloaded as a JSON file
+
+<Frame>
+  <img src="/images/export-dataset.png" alt="Export Dataset" />
+</Frame>
+
+The exported JSON file contains the complete dataset information, including metadata and examples:
+
+```json
+{
+  "dataset_id": "f852eeee-87fa-4430-9571-5784e693326e",
+  "organization_id": "0fbb0aa8-a7b3-4108-b92a-cc6c6800d825",
+  "dataset_alias": "QA-Pairs",
+  "comments": null,
+  "source_file": null,
+  "created_at": "2025-04-23T22:38:11.709763+00:00",
+  "is_sequence": false,
+  "examples": [
+    {
+      "example_id": "119ee1f6-1046-41bc-bb89-d9fc704829dd",
+      "input": "How can I start meditating?",
+      "actual_output": null,
+      "expected_output": "Meditation is a wonderful way to relax and focus...",
+      "context": null,
+      "retrieval_context": null,
+      "additional_metadata": {
+        "synthetic": true
+      },
+      "tools_called": null,
+      "expected_tools": null,
+      "name": null,
+      "created_at": "2025-04-23T23:34:33.117479+00:00",
+      "dataset_id": "f852eeee-87fa-4430-9571-5784e693326e",
+      "eval_results_id": null,
+      "sequence_id": null,
+      "sequence_order": 0
+    },
+    // more examples...
+  ]
+}
+```
+
+Each example in the dataset contains:
+- `example_id`: Unique identifier for the example
+- `input`: The input query or prompt
+- `actual_output`: The response from your agent (if any)
+- `expected_output`: The expected response or ground truth
+- `context`: Additional context for the example
+- `retrieval_context`: Retrieved context used for RAG systems
+- `additional_metadata`: Custom metadata (e.g., whether the example is synthetic)
+- `tools_called`: Record of tools used in the response
+- `expected_tools`: Expected tool calls for the example
+- `created_at`: Timestamp of example creation
+- `sequence_order`: Order in sequence (if part of a sequence)
+
+<Note>
+When downloading datasets that contain sensitive information, make sure to follow your organization's data handling policies and store the exported files in secure locations.
+</Note>
+
 ## Conclusion
 
 Congratulations! 🎉
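
The export above is plain JSON, so it can be inspected without any SDK. Below is a minimal sketch that reads such a file, assuming only the fields shown in the sample export; the file name `QA-Pairs.json` is hypothetical.

```python
# Sketch: load a dataset exported from the Judgment Platform UI and list its examples.
import json

with open("QA-Pairs.json") as f:  # hypothetical name of the downloaded export
    dataset = json.load(f)

print(dataset["dataset_alias"], "contains", len(dataset["examples"]), "examples")
for ex in dataset["examples"]:
    # Synthetic examples are flagged in additional_metadata in the sample above.
    meta = ex.get("additional_metadata") or {}
    synthetic = bool(meta.get("synthetic"))
    print(f'synthetic={synthetic} input={ex["input"]!r} expected={ex["expected_output"]!r}')
```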

judgeval-0.0.36/docs/evaluation/experiment_comparisons.mdx
@@ -0,0 +1,143 @@
+---
+title: Experiment Comparisons
+description: "Learn how to A/B test changes in your LLM workflows using experiment comparisons."
+---
+
+# Introduction
+
+Experiment comparisons allow you to systematically A/B test changes in your LLM workflows. Whether you're testing different prompts, models, or architectures, Judgment helps you compare results across experiments to make data-driven decisions about your LLM systems.
+
+# Creating Your First Comparison
+
+Let's walk through how to create and run experiment comparisons:
+
+<CodeGroup>
+```Python Python
+from judgeval import JudgmentClient
+from judgeval.data import Example
+from judgeval.scorers import AnswerCorrectnessScorer
+
+client = JudgmentClient()
+
+# Define your test examples
+examples = [
+    Example(
+        input="What is the capital of France?",
+        actual_output="Paris is the capital of France.",
+        expected_output="Paris"
+    ),
+    Example(
+        input="What is the capital of Japan?",
+        actual_output="Tokyo is the capital of Japan.",
+        expected_output="Tokyo"
+    )
+]
+
+# Define your scorer
+scorer = AnswerCorrectnessScorer(threshold=0.7)
+
+# Run first experiment with GPT-4
+experiment_1 = client.run_evaluation(
+    examples=examples,
+    scorers=[scorer],
+    model="gpt-4",
+    project_name="capital_cities",
+    eval_name="gpt4_experiment"
+)
+
+# Run second experiment with a different model
+experiment_2 = client.run_evaluation(
+    examples=examples,
+    scorers=[scorer],
+    model="gpt-3.5-turbo",
+    project_name="capital_cities",
+    eval_name="gpt35_experiment"
+)
+```
+```Typescript Typescript
+import { JudgmentClient, ExampleBuilder, AnswerCorrectnessScorer } from 'judgeval';
+
+async function runComparativeExperiments() {
+  const client = JudgmentClient.getInstance();
+
+  // Define your test examples
+  const examples = [
+    new ExampleBuilder()
+      .input("What is the capital of France?")
+      .actualOutput("Paris is the capital of France.")
+      .expectedOutput("Paris")
+      .build(),
+    new ExampleBuilder()
+      .input("What is the capital of Japan?")
+      .actualOutput("Tokyo is the capital of Japan.")
+      .expectedOutput("Tokyo")
+      .build()
+  ];
+
+  // Define your scorer
+  const scorer = new AnswerCorrectnessScorer(0.7);
+
+  // Run first experiment with GPT-4
+  const experiment1 = await client.evaluate({
+    examples: examples,
+    scorers: [scorer],
+    model: "gpt-4",
+    projectName: "capital_cities",
+    evalName: "gpt4_experiment"
+  });
+
+  // Run second experiment with a different model
+  const experiment2 = await client.evaluate({
+    examples: examples,
+    scorers: [scorer],
+    model: "gpt-3.5-turbo",
+    projectName: "capital_cities",
+    evalName: "gpt35_experiment"
+  });
+}
+
+runComparativeExperiments();
+```
+</CodeGroup>
+
+After running this code, click the `View Results` link to go to your experiment run on the Judgment Platform.
+
+# Analyzing Results
+
+Once your experiments are complete, you can compare them on the Judgment Platform:
+
+1. You'll be automatically directed to your **Experiment page**. Here you'll see your latest experiment results and a "Compare" button.
+<div style={{display: 'flex', justifyContent: 'center'}}>
+<Frame>
+
+</Frame>
+</div>
+
+2. Click the "Compare" button to navigate to the **Experiments page**. Here you can select a previous experiment to compare against your current results.
+<div style={{display: 'flex', justifyContent: 'center'}}>
+<Frame>
+
+</Frame>
+</div>
+
+3. After selecting an experiment, you'll return to the **Experiment page** with both experiments' results displayed side by side.
+<div style={{display: 'flex', justifyContent: 'center'}}>
+<Frame>
+
+</Frame>
+</div>
+
+4. For detailed insights, click on any row in the comparison table to see specific metrics and analysis.
+<div style={{display: 'flex', justifyContent: 'center'}}>
+<Frame>
+
+</Frame>
+</div>
+
+<Tip>
+Use these detailed comparisons to make data-driven decisions about which model, prompt, or architecture performs best for your specific use case.
+</Tip>
+
+# Next Steps
+
+- To learn more about creating datasets to run on your experiments, check out our [Datasets](/evaluation/datasets) section
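
The two runs above differ only in `model` and `eval_name`, so the same A/B pattern extends to any number of variants by looping over `run_evaluation`. A minimal sketch, using only the arguments shown in the Python example above; the `eval_name` naming scheme is an assumption.

```python
# Sketch: one evaluation run per candidate model, all in the same project,
# so the runs can be compared side by side on the Judgment Platform.
from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import AnswerCorrectnessScorer

client = JudgmentClient()

examples = [
    Example(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France.",
        expected_output="Paris"
    ),
]
scorer = AnswerCorrectnessScorer(threshold=0.7)

for model in ["gpt-4", "gpt-3.5-turbo"]:  # candidate models to A/B test
    client.run_evaluation(
        examples=examples,
        scorers=[scorer],
        model=model,
        project_name="capital_cities",
        eval_name=f"{model}_experiment",  # assumed naming convention
    )
```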

{judgeval-0.0.35 → judgeval-0.0.36}/docs/getting_started.mdx
@@ -120,12 +120,14 @@ from judgeval.common.tracer import Tracer, wrap
 from openai import OpenAI
 
 client = wrap(OpenAI())
-judgment = Tracer(project_name="my_project")
+judgment = Tracer(project_name="my_project")
+
+@judgment.observe(span_type="tool")
+def my_tool():
+    return "Hello world!"
 
-# With automatic deep tracing, you only need to observe the top-level function
 @judgment.observe(span_type="function")
 def main():
-    # my_tool will be automatically traced without @observe
     task_input = my_tool()
     res = client.chat.completions.create(
         model="gpt-4o",
@@ -133,10 +135,6 @@ def main():
     )
     return res.choices[0].message.content
 
-# No @observe needed - automatically traced when called from main
-def my_tool():
-    return "Hello world!"
-
 # Calling the observed function implicitly starts and saves the trace
 main()
 ```
@@ -183,8 +181,6 @@ runImplicitTrace();
 ```
 </CodeGroup>
 
-With automatic deep tracing, you only need to observe top-level functions, and all nested function calls will be automatically traced. This significantly reduces the amount of instrumentation needed in your code.
-
 Congratulations! You've just created your first trace. It should look like this:
 
 <div style={{display: 'flex', justifyContent: 'center'}}>
@@ -200,6 +196,40 @@ There are many benefits of monitoring your LLM systems with `judgeval` tracing,
 To learn more about `judgeval`'s tracing module, click [here](/tracing/introduction).
 </Tip>
 
+## Automatic Deep Tracing
+
+Judgeval supports automatic deep tracing, which significantly reduces the amount of instrumentation needed in your code. With deep tracing enabled (which is the default), you only need to observe top-level functions, and all nested function calls will be automatically traced.
+
+<CodeGroup>
+```Python Python
+from judgeval.tracer import Tracer, wrap
+from openai import OpenAI
+
+client = wrap(OpenAI())
+judgment = Tracer(project_name="my_project")
+
+# Define a function that will be automatically traced when called from main
+def helper_function():
+    return "This will be traced automatically"
+
+# Only need to observe the top-level function
+@judgment.observe(span_type="function")
+def main():
+    # helper_function will be automatically traced without @observe
+    result = helper_function()
+    res = client.chat.completions.create(
+        model="gpt-4o",
+        messages=[{"role": "user", "content": result}]
+    )
+    return res.choices[0].message.content
+
+main()
+```
+</CodeGroup>
+
+To disable deep tracing, initialize the tracer with `deep_tracing=False`. You can still name and declare span types for each function using `judgment.observe()`.
+
+
 # Create Your First Online Evaluation
 
 In addition to tracing, `judgeval` allows you to run online evaluations on your LLM systems. This enables you to:
@@ -229,11 +259,14 @@ def main():
         messages=[{"role": "user", "content": f"{task_input}"}]
     ).choices[0].message.content
 
+    example = Example(
+        input=task_input,
+        actual_output=res
+    )
     # In Python, this likely operates on the implicit trace context
     judgment.async_evaluate(
         scorers=[AnswerRelevancyScorer(threshold=0.5)],
-
-        actual_output=res,
+        example=example,
         model="gpt-4o"
     )
 
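
The last hunk above moves `judgment.async_evaluate` from a bare `actual_output=` argument to an `Example` object passed via `example=`. A self-contained sketch of the new calling convention, assembled only from pieces shown elsewhere in this diff; the project name and the hard-coded response are placeholders.

```python
# Sketch of the 0.0.36 async_evaluate pattern: build an Example, then pass it
# via example= instead of the removed actual_output= argument.
from judgeval.common.tracer import Tracer
from judgeval.data import Example
from judgeval.scorers import AnswerRelevancyScorer

judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="function")
def answer(question: str) -> str:
    res = "Paris is the capital of France."  # placeholder for a real LLM call
    example = Example(input=question, actual_output=res)
    judgment.async_evaluate(
        scorers=[AnswerRelevancyScorer(threshold=0.5)],
        example=example,
        model="gpt-4o"
    )
    return res

answer("What is the capital of France?")
```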
|