judgeval 0.0.52__tar.gz → 0.0.54__tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (128)
  1. judgeval-0.0.54/.github/ISSUE_TEMPLATE/bug_report.md +41 -0
  2. judgeval-0.0.54/.github/ISSUE_TEMPLATE/feature_request.md +43 -0
  3. judgeval-0.0.54/.github/pull_request_template.md +23 -0
  4. {judgeval-0.0.52 → judgeval-0.0.54}/.gitignore +6 -1
  5. {judgeval-0.0.52 → judgeval-0.0.54}/PKG-INFO +6 -5
  6. {judgeval-0.0.52 → judgeval-0.0.54}/README.md +4 -4
  7. judgeval-0.0.54/assets/agent.gif +0 -0
  8. judgeval-0.0.54/assets/data.gif +0 -0
  9. judgeval-0.0.54/assets/document.gif +0 -0
  10. judgeval-0.0.54/assets/trace.gif +0 -0
  11. {judgeval-0.0.52 → judgeval-0.0.54}/pyproject.toml +2 -1
  12. judgeval-0.0.54/src/judgeval/common/logger.py +60 -0
  13. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/common/s3_storage.py +2 -6
  14. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/common/tracer.py +182 -262
  15. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/common/utils.py +16 -36
  16. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/constants.py +14 -20
  17. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/data/__init__.py +0 -2
  18. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/data/datasets/dataset.py +6 -10
  19. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/data/datasets/eval_dataset_client.py +25 -27
  20. judgeval-0.0.54/src/judgeval/data/example.py +61 -0
  21. judgeval-0.0.54/src/judgeval/data/judgment_types.py +214 -0
  22. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/data/result.py +7 -25
  23. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/data/scorer_data.py +28 -40
  24. judgeval-0.0.54/src/judgeval/data/scripts/fix_default_factory.py +23 -0
  25. judgeval-0.0.54/src/judgeval/data/scripts/openapi_transform.py +123 -0
  26. judgeval-0.0.54/src/judgeval/data/tool.py +5 -0
  27. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/data/trace.py +31 -50
  28. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/data/trace_run.py +3 -3
  29. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/evaluation_run.py +16 -23
  30. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/integrations/langgraph.py +11 -12
  31. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/judges/litellm_judge.py +3 -6
  32. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/judges/mixture_of_judges.py +8 -25
  33. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/judges/together_judge.py +3 -6
  34. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/judgment_client.py +22 -24
  35. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/rules.py +7 -19
  36. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/run_evaluation.py +79 -242
  37. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/scorers/__init__.py +4 -20
  38. judgeval-0.0.54/src/judgeval/scorers/agent_scorer.py +21 -0
  39. judgeval-0.0.54/src/judgeval/scorers/api_scorer.py +70 -0
  40. judgeval-0.0.54/src/judgeval/scorers/base_scorer.py +98 -0
  41. judgeval-0.0.54/src/judgeval/scorers/example_scorer.py +19 -0
  42. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/scorers/judgeval_scorers/api_scorers/__init__.py +0 -20
  43. judgeval-0.0.54/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_correctness.py +21 -0
  44. judgeval-0.0.54/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_relevancy.py +12 -0
  45. judgeval-0.0.54/src/judgeval/scorers/judgeval_scorers/api_scorers/classifier_scorer.py +73 -0
  46. judgeval-0.0.54/src/judgeval/scorers/judgeval_scorers/api_scorers/derailment_scorer.py +14 -0
  47. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/scorers/judgeval_scorers/api_scorers/execution_order.py +4 -4
  48. judgeval-0.0.54/src/judgeval/scorers/judgeval_scorers/api_scorers/faithfulness.py +21 -0
  49. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/scorers/judgeval_scorers/api_scorers/hallucination.py +4 -4
  50. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/scorers/judgeval_scorers/api_scorers/instruction_adherence.py +4 -4
  51. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/scorers/judgeval_scorers/api_scorers/tool_dependency.py +4 -4
  52. judgeval-0.0.54/src/judgeval/scorers/judgeval_scorers/api_scorers/tool_order.py +27 -0
  53. judgeval-0.0.54/src/judgeval/scorers/score.py +180 -0
  54. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/scorers/utils.py +6 -88
  55. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/utils/file_utils.py +4 -6
  56. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/version_check.py +3 -2
  57. judgeval-0.0.54/src/update_types.sh +14 -0
  58. {judgeval-0.0.52 → judgeval-0.0.54}/uv.lock +901 -1221
  59. judgeval-0.0.52/.github/pull_request_template.md +0 -13
  60. judgeval-0.0.52/assets/agent.gif +0 -0
  61. judgeval-0.0.52/assets/data.gif +0 -0
  62. judgeval-0.0.52/assets/document.gif +0 -0
  63. judgeval-0.0.52/assets/trace.gif +0 -0
  64. judgeval-0.0.52/src/judgeval/common/logger.py +0 -213
  65. judgeval-0.0.52/src/judgeval/data/custom_example.py +0 -19
  66. judgeval-0.0.52/src/judgeval/data/example.py +0 -194
  67. judgeval-0.0.52/src/judgeval/data/tool.py +0 -56
  68. judgeval-0.0.52/src/judgeval/scorers/api_scorer.py +0 -80
  69. judgeval-0.0.52/src/judgeval/scorers/judgeval_scorer.py +0 -177
  70. judgeval-0.0.52/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_correctness.py +0 -28
  71. judgeval-0.0.52/src/judgeval/scorers/judgeval_scorers/api_scorers/answer_relevancy.py +0 -27
  72. judgeval-0.0.52/src/judgeval/scorers/judgeval_scorers/api_scorers/classifier_scorer.py +0 -125
  73. judgeval-0.0.52/src/judgeval/scorers/judgeval_scorers/api_scorers/comparison.py +0 -45
  74. judgeval-0.0.52/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_precision.py +0 -29
  75. judgeval-0.0.52/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_recall.py +0 -29
  76. judgeval-0.0.52/src/judgeval/scorers/judgeval_scorers/api_scorers/contextual_relevancy.py +0 -32
  77. judgeval-0.0.52/src/judgeval/scorers/judgeval_scorers/api_scorers/derailment_scorer.py +0 -22
  78. judgeval-0.0.52/src/judgeval/scorers/judgeval_scorers/api_scorers/faithfulness.py +0 -28
  79. judgeval-0.0.52/src/judgeval/scorers/judgeval_scorers/api_scorers/groundedness.py +0 -28
  80. judgeval-0.0.52/src/judgeval/scorers/judgeval_scorers/api_scorers/json_correctness.py +0 -38
  81. judgeval-0.0.52/src/judgeval/scorers/judgeval_scorers/api_scorers/summarization.py +0 -27
  82. judgeval-0.0.52/src/judgeval/scorers/judgeval_scorers/api_scorers/tool_order.py +0 -23
  83. judgeval-0.0.52/src/judgeval/scorers/prompt_scorer.py +0 -296
  84. judgeval-0.0.52/src/judgeval/scorers/score.py +0 -465
  85. {judgeval-0.0.52 → judgeval-0.0.54}/.github/workflows/blocked-pr.yaml +0 -0
  86. {judgeval-0.0.52 → judgeval-0.0.54}/.github/workflows/ci.yaml +0 -0
  87. {judgeval-0.0.52 → judgeval-0.0.54}/.github/workflows/lint.yaml +0 -0
  88. {judgeval-0.0.52 → judgeval-0.0.54}/.github/workflows/merge-branch-check.yaml +0 -0
  89. {judgeval-0.0.52 → judgeval-0.0.54}/.github/workflows/release.yaml +0 -0
  90. {judgeval-0.0.52 → judgeval-0.0.54}/.github/workflows/validate-branch.yaml +0 -0
  91. {judgeval-0.0.52 → judgeval-0.0.54}/.pre-commit-config.yaml +0 -0
  92. {judgeval-0.0.52 → judgeval-0.0.54}/LICENSE.md +0 -0
  93. {judgeval-0.0.52 → judgeval-0.0.54}/assets/Screenshot 2025-05-17 at 8.14.27 PM.png +0 -0
  94. {judgeval-0.0.52 → judgeval-0.0.54}/assets/dataset_clustering_screenshot.png +0 -0
  95. {judgeval-0.0.52 → judgeval-0.0.54}/assets/dataset_clustering_screenshot_dm.png +0 -0
  96. {judgeval-0.0.52 → judgeval-0.0.54}/assets/datasets_preview_screenshot.png +0 -0
  97. {judgeval-0.0.52 → judgeval-0.0.54}/assets/error_analysis_dashboard.png +0 -0
  98. {judgeval-0.0.52 → judgeval-0.0.54}/assets/experiments_dashboard_screenshot.png +0 -0
  99. {judgeval-0.0.52 → judgeval-0.0.54}/assets/experiments_page.png +0 -0
  100. {judgeval-0.0.52 → judgeval-0.0.54}/assets/experiments_pagev2.png +0 -0
  101. {judgeval-0.0.52 → judgeval-0.0.54}/assets/logo-dark.svg +0 -0
  102. {judgeval-0.0.52 → judgeval-0.0.54}/assets/logo-light.svg +0 -0
  103. {judgeval-0.0.52 → judgeval-0.0.54}/assets/monitoring_screenshot.png +0 -0
  104. {judgeval-0.0.52 → judgeval-0.0.54}/assets/new_darkmode.svg +0 -0
  105. {judgeval-0.0.52 → judgeval-0.0.54}/assets/new_lightmode.svg +0 -0
  106. {judgeval-0.0.52 → judgeval-0.0.54}/assets/product_shot.png +0 -0
  107. {judgeval-0.0.52 → judgeval-0.0.54}/assets/trace_demo.png +0 -0
  108. {judgeval-0.0.52 → judgeval-0.0.54}/assets/trace_screenshot.png +0 -0
  109. {judgeval-0.0.52 → judgeval-0.0.54}/assets/trace_screenshot_old.png +0 -0
  110. {judgeval-0.0.52 → judgeval-0.0.54}/pytest.ini +0 -0
  111. {judgeval-0.0.52 → judgeval-0.0.54}/src/.coveragerc +0 -0
  112. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/__init__.py +0 -0
  113. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/clients.py +0 -0
  114. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/common/__init__.py +0 -0
  115. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/common/exceptions.py +0 -0
  116. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/data/datasets/__init__.py +0 -0
  117. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/judges/__init__.py +0 -0
  118. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/judges/base_judge.py +0 -0
  119. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/judges/utils.py +0 -0
  120. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/scorers/exceptions.py +0 -0
  121. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/scorers/judgeval_scorers/__init__.py +0 -0
  122. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/scorers/judgeval_scorers/classifiers/__init__.py +0 -0
  123. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/scorers/judgeval_scorers/classifiers/text2sql/__init__.py +0 -0
  124. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/scorers/judgeval_scorers/classifiers/text2sql/text2sql_scorer.py +0 -0
  125. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/tracer/__init__.py +0 -0
  126. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/utils/alerts.py +0 -0
  127. {judgeval-0.0.52 → judgeval-0.0.54}/src/judgeval/utils/requests.py +0 -0
  128. {judgeval-0.0.52 → judgeval-0.0.54}/update_version.py +0 -0
@@ -0,0 +1,41 @@
+ ---
+ name: Bug report
+ about: Create a report to help us improve Judgeval
+ title: "[BUG]"
+ labels: potential bug
+
+ ---
+
+ ## Describe the bug
+ A clear and concise description of what the bug is.
+
+ ## To Reproduce
+ Steps to reproduce the behavior:
+ 1. Go to '...'
+ 2. Click on '....'
+ 3. Scroll down to '....'
+ 4. See error
+
+ ## Expected behavior
+ A clear and concise description of what you expected to happen.
+
+ ## Screenshots
+ If applicable, add screenshots to help explain your problem.
+
+ ## Environment (please complete the following information):
+ - OS: [e.g. MacOS, Linux, Windows]
+ - Browser (if website issue): [e.g. Chrome, Safari, Firefox]
+ - Browser Version (if website issue): [e.g. 22]
+ - SDK Version: [e.g. 1.2.3]
+ - Programming Language/Runtime (if SDK issue): [e.g. Python 3.11, Python 3.12, etc.]
+ - Package Manager (if SDK issue): [e.g. uv, pip, pipenv]
+
+ ## Additional context
+ Add any other context about the problem here.
+
+ ## Are you interested to contribute a fix for this bug?
+ If this is a confirmed bug, the Judgment community is happy to support with guidance and review via [Discord](https://discord.com/invite/tGVFf8UBUY).
+
+ - [ ] Yes
+ - [ ] No
+
@@ -0,0 +1,43 @@
+ ---
+ name: Feature Request
+ about: Suggest an idea for Judgeval
+ title: "[FEATURE]"
+ labels: feature-request
+
+ ---
+
+ ## Is your feature request related to a problem? Please describe.
+ A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+
+ ## Describe the solution you'd like
+ A clear and concise description of what you want to happen.
+
+ ## Describe alternatives you've considered
+ A clear and concise description of any alternative solutions or features you've considered.
+
+ ## Which component(s) does this affect?
+ - [ ] SDK (open for community contributions)
+ - [ ] Website (internal development only)
+ - [ ] Documentation (open for community contributions)
+ - [ ] Not sure
+
+ ## Use case and impact
+ Describe your specific use case and how this feature would benefit you or other users. Include:
+ - How often would you use this feature?
+ - How many users might benefit from this?
+ - Is this blocking your current implementation?
+
+ ## Proposed API/Interface (if applicable)
+ If you have ideas about how this feature should be exposed (API methods, UI elements, etc.), please describe them here.
+
+ ## Additional context
+ Add any other context, screenshots, code examples, or links to related issues/discussions about the feature request here.
+
+ ## Are you interested in contributing this feature?
+ The Judgment community is happy to provide guidance and review for contributions via [Discord](https://discord.com/invite/tGVFf8UBUY).
+
+ - [ ] Yes, I'd like to implement this
+ - [ ] Yes, I'd like to help with design/planning
+ - [ ] No, but I'd be happy to test it
+ - [ ] No
+
@@ -0,0 +1,23 @@
+ ## 📝 Summary
+
+ <!-- Add your list of changes, make it a list to improve the PR reviewers' experience. Ie:
+ - [ ] 1. Remove duplicate filter table
+ - [ ] 2. Reenabled filtering on new ExperimentRunsTableClient component, reapplied filtering changes
+ - [ ] 3. Added only search and filter when enter is pressed or apply filter is pressed
+ - [ ] 4. Error message for applying incomplete filters
+ - [ ] 5. Deletion should now work again for table
+ - [ ] 6. Comparison should now work again for table
+ -->
+ - [ ] 1. ...
+
+ ## 🎥 Demo of Changes
+
+ <!-- Add a short 1-3 minute video describing/demoing the changes -->
+
+ ## ✅ Checklist
+
+ - [ ] Tagged Linear ticket in PR title. Ie. PR Title (JUD-XXXX)
+ - [ ] Video demo of changes
+ - [ ] Reviewers assigned
+ - [ ] Docs updated ([if necessary](https://github.com/JudgmentLabs/docs))
+ - [ ] Cookbooks updated ([if necessary](https://github.com/JudgmentLabs/judgment-cookbook))
@@ -110,4 +110,9 @@ test-results.xml

  # Logs
  ./logs
- demo
+ demo
+
+ # OpenAPI json file
+ src/judgeval/data/openapi_new.json
+
+ CLAUDE.md
@@ -1,6 +1,6 @@
  Metadata-Version: 2.4
  Name: judgeval
- Version: 0.0.52
+ Version: 0.0.54
  Summary: Judgeval Package
  Project-URL: Homepage, https://github.com/JudgmentLabs/judgeval
  Project-URL: Issues, https://github.com/JudgmentLabs/judgeval/issues
@@ -12,6 +12,7 @@ Classifier: Programming Language :: Python :: 3
  Requires-Python: >=3.11
  Requires-Dist: anthropic
  Requires-Dist: boto3
+ Requires-Dist: datamodel-code-generator>=0.31.1
  Requires-Dist: google-genai
  Requires-Dist: langchain-anthropic
  Requires-Dist: langchain-core
@@ -150,10 +151,10 @@ You'll see your trace exported to the Judgment Platform:

  | | |
  |:---|:---:|
- | <h3>🔍 Tracing</h3>Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic): **tracking inputs/outputs, agent tool calls, latency, and cost** at every step.<br><br>Online evals can be applied to traces to measure quality on production data in real-time. Export data per individual trace for detailed analysis.<br><br>**Useful for:**<br>• 🐛 Debugging agent runs <br>• 📋 Collecting agent environment data <br>• 🔬 Pinpointing performance bottlenecks| <p align="center"><img src="assets/trace_screenshot.png" alt="Tracing visualization" width="1200"/></p> |
- | <h3>🧪 Evals</h3>Evals are the key to regression testing for agents. Judgeval provides 15+ research-backed metrics including tool call accuracy, hallucinations, instruction adherence, and retrieval context recall.<br><br>Judgeval supports LLM-as-a-judge, manual labeling, and custom evaluators that connect with our metric-tracking infrastructure. <br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 Experimental prompt testing<br>• 🛡️ Online guardrails | <p align="center"><img src="assets/experiments_page.png" alt="Evaluation metrics" width="800"/></p> |
- | <h3>📡 Monitoring</h3>Track all your agent metrics in production. **Catch production regressions early.**<br><br>Configure alerts to trigger automated actions when metric thresholds are exceeded (add agent trace to review queue/dataset, Slack notification, etc.).<br><br> **Useful for:** <br>• 📉 Identifying degradation early <br>• 📈 Visualizing performance trends across agent versions and time | <p align="center"><img src="assets/error_analysis_dashboard.png" alt="Monitoring Dashboard" width="1200"/></p> |
- | <h3>📊 Datasets</h3>Export comprehensive agent-environment interaction data or import external testcases to datasets for scaled analysis and optimization. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations, enabling continuous learning from production interactions. <br><br> **Useful for:**<br>• 🗃️ Agent environment interaction data for optimization<br>• 🔄 Scaled analysis for A/B tests | <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |
+ | <h3>🔍 Tracing</h3>Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic). **Tracks inputs/outputs, agent tool calls, latency, cost, and custom metadata** at every step.<br><br>**Useful for:**<br>• 🐛 Debugging agent runs <br>• 📋 Collecting agent environment data <br>• 🔬 Pinpointing performance bottlenecks| <p align="center"><img src="assets/trace_screenshot.png" alt="Tracing visualization" width="1200"/></p> |
+ | <h3>🧪 Evals</h3>Build custom evaluators on top of your agents. Judgeval supports LLM-as-a-judge, manual labeling, and code-based evaluators that connect with our metric-tracking infrastructure. <br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 A/B testing <br>• 🛡️ Online guardrails | <p align="center"><img src="assets/experiments_page.png" alt="Evaluation metrics" width="800"/></p> |
+ | <h3>📡 Monitoring</h3>Get Slack alerts when your agent fails in production. Add custom hooks to address production regressions.<br><br> **Useful for:** <br>• 📉 Identifying degradation early <br>• 📈 Visualizing performance trends across agent versions and time | <p align="center"><img src="assets/error_analysis_dashboard.png" alt="Monitoring Dashboard" width="1200"/></p> |
+ | <h3>📊 Datasets</h3>Export traces and test cases to datasets for scaled analysis and optimization. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations, enabling continuous learning from production interactions. <br><br> **Useful for:**<br>• 🗃️ Agent environment interaction data for optimization<br>• 🔄 Scaled analysis for A/B tests | <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |

  ## 🏢 Self-Hosting

@@ -121,10 +121,10 @@ You'll see your trace exported to the Judgment Platform:

  | | |
  |:---|:---:|
- | <h3>🔍 Tracing</h3>Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic): **tracking inputs/outputs, agent tool calls, latency, and cost** at every step.<br><br>Online evals can be applied to traces to measure quality on production data in real-time. Export data per individual trace for detailed analysis.<br><br>**Useful for:**<br>• 🐛 Debugging agent runs <br>• 📋 Collecting agent environment data <br>• 🔬 Pinpointing performance bottlenecks| <p align="center"><img src="assets/trace_screenshot.png" alt="Tracing visualization" width="1200"/></p> |
- | <h3>🧪 Evals</h3>Evals are the key to regression testing for agents. Judgeval provides 15+ research-backed metrics including tool call accuracy, hallucinations, instruction adherence, and retrieval context recall.<br><br>Judgeval supports LLM-as-a-judge, manual labeling, and custom evaluators that connect with our metric-tracking infrastructure. <br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 Experimental prompt testing<br>• 🛡️ Online guardrails | <p align="center"><img src="assets/experiments_page.png" alt="Evaluation metrics" width="800"/></p> |
- | <h3>📡 Monitoring</h3>Track all your agent metrics in production. **Catch production regressions early.**<br><br>Configure alerts to trigger automated actions when metric thresholds are exceeded (add agent trace to review queue/dataset, Slack notification, etc.).<br><br> **Useful for:** <br>• 📉 Identifying degradation early <br>• 📈 Visualizing performance trends across agent versions and time | <p align="center"><img src="assets/error_analysis_dashboard.png" alt="Monitoring Dashboard" width="1200"/></p> |
- | <h3>📊 Datasets</h3>Export comprehensive agent-environment interaction data or import external testcases to datasets for scaled analysis and optimization. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations, enabling continuous learning from production interactions. <br><br> **Useful for:**<br>• 🗃️ Agent environment interaction data for optimization<br>• 🔄 Scaled analysis for A/B tests | <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |
+ | <h3>🔍 Tracing</h3>Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic). **Tracks inputs/outputs, agent tool calls, latency, cost, and custom metadata** at every step.<br><br>**Useful for:**<br>• 🐛 Debugging agent runs <br>• 📋 Collecting agent environment data <br>• 🔬 Pinpointing performance bottlenecks| <p align="center"><img src="assets/trace_screenshot.png" alt="Tracing visualization" width="1200"/></p> |
+ | <h3>🧪 Evals</h3>Build custom evaluators on top of your agents. Judgeval supports LLM-as-a-judge, manual labeling, and code-based evaluators that connect with our metric-tracking infrastructure. <br><br>**Useful for:**<br>• ⚠️ Unit-testing <br>• 🔬 A/B testing <br>• 🛡️ Online guardrails | <p align="center"><img src="assets/experiments_page.png" alt="Evaluation metrics" width="800"/></p> |
+ | <h3>📡 Monitoring</h3>Get Slack alerts when your agent fails in production. Add custom hooks to address production regressions.<br><br> **Useful for:** <br>• 📉 Identifying degradation early <br>• 📈 Visualizing performance trends across agent versions and time | <p align="center"><img src="assets/error_analysis_dashboard.png" alt="Monitoring Dashboard" width="1200"/></p> |
+ | <h3>📊 Datasets</h3>Export traces and test cases to datasets for scaled analysis and optimization. Move datasets to/from Parquet, S3, etc. <br><br>Run evals on datasets as unit tests or to A/B test different agent configurations, enabling continuous learning from production interactions. <br><br> **Useful for:**<br>• 🗃️ Agent environment interaction data for optimization<br>• 🔄 Scaled analysis for A/B tests | <p align="center"><img src="assets/datasets_preview_screenshot.png" alt="Dataset management" width="1200"/></p> |

  ## 🏢 Self-Hosting

Binary file
Binary file
Binary file
Binary file
@@ -1,6 +1,6 @@
  [project]
  name = "judgeval"
- version = "0.0.52"
+ version = "0.0.54"
  authors = [
  { name="Andrew Li", email="andrew@judgmentlabs.ai" },
  { name="Alex Shan", email="alex@judgmentlabs.ai" },
@@ -31,6 +31,7 @@ dependencies = [
  "google-genai",
  "boto3",
  "matplotlib>=3.10.3",
+ "datamodel-code-generator>=0.31.1",
  ]

  [project.urls]
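The new datamodel-code-generator dependency (also added to PKG-INFO above) backs the new src/update_types.sh, src/judgeval/data/scripts/openapi_transform.py, and generated src/judgeval/data/judgment_types.py files in this release. A minimal sketch of what that codegen step likely looks like, assuming a transformed spec named openapi.json (the input file name is illustrative; the actual pipeline lives in update_types.sh):

import subprocess

# Hypothetical invocation of the datamodel-codegen CLI. The input spec
# name is an assumption; the output path matches the generated module
# that appears in this diff (src/judgeval/data/judgment_types.py).
subprocess.run(
    [
        "datamodel-codegen",
        "--input", "openapi.json",
        "--input-file-type", "openapi",
        "--output", "src/judgeval/data/judgment_types.py",
    ],
    check=True,
)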
@@ -0,0 +1,60 @@
+ # logger.py
+
+ import logging
+ import sys
+ import os
+
+ # ANSI escape sequences
+ RESET = "\033[0m"
+ RED = "\033[31m"
+ YELLOW = "\033[33m"
+ BLUE = "\033[34m"
+ GRAY = "\033[90m"
+
+
+ class ColorFormatter(logging.Formatter):
+     """
+     Wrap the final formatted log record in ANSI color codes based on level.
+     """
+
+     COLORS = {
+         logging.DEBUG: GRAY,
+         logging.INFO: GRAY,
+         logging.WARNING: YELLOW,
+         logging.ERROR: RED,
+         logging.CRITICAL: RED,
+     }
+
+     def __init__(self, fmt=None, datefmt=None, use_color=True):
+         super().__init__(fmt=fmt, datefmt=datefmt)
+         self.use_color = use_color and sys.stdout.isatty()
+
+     def format(self, record):
+         message = super().format(record)
+         if self.use_color:
+             color = self.COLORS.get(record.levelno, "")
+             if color:
+                 message = f"{color}{message}{RESET}"
+         return message
+
+
+ def _setup_judgeval_logger():
+     use_color = sys.stdout.isatty() and os.getenv("NO_COLOR") is None
+     handler = logging.StreamHandler(sys.stdout)
+     handler.setLevel(logging.DEBUG)
+     handler.setFormatter(
+         ColorFormatter(
+             fmt="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+             datefmt="%Y-%m-%d %H:%M:%S",
+             use_color=use_color,
+         )
+     )
+
+     logger = logging.getLogger("judgeval")
+     logger.setLevel(logging.DEBUG)
+     logger.addHandler(handler)
+     return logger
+
+
+ # Global logger you can import elsewhere
+ judgeval_logger = _setup_judgeval_logger()
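This module replaces the 213-line logger removed from 0.0.52 (file 64 in the list above) with a single preconfigured judgeval_logger. The s3_storage.py hunks below show call sites migrating from the old info/warning helpers to it. A minimal usage sketch; the import path comes from the diff, the log messages are illustrative:

from judgeval.common.logger import judgeval_logger

# DEBUG/INFO render gray, WARNING yellow, ERROR/CRITICAL red per the
# COLORS map above; color is skipped when stdout is not a TTY or the
# NO_COLOR environment variable is set.
judgeval_logger.info("starting evaluation run")
judgeval_logger.warning("bucket already exists, skipping creation")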
@@ -4,7 +4,7 @@ import boto3
  from typing import Optional
  from datetime import datetime, UTC
  from botocore.exceptions import ClientError
- from judgeval.common.logger import warning, info
+ from judgeval.common.logger import judgeval_logger


  class S3Storage:
@@ -42,7 +42,6 @@ class S3Storage:
  error_code = e.response["Error"]["Code"]
  if error_code == "404":
  # Bucket doesn't exist, create it
- info(f"Bucket {self.bucket_name} doesn't exist, creating it ...")
  try:
  self.s3_client.create_bucket(
  Bucket=self.bucket_name,
@@ -52,14 +51,13 @@ class S3Storage:
  ) if self.s3_client.meta.region_name != "us-east-1" else self.s3_client.create_bucket(
  Bucket=self.bucket_name
  )
- info(f"Created S3 bucket: {self.bucket_name}")
  except ClientError as create_error:
  if (
  create_error.response["Error"]["Code"]
  == "BucketAlreadyOwnedByYou"
  ):
  # Bucket was just created by another process
- warning(
+ judgeval_logger.warning(
  f"Bucket {self.bucket_name} was just created by another process"
  )
  pass
@@ -90,8 +88,6 @@ class S3Storage:
  # Convert trace data to JSON string
  trace_json = json.dumps(trace_data)

- # Upload to S3
- info(f"Uploading trace to S3 at key {s3_key}, in bucket {self.bucket_name} ...")
  self.s3_client.put_object(
  Bucket=self.bucket_name,
  Key=s3_key,