llm-diff 1.2.2 (tar.gz) → 1.2.3 (tar.gz)

Files changed (23)

{llm_diff-1.2.2 → llm_diff-1.2.3}/PKG-INFO RENAMED

@@ -1,6 +1,6 @@
 Metadata-Version: 2.4
 Name: llm-diff
-Version: 1.2.2
+Version: 1.2.3
 Summary: A CLI tool for comparing LLM outputs — semantically, visually, and at scale
 Project-URL: Homepage, https://github.com/veerarag1973/llmdiff
 Project-URL: Repository, https://github.com/veerarag1973/llmdiff
@@ -47,8 +47,8 @@ Description-Content-Type: text/markdown
 **A CLI tool and Python library for comparing LLM outputs — semantically, visually, and at scale.**
-[![PyPI](https://img.shields.io/badge/PyPI-1.2.2-blue?logo=pypi&logoColor=white)](https://pypi.org/project/llm-diff/1.2.2/)
-[![Tests](https://img.shields.io/badge/tests-715%20passed-brightgreen)](https://pypi.org/project/llm-diff/)
+[![PyPI](https://img.shields.io/badge/PyPI-1.2.3-blue?logo=pypi&logoColor=white)](https://pypi.org/project/llm-diff/1.2.3/)
+[![Tests](https://img.shields.io/badge/tests-722%20passed-brightgreen)](https://pypi.org/project/llm-diff/)
 [![Coverage](https://img.shields.io/badge/coverage-100%25-brightgreen)](https://pypi.org/project/llm-diff/)
 [![Python](https://img.shields.io/pypi/pyversions/llm-diff)](https://pypi.org/project/llm-diff/)
 [![License](https://img.shields.io/pypi/l/llm-diff)](LICENSE)
@@ -79,10 +79,16 @@ threshold — making it a first-class citizen in CI/CD pipelines.
 Version 1.2 adds LLM-as-a-Judge scoring, per-call USD cost tracking,
 multi-model (3–4 model) comparison, and structured JSON diff.
+Version 1.2.3 adds `EVAL_REGRESSION_FAILED` schema event emission — `--fail-under`
+gate failures now emit a structured `llm.eval.regression.failed` event (via
+`make_eval_regression_event()`) in addition to returning exit code 1,
+providing a full audit trail for CI regression gates.
 Version 1.2.2 integrates [llm-toolkit-schema](https://pypi.org/project/llm-toolkit-schema/)
 as a built-in observability layer: every comparison, model call, cache lookup,
-cost record, and judge evaluation now emits a validated schema event that can be
-collected in memory, exported to JSONL, or forwarded to any custom backend.
+cost record, judge evaluation, and `--fail-under` regression failure now emits a
+validated schema event that can be collected in memory, exported to JSONL, or
+forwarded to any custom backend.
 ## Documentation

{llm_diff-1.2.2 → llm_diff-1.2.3}/README.md RENAMED

@@ -2,8 +2,8 @@
 **A CLI tool and Python library for comparing LLM outputs — semantically, visually, and at scale.**
-[![PyPI](https://img.shields.io/badge/PyPI-1.2.2-blue?logo=pypi&logoColor=white)](https://pypi.org/project/llm-diff/1.2.2/)
-[![Tests](https://img.shields.io/badge/tests-715%20passed-brightgreen)](https://pypi.org/project/llm-diff/)
+[![PyPI](https://img.shields.io/badge/PyPI-1.2.3-blue?logo=pypi&logoColor=white)](https://pypi.org/project/llm-diff/1.2.3/)
+[![Tests](https://img.shields.io/badge/tests-722%20passed-brightgreen)](https://pypi.org/project/llm-diff/)
 [![Coverage](https://img.shields.io/badge/coverage-100%25-brightgreen)](https://pypi.org/project/llm-diff/)
 [![Python](https://img.shields.io/pypi/pyversions/llm-diff)](https://pypi.org/project/llm-diff/)
 [![License](https://img.shields.io/pypi/l/llm-diff)](LICENSE)
@@ -34,10 +34,16 @@ threshold — making it a first-class citizen in CI/CD pipelines.
 Version 1.2 adds LLM-as-a-Judge scoring, per-call USD cost tracking,
 multi-model (3–4 model) comparison, and structured JSON diff.
+Version 1.2.3 adds `EVAL_REGRESSION_FAILED` schema event emission — `--fail-under`
+gate failures now emit a structured `llm.eval.regression.failed` event (via
+`make_eval_regression_event()`) in addition to returning exit code 1,
+providing a full audit trail for CI regression gates.
 Version 1.2.2 integrates [llm-toolkit-schema](https://pypi.org/project/llm-toolkit-schema/)
 as a built-in observability layer: every comparison, model call, cache lookup,
-cost record, and judge evaluation now emits a validated schema event that can be
-collected in memory, exported to JSONL, or forwarded to any custom backend.
+cost record, judge evaluation, and `--fail-under` regression failure now emits a
+validated schema event that can be collected in memory, exported to JSONL, or
+forwarded to any custom backend.
 ## Documentation

{llm_diff-1.2.2 → llm_diff-1.2.3}/llm_diff/__init__.py RENAMED

@@ -2,7 +2,7 @@
 from __future__ import annotations
-__version__ = "1.2.2"
+__version__ = "1.2.3"
 from llm_diff.api import ComparisonReport, compare, compare_batch, compare_prompts
 from llm_diff.diff import JsonStructDiffResult, json_struct_diff
@@ -18,6 +18,7 @@ from llm_diff.schema_events import (
     make_comparison_completed_event,
     make_comparison_started_event,
     make_cost_recorded_event,
+    make_eval_regression_event,
     make_eval_scenario_event,
     make_report_exported_event,
     make_trace_span_event,
@@ -45,6 +46,7 @@ __all__ = [
     "make_comparison_completed_event",
     "make_comparison_started_event",
     "make_cost_recorded_event",
+    "make_eval_regression_event",
     "make_eval_scenario_event",
     "make_report_exported_event",
     "make_trace_span_event",

{llm_diff-1.2.2 → llm_diff-1.2.3}/llm_diff/cli.py RENAMED

@@ -600,6 +600,28 @@ async def _run_batch(
                 f"[bold red]--fail-under {fail_under:.2f}: "
                 f"{len(failing)}/{len(batch_results)} item(s) below threshold.[/bold red]"
             )
+            from llm_diff.schema_events import (  # noqa: PLC0415
+                emit as schema_emit,
+                make_eval_regression_event,
+            )
+            for _r in failing:
+                _score = (
+                    _r.semantic_score
+                    if _r.semantic_score is not None
+                    else _r.diff_result.similarity
+                )
+                schema_emit(
+                    make_eval_regression_event(
+                        scenario_name="llm-diff/fail-under/batch",
+                        current_score=_score,
+                        baseline_score=float(fail_under),
+                        threshold=float(fail_under),
+                        metrics={"similarity": _r.diff_result.similarity}
+                        if _r.semantic_score is not None
+                        else None,
+                    )
+                )
             sys.exit(1)
     if out:
@@ -787,6 +809,22 @@ async def _run_diff(
                 f"[bold red]--fail-under {fail_under:.2f}: "
                 f"score {primary:.4f} is below threshold.[/bold red]"
             )
+            from llm_diff.schema_events import (  # noqa: PLC0415
+                emit as schema_emit,
+                make_eval_regression_event,
+            )
+            schema_emit(
+                make_eval_regression_event(
+                    scenario_name="llm-diff/fail-under/single",
+                    current_score=float(primary),
+                    baseline_score=float(fail_under),
+                    threshold=float(fail_under),
+                    metrics={"similarity": float(diff_result.similarity)}
+                    if semantic_score is not None
+                    else None,
+                )
+            )
             sys.exit(1)
     # ── Save HTML report ─────────────────────────────────────────────────────
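
Both `--fail-under` branches in `cli.py` pick the reported score the same way: the semantic score when one exists, otherwise the plain similarity, with the similarity attached as an extra metric only in the semantic case. The sketch below restates that selection logic as a standalone function. The function name `regression_event_args` and its flat argument list are hypothetical and not part of llm-diff; only the keyword names in the returned dict follow the diff.

```python
from __future__ import annotations


def regression_event_args(
    similarity: float,
    semantic_score: float | None,
    fail_under: float,
) -> dict:
    """Hypothetical helper mirroring the argument selection seen in the diff."""
    # Prefer the semantic score when it was computed; otherwise fall back
    # to the plain similarity.
    score = semantic_score if semantic_score is not None else similarity
    # The similarity is reported as a secondary metric only when the
    # semantic score is the primary one; otherwise no metrics are attached.
    metrics = (
        {"similarity": float(similarity)} if semantic_score is not None else None
    )
    return {
        "current_score": float(score),
        # The diff passes the --fail-under value as both baseline and threshold.
        "baseline_score": float(fail_under),
        "threshold": float(fail_under),
        "metrics": metrics,
    }


with_semantic = regression_event_args(0.5, 0.25, 0.75)
without_semantic = regression_event_args(0.5, None, 0.75)
```

In the batch path the same selection runs once per failing item, so a batch with several failures emits several events before the process exits with code 1.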

{llm_diff-1.2.2 → llm_diff-1.2.3}/llm_diff/schema_events.py RENAMED

@@ -620,3 +620,55 @@ def make_eval_scenario_event(
     return _make_event(
         ET.EVAL_SCENARIO_COMPLETED, payload, session_id=session_id, org_id=org_id
     )
+def make_eval_regression_event(
+    *,
+    scenario_name: str,
+    current_score: float,
+    baseline_score: float,
+    threshold: float,
+    metrics: dict[str, float] | None = None,
+    session_id: str | None = None,
+    org_id: str | None = None,
+) -> Any:
+    """Build a ``llm.eval.regression.failed`` event with EvalRegressionPayload.
+    Emitted when the ``--fail-under`` threshold is not met, indicating that
+    the primary similarity/semantic score has regressed below the minimum
+    acceptable level.
+    Parameters
+    ----------
+    scenario_name:
+        Human-readable name for the scenario that triggered the regression,
+        e.g. ``"llm-diff/fail-under/batch"`` or ``"llm-diff/fail-under/single"``.
+    current_score:
+        The actual similarity or semantic score that was measured.
+    baseline_score:
+        The minimum acceptable score (i.e. the ``--fail-under`` value).
+    threshold:
+        The ``--fail-under`` threshold value (same as *baseline_score* here).
+    metrics:
+        Optional mapping of metric names to values for richer diagnostics.
+    session_id:
+        Optional session identifier for correlation.
+    org_id:
+        Optional organisation identifier.
+    """
+    ET = _event_type()
+    ns = _eval_ns()
+    payload_obj = ns.EvalRegressionPayload(
+        scenario_id=_ulid_or_empty(),
+        scenario_name=scenario_name,
+        current_score=current_score,
+        baseline_score=baseline_score,
+        regression_delta=baseline_score - current_score,
+        threshold=threshold,
+        metrics=metrics,
+    )
+    payload = dataclasses.asdict(payload_obj)
+    return _make_event(
+        ET.EVAL_REGRESSION_FAILED, payload, session_id=session_id, org_id=org_id
+    )
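
As a rough illustration of the payload that `make_eval_regression_event()` assembles, the sketch below rebuilds the same fields with a local dataclass. `RegressionPayloadSketch` and `build_regression_payload` are hypothetical stand-ins, not names from llm-diff or llm-toolkit-schema. The `scenario_id` field and the event envelope are left out because they depend on private helpers (`_ulid_or_empty`, `_make_event`) whose behaviour is not shown in this diff.

```python
from __future__ import annotations

import dataclasses


@dataclasses.dataclass
class RegressionPayloadSketch:
    """Hypothetical stand-in for EvalRegressionPayload; field names follow the diff."""

    scenario_name: str
    current_score: float
    baseline_score: float
    regression_delta: float
    threshold: float
    metrics: dict[str, float] | None = None


def build_regression_payload(
    *,
    scenario_name: str,
    current_score: float,
    baseline_score: float,
    threshold: float,
    metrics: dict[str, float] | None = None,
) -> dict:
    payload = RegressionPayloadSketch(
        scenario_name=scenario_name,
        current_score=current_score,
        baseline_score=baseline_score,
        # A positive delta means the score fell below the baseline.
        regression_delta=baseline_score - current_score,
        threshold=threshold,
        metrics=metrics,
    )
    # As in the diff, the dataclass is flattened to a plain dict.
    return dataclasses.asdict(payload)


payload = build_regression_payload(
    scenario_name="llm-diff/fail-under/single",
    current_score=0.5,
    baseline_score=0.75,
    threshold=0.75,
)
```

Because the CLI passes the `--fail-under` value as both `baseline_score` and `threshold`, `regression_delta` in these events is simply the distance between the threshold and the measured score.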

{llm_diff-1.2.2 → llm_diff-1.2.3}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "llm-diff"
-version = "1.2.2"
+version = "1.2.3"
 description = "A CLI tool for comparing LLM outputs — semantically, visually, and at scale"
 readme = "README.md"
 license = { text = "MIT" }