PyPI - llm-diff - Versions diffs - 1.2.0__tar.gz → 1.2.2__tar.gz - Mend

llm-diff 1.2.0.tar.gz → 1.2.2.tar.gz

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (24)

{llm_diff-1.2.0 → llm_diff-1.2.2}/PKG-INFO RENAMED

@@ -1,11 +1,11 @@
 Metadata-Version: 2.4
 Name: llm-diff
-Version: 1.2.0
+Version: 1.2.2
 Summary: A CLI tool for comparing LLM outputs — semantically, visually, and at scale
-Project-URL: Homepage, https://github.com/sriramrathinavelu/llm-diff
-Project-URL: Repository, https://github.com/sriramrathinavelu/llm-diff
-Project-URL: Bug Tracker, https://github.com/sriramrathinavelu/llm-diff/issues
-Project-URL: Documentation, https://github.com/sriramrathinavelu/llm-diff/tree/main/docs
+Project-URL: Homepage, https://github.com/veerarag1973/llmdiff
+Project-URL: Repository, https://github.com/veerarag1973/llmdiff
+Project-URL: Bug Tracker, https://github.com/veerarag1973/llmdiff/issues
+Project-URL: Documentation, https://github.com/veerarag1973/llmdiff/tree/main/docs
 License: MIT
 License-File: LICENSE
 Keywords: claude,cli,diff,llm,openai,prompt-testing
@@ -24,6 +24,7 @@ Classifier: Topic :: Software Development :: Testing
 Requires-Python: >=3.9
 Requires-Dist: click>=8.1
 Requires-Dist: jinja2>=3.1
+Requires-Dist: llm-toolkit-schema>=1.1.0
 Requires-Dist: openai>=1.14
 Requires-Dist: python-dotenv>=1.0
 Requires-Dist: pyyaml>=6.0
@@ -46,9 +47,9 @@ Description-Content-Type: text/markdown
 **A CLI tool and Python library for comparing LLM outputs — semantically, visually, and at scale.**
-[![PyPI](https://img.shields.io/pypi/v/llm-diff)](https://pypi.org/project/llm-diff/)
-[![CI](https://img.shields.io/github/actions/workflow/status/sriramrathinavelu/llm-diff/ci.yml?branch=main)](https://github.com/sriramrathinavelu/llm-diff/actions)
-[![Coverage](https://img.shields.io/badge/coverage-100%25-brightgreen)](https://github.com/sriramrathinavelu/llm-diff)
+[![PyPI](https://img.shields.io/badge/PyPI-1.2.2-blue?logo=pypi&logoColor=white)](https://pypi.org/project/llm-diff/1.2.2/)
+[![Tests](https://img.shields.io/badge/tests-715%20passed-brightgreen)](https://pypi.org/project/llm-diff/)
+[![Coverage](https://img.shields.io/badge/coverage-100%25-brightgreen)](https://pypi.org/project/llm-diff/)
 [![Python](https://img.shields.io/pypi/pyversions/llm-diff)](https://pypi.org/project/llm-diff/)
 [![License](https://img.shields.io/pypi/l/llm-diff)](LICENSE)
 [![Status](https://img.shields.io/badge/status-production--stable-brightgreen)](CHANGELOG.md)
@@ -58,7 +59,9 @@ Description-Content-Type: text/markdown
 `llm-diff` calls two LLM models in parallel, diffs their responses word-by-word,
 scores them semantically, and renders results in the terminal or as a
 self-contained HTML report.  It scales to batch workloads, caches API responses,
-and gates CI pipelines via `--fail-under`.
+gates CI pipelines via `--fail-under`, and emits structured
+[llm-toolkit-schema](https://pypi.org/project/llm-toolkit-schema/) events for
+observability tooling.
 ## What is llm-diff?
@@ -76,13 +79,20 @@ threshold — making it a first-class citizen in CI/CD pipelines.
 Version 1.2 adds LLM-as-a-Judge scoring, per-call USD cost tracking,
 multi-model (3–4 model) comparison, and structured JSON diff.
+Version 1.2.2 integrates [llm-toolkit-schema](https://pypi.org/project/llm-toolkit-schema/)
+as a built-in observability layer: every comparison, model call, cache lookup,
+cost record, and judge evaluation now emits a validated schema event that can be
+collected in memory, exported to JSONL, or forwarded to any custom backend.
 ## Documentation
 | Guide | Description |
 |-------|-------------|
 | [Getting Started](docs/getting-started.md) | Installation, API keys, first diff |
+| [Tutorials](docs/tutorials/README.md) | Step-by-step learning path from first run to Python API (12 tutorials) |
 | [CLI Reference](docs/cli-reference.md) | All flags, option groups, exit codes, YAML format |
 | [Python API](docs/api.md) | All public functions, dataclasses, and field descriptions |
+| [Schema Events](docs/schema-events.md) | Observability integration with llm-toolkit-schema |
 | [Configuration](docs/configuration.md) | `.llmdiff` TOML schema, env vars, config priority |
 | [Provider Setup](docs/providers.md) | OpenAI, Groq, Mistral, Ollama, LM Studio, Anthropic |
 | [HTML Reports](docs/html-reports.md) | Report anatomy, batch reports, judge card, cost table |
@@ -94,6 +104,9 @@ multi-model (3–4 model) comparison, and structured JSON diff.
 # Install with semantic scoring support
 pip install "llm-diff[semantic]"
+# Install with schema-events observability
+pip install "llm-diff[semantic]" llm-toolkit-schema
 # Set an API key
 export OPENAI_API_KEY="sk-..."
@@ -107,18 +120,19 @@ llm-diff "Explain recursion." -a gpt-4o -b gpt-4o-mini --semantic --out report.h
 llm-diff --batch prompts.yml -a gpt-4o -b gpt-4o-mini --semantic --fail-under 0.85
 ```
-See [Getting Started](docs/getting-started.md) for more examples including
-prompt-diff mode, BLEU/ROUGE metrics, LLM-as-a-Judge, cost tracking, and
-multi-model comparison.
+See [Getting Started](docs/getting-started.md) for quick examples, or work through the
+[Tutorials](docs/tutorials/README.md) for a guided learning path covering prompt engineering,
+batch evaluation, CI/CD gating, LLM-as-a-Judge, cost tracking, and the Python API.
 ## Getting Help
 | | |
 |---|---|
-| **Bug reports** | [Open an issue](https://github.com/sriramrathinavelu/llm-diff/issues/new?labels=bug&template=bug_report.md) |
-| **Feature requests** | [Open a feature request](https://github.com/sriramrathinavelu/llm-diff/issues/new?labels=enhancement&template=feature_request.md) |
-| **Questions & discussion** | [GitHub Discussions](https://github.com/sriramrathinavelu/llm-diff/discussions) |
-| **Open issues** | [github.com/sriramrathinavelu/llm-diff/issues](https://github.com/sriramrathinavelu/llm-diff/issues) |
+| **Bug reports** | [Open an issue](https://github.com/veerarag1973/llmdiff/issues/new?labels=bug&template=bug_report.md) |
+| **Feature requests** | [Open a feature request](https://github.com/veerarag1973/llmdiff/issues/new?labels=enhancement&template=feature_request.md) |
+| **Questions & discussion** | [GitHub Discussions](https://github.com/veerarag1973/llmdiff/discussions) |
+| **Open issues** | [github.com/veerarag1973/llmdiff/issues](https://github.com/veerarag1973/llmdiff/issues) |
+| **PyPI project page** | [pypi.org/project/llm-diff](https://pypi.org/project/llm-diff/) |
 | **Roadmap** | [IMPLEMENTATION_PLAN.md](IMPLEMENTATION_PLAN.md) |
 | **Changelog** | [CHANGELOG.md](CHANGELOG.md) |

{llm_diff-1.2.0 → llm_diff-1.2.2}/README.md RENAMED

@@ -2,9 +2,9 @@
 **A CLI tool and Python library for comparing LLM outputs — semantically, visually, and at scale.**
-[![PyPI](https://img.shields.io/pypi/v/llm-diff)](https://pypi.org/project/llm-diff/)
-[![CI](https://img.shields.io/github/actions/workflow/status/sriramrathinavelu/llm-diff/ci.yml?branch=main)](https://github.com/sriramrathinavelu/llm-diff/actions)
-[![Coverage](https://img.shields.io/badge/coverage-100%25-brightgreen)](https://github.com/sriramrathinavelu/llm-diff)
+[![PyPI](https://img.shields.io/badge/PyPI-1.2.2-blue?logo=pypi&logoColor=white)](https://pypi.org/project/llm-diff/1.2.2/)
+[![Tests](https://img.shields.io/badge/tests-715%20passed-brightgreen)](https://pypi.org/project/llm-diff/)
+[![Coverage](https://img.shields.io/badge/coverage-100%25-brightgreen)](https://pypi.org/project/llm-diff/)
 [![Python](https://img.shields.io/pypi/pyversions/llm-diff)](https://pypi.org/project/llm-diff/)
 [![License](https://img.shields.io/pypi/l/llm-diff)](LICENSE)
 [![Status](https://img.shields.io/badge/status-production--stable-brightgreen)](CHANGELOG.md)
@@ -14,7 +14,9 @@
 `llm-diff` calls two LLM models in parallel, diffs their responses word-by-word,
 scores them semantically, and renders results in the terminal or as a
 self-contained HTML report.  It scales to batch workloads, caches API responses,
-and gates CI pipelines via `--fail-under`.
+gates CI pipelines via `--fail-under`, and emits structured
+[llm-toolkit-schema](https://pypi.org/project/llm-toolkit-schema/) events for
+observability tooling.
 ## What is llm-diff?
@@ -32,13 +34,20 @@ threshold — making it a first-class citizen in CI/CD pipelines.
 Version 1.2 adds LLM-as-a-Judge scoring, per-call USD cost tracking,
 multi-model (3–4 model) comparison, and structured JSON diff.
+Version 1.2.2 integrates [llm-toolkit-schema](https://pypi.org/project/llm-toolkit-schema/)
+as a built-in observability layer: every comparison, model call, cache lookup,
+cost record, and judge evaluation now emits a validated schema event that can be
+collected in memory, exported to JSONL, or forwarded to any custom backend.
 ## Documentation
 | Guide | Description |
 |-------|-------------|
 | [Getting Started](docs/getting-started.md) | Installation, API keys, first diff |
+| [Tutorials](docs/tutorials/README.md) | Step-by-step learning path from first run to Python API (12 tutorials) |
 | [CLI Reference](docs/cli-reference.md) | All flags, option groups, exit codes, YAML format |
 | [Python API](docs/api.md) | All public functions, dataclasses, and field descriptions |
+| [Schema Events](docs/schema-events.md) | Observability integration with llm-toolkit-schema |
 | [Configuration](docs/configuration.md) | `.llmdiff` TOML schema, env vars, config priority |
 | [Provider Setup](docs/providers.md) | OpenAI, Groq, Mistral, Ollama, LM Studio, Anthropic |
 | [HTML Reports](docs/html-reports.md) | Report anatomy, batch reports, judge card, cost table |
@@ -50,6 +59,9 @@ multi-model (3–4 model) comparison, and structured JSON diff.
 # Install with semantic scoring support
 pip install "llm-diff[semantic]"
+# Install with schema-events observability
+pip install "llm-diff[semantic]" llm-toolkit-schema
 # Set an API key
 export OPENAI_API_KEY="sk-..."
@@ -63,18 +75,19 @@ llm-diff "Explain recursion." -a gpt-4o -b gpt-4o-mini --semantic --out report.h
 llm-diff --batch prompts.yml -a gpt-4o -b gpt-4o-mini --semantic --fail-under 0.85
 ```
-See [Getting Started](docs/getting-started.md) for more examples including
-prompt-diff mode, BLEU/ROUGE metrics, LLM-as-a-Judge, cost tracking, and
-multi-model comparison.
+See [Getting Started](docs/getting-started.md) for quick examples, or work through the
+[Tutorials](docs/tutorials/README.md) for a guided learning path covering prompt engineering,
+batch evaluation, CI/CD gating, LLM-as-a-Judge, cost tracking, and the Python API.
 ## Getting Help
 | | |
 |---|---|
-| **Bug reports** | [Open an issue](https://github.com/sriramrathinavelu/llm-diff/issues/new?labels=bug&template=bug_report.md) |
-| **Feature requests** | [Open a feature request](https://github.com/sriramrathinavelu/llm-diff/issues/new?labels=enhancement&template=feature_request.md) |
-| **Questions & discussion** | [GitHub Discussions](https://github.com/sriramrathinavelu/llm-diff/discussions) |
-| **Open issues** | [github.com/sriramrathinavelu/llm-diff/issues](https://github.com/sriramrathinavelu/llm-diff/issues) |
+| **Bug reports** | [Open an issue](https://github.com/veerarag1973/llmdiff/issues/new?labels=bug&template=bug_report.md) |
+| **Feature requests** | [Open a feature request](https://github.com/veerarag1973/llmdiff/issues/new?labels=enhancement&template=feature_request.md) |
+| **Questions & discussion** | [GitHub Discussions](https://github.com/veerarag1973/llmdiff/discussions) |
+| **Open issues** | [github.com/veerarag1973/llmdiff/issues](https://github.com/veerarag1973/llmdiff/issues) |
+| **PyPI project page** | [pypi.org/project/llm-diff](https://pypi.org/project/llm-diff/) |
 | **Roadmap** | [IMPLEMENTATION_PLAN.md](IMPLEMENTATION_PLAN.md) |
 | **Changelog** | [CHANGELOG.md](CHANGELOG.md) |

llm_diff-1.2.2/llm_diff/__init__.py ADDED

@@ -0,0 +1,51 @@
+"""llm-diff — CLI tool for comparing LLM outputs."""
+from __future__ import annotations
+__version__ = "1.2.2"
+from llm_diff.api import ComparisonReport, compare, compare_batch, compare_prompts
+from llm_diff.diff import JsonStructDiffResult, json_struct_diff
+from llm_diff.judge import JudgeResult
+from llm_diff.multi import MultiModelReport, PairScore, run_multi_model
+from llm_diff.pricing import CostEstimate
+from llm_diff.schema_events import (
+    EventEmitter,
+    configure_emitter,
+    emit,
+    get_emitter,
+    make_cache_event,
+    make_comparison_completed_event,
+    make_comparison_started_event,
+    make_cost_recorded_event,
+    make_eval_scenario_event,
+    make_report_exported_event,
+    make_trace_span_event,
+)
+__all__ = [
+    "__version__",
+    "ComparisonReport",
+    "compare",
+    "compare_batch",
+    "compare_prompts",
+    "CostEstimate",
+    "JudgeResult",
+    "json_struct_diff",
+    "JsonStructDiffResult",
+    "MultiModelReport",
+    "PairScore",
+    "run_multi_model",
+    # Schema events
+    "EventEmitter",
+    "configure_emitter",
+    "emit",
+    "get_emitter",
+    "make_cache_event",
+    "make_comparison_completed_event",
+    "make_comparison_started_event",
+    "make_cost_recorded_event",
+    "make_eval_scenario_event",
+    "make_report_exported_event",
+    "make_trace_span_event",
+]
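
The `EventEmitter` re-exported here is defined in `llm_diff/schema_events.py`, further down in this diff. As a rough illustration of how it handles an exporter, the sketch below mimics the choice between an object with an `export` method and a plain callable. `MiniEmitter` is a hypothetical stand-in, not the package's class, and it leaves out the `event.validate()` step and the coroutine handling that the real `emit` performs.

```python
from typing import Any, Optional


class MiniEmitter:
    """Hypothetical stand-in for EventEmitter (no validation, no async support)."""

    def __init__(self, exporter: Optional[Any] = None, *, collect: bool = True) -> None:
        self._exporter = exporter
        self._collect = collect
        self._events: list = []

    @property
    def events(self) -> list:
        # Return a copy so callers cannot mutate the internal list.
        return list(self._events)

    def emit(self, event: Any) -> None:
        if self._collect:
            self._events.append(event)
        if self._exporter is None:
            return
        try:
            # Prefer an .export() method; otherwise treat the exporter as a callable.
            if hasattr(self._exporter, "export"):
                self._exporter.export(event)
            else:
                self._exporter(event)
        except Exception:
            # Export failures are swallowed so they never reach the caller.
            pass


# A plain list's append method serves as a minimal "custom backend".
forwarded: list = []
emitter = MiniEmitter(exporter=forwarded.append)
emitter.emit({"event_type": "demo.started"})
emitter.emit({"event_type": "demo.completed"})
```

After the two calls, both `emitter.events` and `forwarded` hold the same two dictionaries. The dictionaries stand in for schema `Event` objects, which this sketch does not construct.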

{llm_diff-1.2.0 → llm_diff-1.2.2}/llm_diff/api.py RENAMED

@@ -196,6 +196,24 @@ async def compare(
     """
     cfg = _resolve_config(config, temperature=temperature, max_tokens=max_tokens, timeout=timeout)
+    # Emit comparison started event (best-effort)
+    _started_event_id: str = ""
+    try:
+        from llm_diff.schema_events import (  # noqa: PLC0415
+            emit as schema_emit,
+            make_comparison_started_event,
+        )
+        started_evt = make_comparison_started_event(
+            model_a=model_a,
+            model_b=model_b,
+            prompt=prompt,
+        )
+        schema_emit(started_evt)
+        _started_event_id = started_evt.event_id
+    except Exception:  # noqa: BLE001
+        pass
     comparison = await compare_models(
         prompt_a=prompt,
         prompt_b=prompt,
@@ -222,6 +240,42 @@ async def compare(
     cost_a, cost_b = _compute_cost(comparison, show_cost=show_cost)
+    # Emit cost recorded events for each model call (best-effort)
+    if cost_a is not None:
+        try:
+            from llm_diff.schema_events import (  # noqa: PLC0415
+                emit as schema_emit,
+                make_cost_recorded_event,
+            )
+            schema_emit(
+                make_cost_recorded_event(
+                    input_cost=cost_a.prompt_usd,
+                    output_cost=cost_a.completion_usd,
+                    total_cost=cost_a.total_usd,
+                    model=cost_a.model,
+                )
+            )
+        except Exception:  # noqa: BLE001
+            pass
+    if cost_b is not None:
+        try:
+            from llm_diff.schema_events import (  # noqa: PLC0415
+                emit as schema_emit,
+                make_cost_recorded_event,
+            )
+            schema_emit(
+                make_cost_recorded_event(
+                    input_cost=cost_b.prompt_usd,
+                    output_cost=cost_b.completion_usd,
+                    total_cost=cost_b.total_usd,
+                    model=cost_b.model,
+                )
+            )
+        except Exception:  # noqa: BLE001
+            pass
     html_report: str | None = None
     if build_html:
         from llm_diff.report import build_report  # noqa: PLC0415
@@ -239,6 +293,26 @@ async def compare(
             cost_b=cost_b,
         )
+    # Emit comparison completed event (best-effort)
+    try:
+        from llm_diff.schema_events import (  # noqa: PLC0415
+            emit as schema_emit,
+            make_comparison_completed_event,
+        )
+        schema_emit(
+            make_comparison_completed_event(
+                model_a=model_a,
+                model_b=model_b,
+                diff_type="completion",
+                completion_diff=diff_result.as_unified_diff() or None,
+                similarity_score=diff_result.similarity,
+                base_event_id=_started_event_id,
+            )
+        )
+    except Exception:  # noqa: BLE001
+        pass
     return ComparisonReport(
         prompt_a=prompt,
         prompt_b=prompt,
@@ -295,6 +369,24 @@ async def compare_prompts(
     """
     cfg = _resolve_config(config, temperature=temperature, max_tokens=max_tokens, timeout=timeout)
+    # Emit comparison started event (best-effort) — diff_type is "prompt"
+    _started_event_id_p: str = ""
+    try:
+        from llm_diff.schema_events import (  # noqa: PLC0415
+            emit as schema_emit,
+            make_comparison_started_event,
+        )
+        started_evt = make_comparison_started_event(
+            model_a=model,
+            model_b=model,
+            prompt=prompt_a,
+        )
+        schema_emit(started_evt)
+        _started_event_id_p = started_evt.event_id
+    except Exception:  # noqa: BLE001
+        pass
     comparison = await compare_models(
         prompt_a=prompt_a,
         prompt_b=prompt_b,
@@ -342,6 +434,26 @@ async def compare_prompts(
             cost_b=cost_b,
         )
+    # Emit comparison completed event (prompt diff, best-effort)
+    try:
+        from llm_diff.schema_events import (  # noqa: PLC0415
+            emit as schema_emit,
+            make_comparison_completed_event,
+        )
+        schema_emit(
+            make_comparison_completed_event(
+                model_a=model,
+                model_b=model,
+                diff_type="prompt",
+                completion_diff=diff_result.as_unified_diff() or None,
+                similarity_score=diff_result.similarity,
+                base_event_id=_started_event_id_p,
+            )
+        )
+    except Exception:  # noqa: BLE001
+        pass
     return ComparisonReport(
         prompt_a=prompt_a,
         prompt_b=prompt_b,
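
Every hunk in `api.py` repeats the same best-effort shape: import lazily, build the event, emit it, and swallow any exception. The sketch below shows one way that shape could be factored into a helper. `emit_best_effort` and the dictionary events are hypothetical names for illustration only; the released code inlines the `try`/`except` at each call site and passes silently, whereas this sketch logs at debug level.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("best_effort_demo")


def emit_best_effort(build_event: Callable[[], Any], emit: Callable[[Any], None]) -> Any:
    """Build and emit an event; return it, or None if any step fails."""
    try:
        event = build_event()
        emit(event)
        return event
    except Exception:  # deliberately broad: telemetry must not break the comparison
        logger.debug("event emission skipped", exc_info=True)
        return None


collected: list = []

# Happy path: the event is built, emitted, and returned.
ok = emit_best_effort(lambda: {"event_id": "abc123"}, collected.append)


def broken_factory() -> dict:
    # Simulates a failure such as a missing optional dependency.
    raise RuntimeError("schema package unavailable")


# Failure path: nothing is emitted and the caller receives None.
failed = emit_best_effort(broken_factory, collected.append)

# Same fallback as the diff: an empty string when no started event exists.
started_event_id = ok["event_id"] if ok else ""
```

Returning the event lets the caller keep the started event's ID for the later `base_event_id` link, which is what the `_started_event_id` variable does in the hunks above.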

{llm_diff-1.2.0 → llm_diff-1.2.2}/llm_diff/cache.py RENAMED

@@ -146,13 +146,42 @@ class ResultCache:
         path = self._entry_path(key)
         if not path.is_file():
+            # Emit cache miss event
+            try:
+                from llm_diff.schema_events import emit as schema_emit, make_cache_event  # noqa: PLC0415
+                schema_emit(
+                    make_cache_event(
+                        hit=False,
+                        cache_key=key[:16],
+                        backend="disk",
+                    )
+                )
+            except Exception:  # noqa: BLE001
+                pass
             return None
         try:
             data = json.loads(path.read_text(encoding="utf-8"))
             from llm_diff.providers import ModelResponse  # noqa: PLC0415
-            return ModelResponse(**data)
+            cached_response = ModelResponse(**data)
+            # Emit cache hit event
+            try:
+                from llm_diff.schema_events import emit as schema_emit, make_cache_event  # noqa: PLC0415
+                schema_emit(
+                    make_cache_event(
+                        hit=True,
+                        cache_key=key[:16],
+                        backend="disk",
+                    )
+                )
+            except Exception:  # noqa: BLE001
+                pass
+            return cached_response
         except Exception:  # noqa: BLE001
             logger.warning("Cache entry for key %s is corrupt — ignoring.", key[:8])
             return None

{llm_diff-1.2.0 → llm_diff-1.2.2}/llm_diff/diff.py RENAMED

@@ -61,6 +61,42 @@ class DiffResult:
             "word_count_b": self.word_count_b,
         }
+    def as_unified_diff(self) -> str:
+        """Return a compact unified-diff string from the chunks.
+        The output is a lightweight diff that summarises the DELETE and INSERT
+        segments.  It is suitable for embedding in a schema
+        :class:`~llm_toolkit_schema.namespaces.diff.DiffPayload`.
+        """
+        lines: list[str] = ["--- model_a", "+++ model_b"]
+        for chunk in self.chunks:
+            if chunk.type == DiffType.DELETE:
+                for line in chunk.text.splitlines(keepends=True):
+                    lines.append(f"-{line}" if line.endswith("\n") else f"-{line}\n")
+            elif chunk.type == DiffType.INSERT:
+                for line in chunk.text.splitlines(keepends=True):
+                    lines.append(f"+{line}" if line.endswith("\n") else f"+{line}\n")
+        return "".join(lines) if len(lines) > 2 else ""
+    def to_schema_payload(self, base_event_id: str = "") -> dict:
+        """Return a dict conforming to the ``llm.diff.*`` namespace payload.
+        Compatible with
+        :class:`~llm_toolkit_schema.namespaces.diff.DiffPayload` field names.
+        Parameters
+        ----------
+        base_event_id:
+            ULID of the ``comparison.started`` event this result belongs to.
+        """
+        return {
+            "base_event_id": base_event_id,
+            "diff_type": "completion",
+            "prompt_diff": None,
+            "completion_diff": self.as_unified_diff() or None,
+            "similarity_score": round(self.similarity, 4),
+        }
 # ---------------------------------------------------------------------------
 # Tokenisation
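
The new `as_unified_diff` method keeps only the DELETE and INSERT chunks and prefixes each of their lines with `-` or `+`. The standalone sketch below reproduces that idea outside the class. `Chunk`, the `EQUAL` member, and the enum values are assumptions made for illustration; only the `DELETE` and `INSERT` names appear in the hunk. Note also that in the hunk the two header strings carry no trailing newline and are joined with `""`, so they appear to run together with the first change line. The sketch ends each header with `\n`, which is a guess at the intended output rather than the released behaviour.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class DiffType(Enum):
    EQUAL = "equal"    # assumed member, not shown in the hunk
    INSERT = "insert"
    DELETE = "delete"


@dataclass
class Chunk:
    """Hypothetical stand-in for the package's diff chunk type."""
    type: DiffType
    text: str


def as_unified_diff(chunks: List[Chunk]) -> str:
    header = ["--- model_a\n", "+++ model_b\n"]
    body: List[str] = []
    for chunk in chunks:
        if chunk.type is DiffType.DELETE:
            prefix = "-"
        elif chunk.type is DiffType.INSERT:
            prefix = "+"
        else:
            continue  # unchanged text is not reported
        for line in chunk.text.splitlines(keepends=True):
            # Make sure every emitted line ends with a newline.
            body.append(prefix + (line if line.endswith("\n") else line + "\n"))
    # Return an empty string when nothing was deleted or inserted.
    return "".join(header + body) if body else ""


chunks = [
    Chunk(DiffType.EQUAL, "Recursion is "),
    Chunk(DiffType.DELETE, "a function"),
    Chunk(DiffType.INSERT, "a technique"),
]
summary = as_unified_diff(chunks)
```

With the sample chunks, `summary` contains the two header lines followed by `-a function` and `+a technique`. Identical inputs yield an empty string, which is why the callers in `api.py` write `as_unified_diff() or None`.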

{llm_diff-1.2.0 → llm_diff-1.2.2}/llm_diff/judge.py RENAMED

@@ -114,6 +114,34 @@ class JudgeResult:
             "judge_model": self.judge_model,
         }
+    def to_schema_payload(self) -> dict:
+        """Return a dict conforming to the ``llm.eval.*`` namespace payload.
+        Compatible with
+        :class:`~llm_toolkit_schema.namespaces.eval.EvalPayload` field names.
+        The ``score`` is normalised to a ``0-1`` range from the ``1-10`` scale
+        returned by the judge prompt, so consumers always get a consistent range.
+        """
+        # Normalise scores: the judge returns 1-10; schema uses 0-1 by convention
+        # when `scale` is set accordingly.  We expose raw scores with proper scale.
+        avg_score: float = 0.0
+        scale = "1-10"
+        if self.score_a is not None and self.score_b is not None:
+            avg_score = (self.score_a + self.score_b) / 2.0
+        elif self.score_a is not None:
+            avg_score = self.score_a
+        elif self.score_b is not None:
+            avg_score = self.score_b
+        return {
+            "evaluator": self.judge_model or "unknown",
+            "score": avg_score,
+            "scale": scale,
+            "label": self.winner,
+            "rationale": self.reasoning,
+            "criteria": ["accuracy", "completeness", "clarity", "conciseness"],
+        }
 # ---------------------------------------------------------------------------
 # Parsing helpers
@@ -263,7 +291,7 @@ async def run_judge(
     except (TypeError, ValueError):
         pass
-    return JudgeResult(
+    result = JudgeResult(
         winner=winner,
         reasoning=reasoning,
         score_a=score_a,
@@ -271,3 +299,23 @@ async def run_judge(
         judge_model=judge_model,
         raw_response=raw,
     )
+    # Emit schema event for the evaluation
+    try:
+        from llm_diff.schema_events import make_eval_scenario_event, emit as schema_emit  # noqa: PLC0415
+        schema_emit(
+            make_eval_scenario_event(
+                evaluator=judge_model,
+                score=((score_a or 0.0) + (score_b or 0.0)) / 2.0 if (score_a or score_b) else None,
+                scale="1-10",
+                label=winner,
+                rationale=reasoning,
+                criteria=["accuracy", "completeness", "clarity", "conciseness"],
+                status="passed",
+            )
+        )
+    except Exception:  # noqa: BLE001
+        pass  # schema events are best-effort
+    return result
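
The two hunks in `judge.py` compute an average score in two different ways: `to_schema_payload` uses explicit `None` checks, while the event emitted from `run_judge` uses an inline `or` expression. The sketch below restates both as free functions so they can be compared side by side. `payload_score` and `event_score` are hypothetical names; the arithmetic is copied from the hunks.

```python
from typing import Optional


def payload_score(score_a: Optional[float], score_b: Optional[float]) -> float:
    """Mirror of the averaging in to_schema_payload."""
    if score_a is not None and score_b is not None:
        return (score_a + score_b) / 2.0
    if score_a is not None:
        return score_a
    if score_b is not None:
        return score_b
    return 0.0


def event_score(score_a: Optional[float], score_b: Optional[float]) -> Optional[float]:
    """Mirror of the inline expression passed to make_eval_scenario_event."""
    if not (score_a or score_b):
        return None
    return ((score_a or 0.0) + (score_b or 0.0)) / 2.0


both = (payload_score(8.0, 6.0), event_score(8.0, 6.0))
only_a = (payload_score(8.0, None), event_score(8.0, None))
neither = (payload_score(None, None), event_score(None, None))
```

When both scores are present the two agree. When only one is present, the payload reports that score unchanged while the event expression halves it, and when neither is present the payload reports `0.0` while the event reports `None`. Consumers comparing the two outputs may want to account for that difference. The docstring in the hunk also mentions normalising to a `0-1` range, while the code passes raw scores with `scale="1-10"`.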

{llm_diff-1.2.0 → llm_diff-1.2.2}/llm_diff/pricing.py RENAMED

@@ -158,6 +158,20 @@ class CostEstimate:
             "known_model": self.known_model,
         }
+    def to_schema_payload(self) -> dict:
+        """Return a dict conforming to the ``llm.cost.*`` namespace payload.
+        Compatible with
+        :class:`~llm_toolkit_schema.namespaces.cost.CostPayload` field names.
+        """
+        return {
+            "input_cost": round(self.prompt_usd, 6),
+            "output_cost": round(self.completion_usd, 6),
+            "total_cost": round(self.total_usd, 6),
+            "currency": "USD",
+            "pricing_tier": None,
+        }
     @property
     def total_usd_str(self) -> str:
         """Human-readable cost string (e.g. ``'$0.000250'``)."""

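The `to_schema_payload` method added to `CostEstimate` rounds each USD amount to six decimal places and fixes the currency to `"USD"`. The sketch below shows the resulting payload shape on a cut-down dataclass. `CostSketch` is a hypothetical stand-in, and computing `total_usd` as the sum of the two parts is an assumption, since the real class's definition of that attribute is not part of this hunk.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class CostSketch:
    """Hypothetical stand-in for CostEstimate with only the fields used here."""
    prompt_usd: float
    completion_usd: float

    @property
    def total_usd(self) -> float:
        # Assumption: total is the sum of prompt and completion cost.
        return self.prompt_usd + self.completion_usd

    def to_schema_payload(self) -> Dict[str, Any]:
        # Same keys and rounding as the method in the hunk above.
        return {
            "input_cost": round(self.prompt_usd, 6),
            "output_cost": round(self.completion_usd, 6),
            "total_cost": round(self.total_usd, 6),
            "currency": "USD",
            "pricing_tier": None,
        }


cost = CostSketch(prompt_usd=0.00012345678, completion_usd=0.00025)
payload = cost.to_schema_payload()
```

Because each field is rounded independently, `total_cost` can differ from `input_cost + output_cost` in the last decimal place.
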
{llm_diff-1.2.0 → llm_diff-1.2.2}/llm_diff/providers.py RENAMED

@@ -146,7 +146,7 @@ async def _call_model(
                     model,
                 )
-            return ModelResponse(
+            response_obj = ModelResponse(
                 model=model,
                 text=text,
                 prompt_tokens=usage.prompt_tokens if usage else 0,
@@ -156,6 +156,34 @@ async def _call_model(
                 provider=provider_name,
             )
+            # Emit schema trace span event (best-effort — never fails the call)
+            try:
+                from llm_diff.schema_events import (  # noqa: PLC0415
+                    emit as schema_emit,
+                    make_trace_span_event,
+                )
+                schema_emit(
+                    make_trace_span_event(
+                        model=model,
+                        prompt_tokens=response_obj.prompt_tokens,
+                        completion_tokens=response_obj.completion_tokens,
+                        total_tokens=response_obj.total_tokens,
+                        latency_ms=response_obj.latency_ms,
+                        finish_reason=(
+                            response.choices[0].finish_reason
+                            if response.choices
+                            else None
+                        ),
+                        stream=False,
+                        provider=provider_name,
+                    )
+                )
+            except Exception:  # noqa: BLE001
+                pass  # schema events are best-effort
+            return response_obj
         except asyncio.TimeoutError as exc:
             last_exc = exc
             logger.warning(

{llm_diff-1.2.0 → llm_diff-1.2.2}/llm_diff/report.py RENAMED

@@ -218,6 +218,23 @@ def save_report(html: str, path: Path) -> Path:
     path.parent.mkdir(parents=True, exist_ok=True)
     path.write_text(html, encoding="utf-8")
     logger.info("Report saved to %s (%d bytes)", path, len(html))
+    # Emit report exported schema event (best-effort)
+    try:
+        from llm_diff.schema_events import (  # noqa: PLC0415
+            emit as schema_emit,
+            make_report_exported_event,
+        )
+        schema_emit(
+            make_report_exported_event(
+                output_path=str(path),
+                format="html",
+            )
+        )
+    except Exception:  # noqa: BLE001
+        pass
     return path

llm_diff-1.2.2/llm_diff/schema_events.py ADDED

@@ -0,0 +1,622 @@
+"""llm-toolkit-schema integration for llm-diff.
+This module provides a thin, zero-configuration integration between llm-diff
+and the ``llm-toolkit-schema`` event envelope.  Every major operation in the
+diff pipeline — comparison started/completed, model trace spans, cache
+lookups, cost recording, and judge evaluations — now emits a structured,
+schema-validated :class:`~llm_toolkit_schema.Event`.
+Architecture
+------------
+A module-level :class:`EventEmitter` singleton collects events.  By default
+it operates in *sink* mode (events are built and validated but discarded).
+Call :func:`configure_emitter` once at startup to attach an exporter, e.g.::
+    from llm_diff.schema_events import configure_emitter
+    from llm_toolkit_schema.export.jsonl import JSONLExporter
+    configure_emitter(exporter=JSONLExporter("events.jsonl"))
+After that every comparison automatically appends schema-valid events to
+``events.jsonl``.
+Usage (library)
+---------------
+.. code-block:: python
+    import asyncio
+    from llm_diff import compare
+    from llm_diff.schema_events import configure_emitter, get_emitter
+    from llm_toolkit_schema.export.jsonl import JSONLExporter
+    configure_emitter(exporter=JSONLExporter("events.jsonl"))
+    asyncio.run(compare("Explain recursion", model_a="gpt-4o", model_b="claude-3-5-sonnet"))
+    events = get_emitter().events  # list of Event objects collected in memory
+"""
+from __future__ import annotations
+import dataclasses
+import logging
+import uuid
+from typing import TYPE_CHECKING, Any, Callable
+from llm_diff import __version__
+logger = logging.getLogger(__name__)
+# Source string embedded in every emitted event.
+_SOURCE = f"llm-diff@{__version__}"
+if TYPE_CHECKING:
+    from llm_toolkit_schema import Event
+# ---------------------------------------------------------------------------
+# Lazy import helpers — keep startup cost low
+# ---------------------------------------------------------------------------
+def _llm_toolkit() -> Any:
+    """Return the top-level ``llm_toolkit_schema`` module."""
+    import llm_toolkit_schema  # noqa: PLC0415
+    return llm_toolkit_schema
+def _event_cls() -> type:
+    return _llm_toolkit().Event
+def _tags_cls() -> type:
+    return _llm_toolkit().Tags
+def _event_type() -> Any:
+    return _llm_toolkit().EventType
+def _diff_ns() -> Any:
+    from llm_toolkit_schema.namespaces import diff as _diff  # noqa: PLC0415
+    return _diff
+def _trace_ns() -> Any:
+    from llm_toolkit_schema.namespaces import trace as _trace  # noqa: PLC0415
+    return _trace
+def _cache_ns() -> Any:
+    from llm_toolkit_schema.namespaces import cache as _cache  # noqa: PLC0415
+    return _cache
+def _cost_ns() -> Any:
+    from llm_toolkit_schema.namespaces import cost as _cost  # noqa: PLC0415
+    return _cost
+def _eval_ns() -> Any:
+    from llm_toolkit_schema.namespaces import eval_ as _eval  # noqa: PLC0415
+    return _eval
+def _ulid_or_empty() -> str:
+    return str(uuid.uuid4()).replace("-", "")[:26]
+# ---------------------------------------------------------------------------
+# EventEmitter
+# ---------------------------------------------------------------------------
+class EventEmitter:
+    """Collects and optionally exports llm-toolkit-schema :class:`Event` objects.
+    Parameters
+    ----------
+    exporter:
+        Any callable that accepts a single :class:`~llm_toolkit_schema.Event`
+        argument.  By default events are only collected in memory (see
+        :attr:`events`).  Pass a ``JSONLExporter`` or any compatible object
+        with an ``export`` method (or a plain callable) to also ship events
+        to an external backend.
+    collect:
+        When ``True`` (default), events are appended to the in-memory
+        :attr:`events` list.  Disable when memory overhead matters in
+        long-running processes.
+    """
+    def __init__(
+        self,
+        exporter: Callable[[Any], Any] | None = None,
+        *,
+        collect: bool = True,
+    ) -> None:
+        self._exporter = exporter
+        self._collect = collect
+        self._events: list[Any] = []
+    @property
+    def events(self) -> list[Any]:
+        """Read-only list of all :class:`~llm_toolkit_schema.Event` objects collected."""
+        return list(self._events)
+    def emit(self, event: Any) -> None:  # noqa: ANN401
+        """Validate and emit *event*.
+        Events that fail schema validation are logged as warnings and dropped.
+        If ``collect=True``, the event is appended to :attr:`events`.
+        If an *exporter* is configured, it is called with the event.
+        Errors during export are logged as warnings and do not propagate.
+        """
+        try:
+            event.validate()
+        except Exception as exc:  # noqa: BLE001
+            logger.warning("Schema validation failed for event %s: %s", event.event_type, exc)
+            return
+        if self._collect:
+            self._events.append(event)
+        if self._exporter is not None:
+            try:
+                # Support both callable exporters and object exporters with .export()
+                if hasattr(self._exporter, "export"):
+                    result = self._exporter.export(event)
+                    # Async exporters return an awaitable: schedule it on a running loop, otherwise run it to completion
+                    if hasattr(result, "__await__"):
+                        import asyncio  # noqa: PLC0415
+                        try:
+                            loop = asyncio.get_event_loop()
+                            if loop.is_running():
+                                loop.create_task(result)
+                            else:
+                                loop.run_until_complete(result)
+                        except RuntimeError:
+                            pass  # no event loop available — silently skip
+                else:
+                    self._exporter(event)
+            except Exception as exc:  # noqa: BLE001
+                logger.warning("Event export failed for %s: %s", event.event_type, exc)
+    def clear(self) -> None:
+        """Remove all collected events from memory."""
+        self._events.clear()
+# ---------------------------------------------------------------------------
+# Global emitter singleton
+# ---------------------------------------------------------------------------
+_emitter: EventEmitter = EventEmitter()
+def get_emitter() -> EventEmitter:
+    """Return the global :class:`EventEmitter` instance."""
+    return _emitter
+def configure_emitter(
+    exporter: Callable[[Any], Any] | None = None,
+    *,
+    collect: bool = True,
+) -> EventEmitter:
+    """Replace the global emitter with a new configured instance.
+    Call this exactly once at application startup before running any
+    comparisons.
+    Parameters
+    ----------
+    exporter:
+        Any callable or object with an ``export`` method that accepts a
+        :class:`~llm_toolkit_schema.Event`.
+    collect:
+        Whether to keep events in memory (default ``True``).
+    Returns
+    -------
+    EventEmitter
+        The newly installed global emitter.
+    """
+    global _emitter  # noqa: PLW0603
+    _emitter = EventEmitter(exporter=exporter, collect=collect)
+    return _emitter
+def emit(event: Any) -> None:  # noqa: ANN401
+    """Emit *event* through the global emitter."""
+    _emitter.emit(event)
+# ---------------------------------------------------------------------------
+# Event factory helpers
+# ---------------------------------------------------------------------------
+def _make_event(
+    event_type_value: str,
+    payload: dict[str, Any],
+    *,
+    trace_id: str | None = None,
+    span_id: str | None = None,
+    org_id: str | None = None,
+    session_id: str | None = None,
+    tags: dict[str, str] | None = None,
+) -> Any:
+    """Build a :class:`~llm_toolkit_schema.Event` from the given arguments."""
+    Event = _event_cls()
+    Tags = _tags_cls()
+    kwargs: dict[str, Any] = {
+        "event_type": event_type_value,
+        "source": _SOURCE,
+        "payload": payload,
+    }
+    if trace_id is not None:
+        kwargs["trace_id"] = trace_id
+    if span_id is not None:
+        kwargs["span_id"] = span_id
+    if org_id is not None:
+        kwargs["org_id"] = org_id
+    if session_id is not None:
+        kwargs["session_id"] = session_id
+    if tags:
+        kwargs["tags"] = Tags(**tags)
+    return Event(**kwargs)
+# ---------------------------------------------------------------------------
+# llm.diff.* — Comparison lifecycle events
+# ---------------------------------------------------------------------------
+def make_comparison_started_event(
+    *,
+    model_a: str,
+    model_b: str,
+    prompt: str,
+    session_id: str | None = None,
+    org_id: str | None = None,
+) -> Any:
+    """Build a ``llm.diff.comparison.started`` event.
+    Parameters
+    ----------
+    model_a:
+        Identifier of the first model (e.g. ``"gpt-4o"``).
+    model_b:
+        Identifier of the second model (e.g. ``"claude-3-5-sonnet"``).
+    prompt:
+        The full prompt text used for the comparison.
+    session_id:
+        Optional session identifier for correlation.
+    org_id:
+        Optional organisation identifier.
+    """
+    ET = _event_type()
+    payload: dict[str, Any] = {
+        "model_a": model_a,
+        "model_b": model_b,
+        "prompt_length": len(prompt),
+    }
+    return _make_event(
+        ET.DIFF_COMPARISON_STARTED,
+        payload,
+        session_id=session_id,
+        org_id=org_id,
+    )
+def make_comparison_completed_event(
+    *,
+    model_a: str,
+    model_b: str,
+    diff_type: str = "word-level",
+    prompt_diff: str | None = None,
+    completion_diff: str | None = None,
+    similarity_score: float | None = None,
+    base_event_id: str | None = None,
+    model_a_text: str | None = None,
+    model_b_text: str | None = None,
+    session_id: str | None = None,
+    org_id: str | None = None,
+) -> Any:
+    """Build a ``llm.diff.comparison.completed`` event with a DiffComparisonPayload."""
+    ET = _event_type()
+    ns = _diff_ns()
+    diff_result_dict: dict[str, Any] | None = None
+    if completion_diff:
+        diff_result_dict = {"unified_diff": completion_diff}
+    elif prompt_diff:
+        diff_result_dict = {"unified_diff": prompt_diff}
+    payload_obj = ns.DiffComparisonPayload(
+        source_id=base_event_id or model_a,
+        target_id=model_b,
+        diff_type=diff_type,
+        similarity_score=similarity_score,
+        source_text=model_a_text,
+        target_text=model_b_text,
+        diff_result=diff_result_dict,
+    )
+    payload = dataclasses.asdict(payload_obj)
+    return _make_event(
+        ET.DIFF_COMPARISON_COMPLETED,
+        payload,
+        session_id=session_id,
+        org_id=org_id,
+    )
+def make_report_exported_event(
+    *,
+    output_path: str,
+    format: str = "html",
+    comparison_event_id: str = "",
+    report_id: str | None = None,
+    session_id: str | None = None,
+    org_id: str | None = None,
+) -> Any:
+    """Build a ``llm.diff.report.exported`` event with DiffReportPayload."""
+    ET = _event_type()
+    ns = _diff_ns()
+    payload_obj = ns.DiffReportPayload(
+        report_id=report_id or _ulid_or_empty(),
+        comparison_event_id=comparison_event_id or _ulid_or_empty(),
+        format=format,
+        export_path=output_path,
+    )
+    payload = dataclasses.asdict(payload_obj)
+    return _make_event(
+        ET.DIFF_REPORT_EXPORTED,
+        payload,
+        session_id=session_id,
+        org_id=org_id,
+    )
+# ---------------------------------------------------------------------------
+# llm.trace.* — Model span events
+# ---------------------------------------------------------------------------
+def make_trace_span_event(
+    *,
+    model: str,
+    prompt_tokens: int,
+    completion_tokens: int,
+    total_tokens: int | None = None,
+    latency_ms: float,
+    finish_reason: str | None = None,
+    stream: bool = False,
+    provider: str | None = None,
+    cost_usd: float | None = None,
+    session_id: str | None = None,
+    org_id: str | None = None,
+) -> Any:
+    """Build a ``llm.trace.span.completed`` event with SpanCompletedPayload.
+    Parameters
+    ----------
+    model:
+        Model identifier string (e.g. ``"gpt-4o"``).
+    prompt_tokens:
+        Number of input tokens consumed.
+    completion_tokens:
+        Number of output tokens generated.
+    total_tokens:
+        Total token count; inferred from prompt + completion if ``None``.
+    latency_ms:
+        End-to-end request latency in milliseconds.
+    finish_reason:
+        Provider finish reason string (``"stop"``, ``"length"``, etc.).
+    stream:
+        Whether the response was streamed.
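+    provider:
+        Provider name for tagging (``"openai"``, ``"anthropic"``, etc.).
+    cost_usd:
+        Cost of the call in US dollars, if known.
+    session_id:
+        Optional session identifier for correlation.
+    org_id:
+        Optional organisation identifier.
+    """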
+    ET = _event_type()
+    ns = _trace_ns()
+    total = total_tokens if total_tokens is not None else prompt_tokens + completion_tokens
+    token_usage = ns.TokenUsage(
+        prompt_tokens=prompt_tokens,
+        completion_tokens=completion_tokens,
+        total_tokens=total,
+    )
+    model_info = ns.ModelInfo(
+        name=model,
+        provider=provider or "unknown",
+        version=None,
+    )
+    payload_obj = ns.SpanCompletedPayload(
+        span_name="llm-diff-model-call",
+        status="ok" if finish_reason != "error" else "error",
+        duration_ms=latency_ms,
+        model=model_info,
+        token_usage=token_usage,
+        cost_usd=cost_usd,
+    )
+    payload = dataclasses.asdict(payload_obj)
+    payload["finish_reason"] = finish_reason
+    payload["stream"] = stream
+    tags: dict[str, str] | None = None
+    if provider:
+        tags = {"provider": provider, "model": model}
+    return _make_event(
+        ET.TRACE_SPAN_COMPLETED,
+        payload,
+        session_id=session_id,
+        org_id=org_id,
+        tags=tags,
+    )
+# ---------------------------------------------------------------------------
+# llm.cache.* — Cache hit/miss events
+# ---------------------------------------------------------------------------
+def make_cache_event(
+    *,
+    hit: bool,
+    cache_key: str | None = None,
+    ttl_seconds: int | None = None,
+    backend: str = "disk",
+    latency_ms: float | None = None,
+    session_id: str | None = None,
+    org_id: str | None = None,
+) -> Any:
+    """Build a ``llm.cache.hit`` or ``llm.cache.miss`` event.
+    Parameters
+    ----------
+    hit:
+        ``True`` → ``CACHE_HIT``; ``False`` → ``CACHE_MISS``.
+    cache_key:
+        Opaque cache key used for lookup (first 16 chars of SHA-256 digest).
+    ttl_seconds:
+        Time-to-live of the cached entry, if known.
+    backend:
+        Cache backend name (default ``"disk"``).
+    latency_ms:
+        Cache lookup latency in milliseconds, if measured.
+    """
+    ET = _event_type()
+    ns = _cache_ns()
+    if hit:
+        payload_obj = ns.CacheHitPayload(
+            cache_key_hash=cache_key or "unknown",
+            cache_store=backend,
+            ttl_seconds=ttl_seconds,
+        )
+        event_type = ET.CACHE_HIT
+    else:
+        payload_obj = ns.CacheMissPayload(
+            cache_key_hash=cache_key or "unknown",
+            cache_store=backend,
+        )
+        event_type = ET.CACHE_MISS
+    payload = dataclasses.asdict(payload_obj)
+    if latency_ms is not None:
+        payload["latency_ms"] = latency_ms
+    return _make_event(event_type, payload, session_id=session_id, org_id=org_id)
+# ---------------------------------------------------------------------------
+# llm.cost.* — Cost recorded events
+# ---------------------------------------------------------------------------
+def make_cost_recorded_event(
+    *,
+    input_cost: float,
+    output_cost: float,
+    total_cost: float,
+    currency: str = "USD",
+    pricing_tier: str | None = None,
+    model: str | None = None,
+    provider: str | None = None,
+    prompt_tokens: int = 0,
+    completion_tokens: int = 0,
+    total_tokens: int = 0,
+    span_event_id: str | None = None,
+    session_id: str | None = None,
+    org_id: str | None = None,
+) -> Any:
+    """Build a ``llm.cost.recorded`` event with CostRecordedPayload."""
+    ET = _event_type()
+    ns = _cost_ns()
+    payload_obj = ns.CostRecordedPayload(
+        span_event_id=span_event_id or _ulid_or_empty(),
+        model_name=model or "unknown",
+        provider=provider or "unknown",
+        prompt_tokens=prompt_tokens,
+        completion_tokens=completion_tokens,
+        total_tokens=total_tokens or (prompt_tokens + completion_tokens),
+        cost_usd=total_cost,
+        currency=currency,
+    )
+    payload = dataclasses.asdict(payload_obj)
+    payload["input_cost_usd"] = input_cost
+    payload["output_cost_usd"] = output_cost
+    if pricing_tier is not None:
+        payload["pricing_tier"] = pricing_tier
+    return _make_event(ET.COST_RECORDED, payload, session_id=session_id, org_id=org_id)
+# ---------------------------------------------------------------------------
+# llm.eval.* — Judge / evaluation events
+# ---------------------------------------------------------------------------
+def make_eval_scenario_event(
+    *,
+    evaluator: str,
+    score: float | None = None,
+    scale: str = "1-10",
+    label: str | None = None,
+    rationale: str | None = None,
+    criteria: list[str] | None = None,
+    status: str = "passed",
+    duration_ms: float | None = None,
+    baseline_score: float | None = None,
+    session_id: str | None = None,
+    org_id: str | None = None,
+) -> Any:
+    """Build a ``llm.eval.scenario.completed`` event with EvalScenarioPayload.
+    Parameters
+    ----------
+    status:
+        Must be ``"passed"``, ``"failed"``, or ``"skipped"``.
+    """
+    ET = _event_type()
+    ns = _eval_ns()
+    metrics: dict[str, float] | None = None
+    if score is not None and criteria:
+        metrics = {c: score for c in criteria}
+    elif score is not None:
+        metrics = {"score": score}
+    scenario_name = f"llm-diff/{evaluator}"
+    if label:
+        scenario_name = f"{scenario_name}/{label}"
+    payload_obj = ns.EvalScenarioPayload(
+        scenario_id=_ulid_or_empty(),
+        scenario_name=scenario_name,
+        status=status,
+        score=score,
+        metrics=metrics,
+        baseline_score=baseline_score,
+        duration_ms=duration_ms,
+    )
+    payload = dataclasses.asdict(payload_obj)
+    payload["scale"] = scale
+    if rationale:
+        payload["rationale"] = rationale
+    if label:
+        payload["label"] = label
+    return _make_event(
+        ET.EVAL_SCENARIO_COMPLETED, payload, session_id=session_id, org_id=org_id
+    )
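
The exporter dispatch in `EventEmitter.emit` above can be sketched in isolation. This is a minimal stand-in, not the package's code: `FakeEvent`, `ListExporter`, and `dispatch` are hypothetical names introduced here, and the async branch is left out. It only illustrates the three rules visible in the hunk: validate first, prefer an `.export()` method over a direct call, and log export failures instead of raising.

```python
import logging
from typing import Any, List

logger = logging.getLogger("emitter-sketch")


class FakeEvent:
    """Hypothetical stand-in for an llm-toolkit-schema Event."""

    def __init__(self, event_type: str, valid: bool = True) -> None:
        self.event_type = event_type
        self._valid = valid

    def validate(self) -> None:
        if not self._valid:
            raise ValueError("invalid event")


class ListExporter:
    """Object-style exporter: exposes an export(event) method."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def export(self, event: Any) -> None:
        self.seen.append(event.event_type)


def dispatch(exporter: Any, event: Any) -> bool:
    """Return False if the event was dropped by validation, True otherwise."""
    try:
        event.validate()
    except Exception as exc:
        logger.warning("Schema validation failed for %s: %s", event.event_type, exc)
        return False
    try:
        # Objects with .export() take precedence over plain callables.
        if hasattr(exporter, "export"):
            exporter.export(event)
        else:
            exporter(event)
    except Exception as exc:
        # Export errors are logged and swallowed, as in the hunk above.
        logger.warning("Event export failed for %s: %s", event.event_type, exc)
    return True


obj_exporter = ListExporter()
seen_by_callable: List[Any] = []

dispatch(obj_exporter, FakeEvent("llm.cache.hit"))
dispatch(seen_by_callable.append, FakeEvent("llm.cache.miss"))
dropped = dispatch(obj_exporter, FakeEvent("llm.cost.recorded", valid=False))
```

Because validation runs before both collection and export, an invalid event reaches neither the in-memory list nor the exporter.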

{llm_diff-1.2.0 → llm_diff-1.2.2}/pyproject.toml RENAMED

@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 [project]
 name = "llm-diff"
-version = "1.2.0"
+version = "1.2.2"
 description = "A CLI tool for comparing LLM outputs — semantically, visually, and at scale"
 readme = "README.md"
 license = { text = "MIT" }
@@ -32,6 +32,7 @@ dependencies = [
     "tomli>=2.0; python_version < '3.11'",
     "jinja2>=3.1",
     "pyyaml>=6.0",
+    "llm-toolkit-schema>=1.1.0",
 ]
 [project.optional-dependencies]
@@ -53,10 +54,10 @@ dev = [
 llm-diff = "llm_diff.cli:main"
 [project.urls]
-Homepage = "https://github.com/sriramrathinavelu/llm-diff"
-Repository = "https://github.com/sriramrathinavelu/llm-diff"
-"Bug Tracker" = "https://github.com/sriramrathinavelu/llm-diff/issues"
-Documentation = "https://github.com/sriramrathinavelu/llm-diff/tree/main/docs"
+Homepage = "https://github.com/veerarag1973/llmdiff"
+Repository = "https://github.com/veerarag1973/llmdiff"
+"Bug Tracker" = "https://github.com/veerarag1973/llmdiff/issues"
+Documentation = "https://github.com/veerarag1973/llmdiff/tree/main/docs"
 [tool.hatch.build.targets.wheel]
 packages = ["llm_diff"]
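
The new `llm-toolkit-schema>=1.1.0` requirement can be checked at runtime with the standard library. The sketch below is an assumption-laden illustration, not part of the package: `installed_version`, `version_tuple`, and `meets_minimum` are hypothetical helpers, and the comparison only looks at leading numeric components, so it ignores pre-release and local version markers. A full PEP 440 comparison would need a dedicated parser.

```python
import itertools
from importlib import metadata
from typing import Optional, Tuple


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of dist_name, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version: str) -> Tuple[int, ...]:
    """Naive parse: keep the leading digits of each dot-separated piece."""
    parts = []
    for piece in version.split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(dist_name: str, minimum: str) -> bool:
    """True if dist_name is installed at or above minimum (numeric parts only)."""
    found = installed_version(dist_name)
    if found is None:
        return False
    have, want = version_tuple(found), version_tuple(minimum)
    width = max(len(have), len(want))
    # Pad with zeros so that "1.1" and "1.1.0" compare as equal.
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return have >= want


schema_ok = meets_minimum("llm-toolkit-schema", "1.1.0")
```

`schema_ok` is `False` both when the distribution is missing and when it is older than the declared minimum.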

llm_diff-1.2.0/llm_diff/__init__.py DELETED

@@ -1,26 +0,0 @@
-"""llm-diff — CLI tool for comparing LLM outputs."""
-from __future__ import annotations
-__version__ = "1.2.0"
-from llm_diff.api import ComparisonReport, compare, compare_batch, compare_prompts
-from llm_diff.diff import JsonStructDiffResult, json_struct_diff
-from llm_diff.judge import JudgeResult
-from llm_diff.multi import MultiModelReport, PairScore, run_multi_model
-from llm_diff.pricing import CostEstimate
-__all__ = [
-    "__version__",
-    "ComparisonReport",
-    "compare",
-    "compare_batch",
-    "compare_prompts",
-    "CostEstimate",
-    "JudgeResult",
-    "json_struct_diff",
-    "JsonStructDiffResult",
-    "MultiModelReport",
-    "PairScore",
-    "run_multi_model",
-]